Object Detection Basics
Sensors give you raw data — pixels, points, distances. Object detection turns that data into understanding: "There's a person at (2m, 0.5m), a chair at (1m, -1m), and a cup on the table."
This is the bridge between sensing and decision-making.
What Is Object Detection?
Object detection answers two questions simultaneously:
- What is in the image? (classification)
- Where is it? (localization)
The output is a list of detections, each containing:
- Class: "person", "car", "dog", "cup"
- Bounding box: Rectangle around the object (x, y, width, height)
- Confidence score: How certain the detector is (0.0 to 1.0)
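A detection like this maps naturally onto a small data structure. Here's a minimal sketch in Python (the field names are illustrative, not from any particular library):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detection: what it is, where it is, and how sure the model is."""
    label: str          # class name, e.g. "person"
    box: tuple          # bounding box as (x, y, width, height) in pixels
    confidence: float   # model certainty, 0.0 to 1.0

det = Detection(label="person", box=(120, 80, 60, 150), confidence=0.92)
print(det.label, det.confidence)  # person 0.92
```

A real detector returns a list of these per frame, which you then filter and act on.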
Bounding Boxes
A bounding box is the smallest rectangle that fully contains an object. It's defined by:
- Top-left corner: (x, y) in pixel coordinates
- Size: width and height in pixels
Some formats use:
- (x_min, y_min, x_max, y_max) — two corners
- (x_center, y_center, width, height) — center and size
Bounding boxes are axis-aligned — they can't rotate. For a tilted object (like a book lying diagonally), the bbox includes empty space around it. For tighter fits, use rotated bounding boxes or segmentation masks.
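Converting between the two formats above is simple arithmetic. A quick sketch (helper names are made up for illustration):

```python
def corners_to_center(x_min, y_min, x_max, y_max):
    """(x_min, y_min, x_max, y_max) -> (x_center, y_center, width, height)."""
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)

def center_to_corners(xc, yc, w, h):
    """(x_center, y_center, width, height) -> (x_min, y_min, x_max, y_max)."""
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

print(corners_to_center(10, 20, 50, 100))  # (30.0, 60.0, 40, 80)
```

Watch out for this when mixing libraries: feeding corner-format boxes into code that expects center format silently shifts every box.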
Visualizing Detections
What Machine Learning Does
You could write rules for detection:
- "If there's a red circle, it's a stop sign"
- "If there's a face-shaped blob, it's a person"
But this breaks down fast. What about stop signs in shadows? People wearing masks? Objects partially hidden?
Machine learning learns patterns from thousands of labeled examples:
- Train on 10,000 images of cars from every angle, lighting, weather
- The model learns "car-ness" — shapes, textures, contexts
- It generalizes to new images it's never seen
Popular Object Detection Models
| Model | Speed | Accuracy | Use Case |
|---|---|---|---|
| YOLO (You Only Look Once) | Very fast | Good | Real-time robotics (30+ FPS) |
| SSD (Single Shot Detector) | Fast | Good | Mobile/embedded devices |
| Faster R-CNN | Slow | Excellent | High-precision tasks (inspection, medical) |
| EfficientDet | Medium | Excellent | Balanced accuracy/speed |
For robots, YOLO is the go-to choice. It's fast enough for real-time video (60+ FPS on a GPU, 10+ FPS on CPU), accurate enough for most tasks, and has tons of pre-trained models available.
Confidence Scores and Thresholds
Every detection has a confidence score — the model's certainty that the detection is correct.
- 0.95: Very confident (almost certainly correct)
- 0.70: Moderately confident (probably correct)
- 0.30: Low confidence (might be a false positive)
You set a threshold to filter detections:
- High threshold (0.8+): Fewer false positives, but might miss real objects
- Low threshold (0.5 or lower): Catch more objects, but more false alarms
The right threshold depends on your task:
- Picking objects: High threshold (you don't want to grasp empty air)
- Obstacle avoidance: Lower threshold (better to slow down for a false alarm than crash)
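Thresholding is a one-line filter. A sketch with detections as (label, confidence) pairs:

```python
def filter_detections(detections, threshold=0.8):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d[1] >= threshold]

dets = [("person", 0.95), ("chair", 0.70), ("dog", 0.30)]
print(filter_detections(dets, 0.8))  # picking: only the person survives
print(filter_detections(dets, 0.5))  # obstacle avoidance: person and chair
```

Same detector, same detections; the threshold you pick encodes how much your task tolerates each kind of mistake.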
Common Pitfalls
1. False Positives
The model detects something that isn't there:
- Shadow looks like a person
- Reflection in glass triggers a detection
- Pattern on a shirt mistaken for a logo
Fix: Increase confidence threshold, use temporal filtering (require detection in 3+ consecutive frames).
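The temporal-filtering idea can be sketched with a simple per-class streak counter (a toy version; real trackers also match boxes between frames rather than just class labels):

```python
from collections import defaultdict

class TemporalFilter:
    """Accept a class only after it appears in N consecutive frames."""
    def __init__(self, min_frames=3):
        self.min_frames = min_frames
        self.streaks = defaultdict(int)

    def update(self, detected_labels):
        """Feed one frame's set of labels; return labels considered stable."""
        for label in list(self.streaks):
            if label not in detected_labels:
                self.streaks[label] = 0  # streak broken, start over
        for label in detected_labels:
            self.streaks[label] += 1
        return {l for l, n in self.streaks.items() if n >= self.min_frames}

f = TemporalFilter(min_frames=3)
f.update({"person"})         # frame 1: not stable yet
f.update({"person"})         # frame 2: not stable yet
print(f.update({"person"}))  # frame 3: {'person'}
```

A one-frame glitch, like a shadow flickering into a "person" detection, never reaches three consecutive frames and gets ignored.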
2. False Negatives
The model misses real objects:
- Object partially hidden
- Unusual angle or lighting
- Object not in the training data
Fix: Lower threshold, add more training data, use multiple camera angles.
3. Duplicate Detections
Same object detected twice with overlapping bboxes.
Fix: Apply Non-Maximum Suppression (NMS): discard any box that heavily overlaps a higher-confidence box, so each object keeps only one detection.
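A minimal NMS sketch: measure overlap with IoU (intersection over union), then greedily keep the most confident boxes. This assumes a single class and boxes in (x_min, y_min, x_max, y_max) format:

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (box, confidence) pairs: highest confidence wins."""
    kept = []
    for box, conf in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, conf))
    return kept

dets = [((10, 10, 50, 50), 0.9),    # duplicate pair: these two overlap
        ((12, 12, 52, 52), 0.8),
        ((100, 100, 140, 140), 0.7)]  # separate object, far away
print(nms(dets))  # the 0.8 duplicate is suppressed; 0.9 and 0.7 survive
```

In practice you run NMS per class, so an overlapping "person" and "bicycle" don't suppress each other. Most detection libraries apply it for you, with the IoU threshold exposed as a parameter.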
What's Next?
Detection gives you objects in the image, but robots need 3D positions to interact with the world. The next lesson covers sensor fusion — combining camera detections with depth data (from LiDAR, stereo, etc.) to localize objects in 3D space.