
Object Detection Basics

What object detection does, how bounding boxes work, what machine learning contributes, and practical tips for using detectors in robotics.



Sensors give you raw data — pixels, points, distances. Object detection turns that data into understanding: "There's a person at (2m, 0.5m), a chair at (1m, -1m), and a cup on the table."

This is the bridge between sensing and decision-making.

What Is Object Detection?

Object detection answers two questions simultaneously:

  1. What is in the image? (classification)
  2. Where is it? (localization)

The output is a list of detections, each containing:

  • Class: "person", "car", "dog", "cup"
  • Bounding box: Rectangle around the object (x, y, width, height)
  • Confidence score: How certain the detector is (0.0 to 1.0)

Detection Output Structure
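
The structure above can be sketched as a small Python dataclass (the field names here are illustrative, not a standard API):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    cls: str           # class label, e.g. "person"
    x: int             # top-left corner, in pixels
    y: int
    width: int         # box size, in pixels
    height: int
    confidence: float  # detector certainty, 0.0 to 1.0

detections = [
    Detection("person", x=120, y=40, width=80, height=200, confidence=0.95),
    Detection("cup", x=300, y=180, width=40, height=50, confidence=0.72),
]

for d in detections:
    print(f"{d.cls} at ({d.x}, {d.y}), {d.width}x{d.height}, conf={d.confidence:.2f}")
```

Real detectors (YOLO, SSD, and the rest) return the same three pieces of information, just packaged in their own container types.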

Bounding Boxes

A bounding box is the smallest rectangle that fully contains an object. It's defined by:

  • Top-left corner: (x, y) in pixel coordinates
  • Size: width and height in pixels

Some formats use:

  • (x_min, y_min, x_max, y_max) — two corners
  • (x_center, y_center, width, height) — center and size

Note

Bounding boxes are axis-aligned — they can't rotate. For a tilted object (like a book lying diagonally), the bbox includes empty space around it. For tighter fits, use rotated bounding boxes or segmentation masks.
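
Converting between the corner-based and center-based conventions is a frequent source of off-by-one and half-pixel bugs; a minimal sketch:

```python
def corners_to_center(x_min, y_min, x_max, y_max):
    """Convert (x_min, y_min, x_max, y_max) to (x_center, y_center, width, height)."""
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)

def center_to_corners(x_c, y_c, w, h):
    """Convert (x_center, y_center, width, height) back to two corners."""
    return (x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)

# The two functions are inverses of each other:
print(corners_to_center(10, 20, 50, 80))          # (30.0, 50.0, 40, 60)
print(center_to_corners(30.0, 50.0, 40, 60))      # (10.0, 20.0, 50.0, 80.0)
```

Whenever you pass boxes between libraries, check which convention each one expects; mixing them up produces boxes that are shifted by half their size.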

Visualizing Detections

Drawing Bounding Boxes
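
In practice you would usually call OpenCV's cv2.rectangle and cv2.putText for this; to keep the sketch dependency-light, here is the same idea with plain NumPy slicing (draw_box is an illustrative helper, not a library function):

```python
import numpy as np

def draw_box(image, x, y, w, h, color=(0, 255, 0), thickness=2):
    """Draw a rectangle outline on an HxWx3 image array, in place."""
    t = thickness
    image[y:y + t, x:x + w] = color          # top edge
    image[y + h - t:y + h, x:x + w] = color  # bottom edge
    image[y:y + h, x:x + t] = color          # left edge
    image[y:y + h, x + w - t:x + w] = color  # right edge
    return image

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a camera frame
draw_box(frame, x=120, y=40, w=80, h=200)        # green box around a "person"
```

With OpenCV the box and its label are one call each: cv2.rectangle for the outline and cv2.putText for the class name and confidence just above the box.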

What Machine Learning Does

You could write rules for detection:

  • "If there's a red circle, it's a stop sign"
  • "If there's a face-shaped blob, it's a person"

But this breaks down fast. What about stop signs in shadows? People wearing masks? Objects partially hidden?

Machine learning learns patterns from thousands of labeled examples:

  • Train on 10,000 images of cars from every angle, lighting, weather
  • The model learns "car-ness" — shapes, textures, contexts
  • It generalizes to new images it's never seen

Popular Object Detection Models

  Model                        Speed    Accuracy   Use Case
  YOLO (You Only Look Once)    Fast     Good       Real-time robotics (30+ FPS)
  SSD (Single Shot Detector)   Fast     Good       Mobile/embedded devices
  Faster R-CNN                 Slow     Excellent  High-precision tasks (inspection, medical)
  EfficientDet                 Medium   Excellent  Balanced accuracy/speed

Tip

For robots, YOLO is the go-to choice. It's fast enough for real-time video (60+ FPS on a GPU, 10+ FPS on CPU), accurate enough for most tasks, and has tons of pre-trained models available.

Confidence Scores and Thresholds

Every detection has a confidence score — the model's certainty that the detection is correct.

  • 0.95: Very confident (almost certainly correct)
  • 0.70: Moderately confident (probably correct)
  • 0.30: Low confidence (might be a false positive)

You set a threshold to filter detections:

  • High threshold (0.8+): Fewer false positives, but might miss real objects
  • Low threshold (0.5): Catch more objects, but more false alarms

The right threshold depends on your task:

  • Picking objects: High threshold (you don't want to grasp empty air)
  • Obstacle avoidance: Lower threshold (better to slow down for a false alarm than crash)

Tuning Confidence Thresholds
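
Thresholding is a one-line filter over the detection list. A sketch, assuming detections stored as plain dicts with a "confidence" key:

```python
def filter_detections(detections, threshold):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d["confidence"] >= threshold]

raw = [
    {"cls": "person", "confidence": 0.95},
    {"cls": "chair",  "confidence": 0.70},
    {"cls": "dog",    "confidence": 0.30},
]

print(filter_detections(raw, 0.8))  # grasping: only the person survives
print(filter_detections(raw, 0.5))  # obstacle avoidance: person and chair
```

The same detector output can feed both behaviors: run detection once, then apply a stricter threshold for manipulation and a looser one for navigation.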

Common Pitfalls

1. False Positives

The model detects something that isn't there:

  • Shadow looks like a person
  • Reflection in glass triggers a detection
  • Pattern on a shirt mistaken for a logo

Fix: Increase confidence threshold, use temporal filtering (require detection in 3+ consecutive frames).
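
Temporal filtering can be sketched as a per-class streak counter. This is a simplification: a real system would associate individual boxes across frames (tracking) rather than counting whole classes, and the TemporalFilter name is illustrative:

```python
from collections import defaultdict

class TemporalFilter:
    """Accept a class only after it appears in n consecutive frames."""

    def __init__(self, n_frames=3):
        self.n = n_frames
        self.streaks = defaultdict(int)  # class -> consecutive-frame count

    def update(self, classes_this_frame):
        seen = set(classes_this_frame)
        for cls in list(self.streaks):
            if cls not in seen:
                self.streaks[cls] = 0    # streak broken, start over
        for cls in seen:
            self.streaks[cls] += 1
        return {cls for cls in seen if self.streaks[cls] >= self.n}

filt = TemporalFilter(n_frames=3)
for frame_classes in (["person"], ["person"], ["person"]):
    confirmed = filt.update(frame_classes)
print(confirmed)  # third consecutive frame confirms "person"
```

A shadow that flickers into a single frame never reaches the streak length, so it never gets reported.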

2. False Negatives

The model misses real objects:

  • Object partially hidden
  • Unusual angle or lighting
  • Object not in the training data

Fix: Lower threshold, add more training data, use multiple camera angles.

3. Duplicate Detections

Same object detected twice with overlapping bboxes.

Fix: Apply Non-Maximum Suppression (NMS): when boxes overlap heavily, keep only the highest-confidence one and discard the rest.
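
Greedy NMS is short enough to sketch in full, assuming boxes in (x_min, y_min, x_max, y_max) form and detections as (box, confidence) pairs:

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.5):
    """Greedy NMS: walk detections from highest to lowest confidence,
    keeping each box only if it doesn't overlap an already-kept box
    by more than iou_threshold."""
    kept = []
    for box, conf in sorted(detections, key=lambda d: -d[1]):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, conf))
    return kept

dets = [
    ((10, 10, 50, 50), 0.9),      # two near-duplicate boxes...
    ((12, 12, 52, 52), 0.8),      # ...the weaker one gets suppressed
    ((200, 200, 240, 240), 0.7),  # a separate object, kept
]
print(nms(dets))
```

Detection libraries ship this built in (e.g. torchvision provides an NMS op), but the logic is exactly the loop above.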

What's Next?

Detection gives you objects in the image, but robots need 3D positions to interact with the world. The next lesson covers sensor fusion — combining camera detections with depth data (from LiDAR, stereo, etc.) to localize objects in 3D space.
