Error Handling and Recovery
Robots operate in the real world, where things go wrong. Sensors fail, wheels slip, obstacles appear, batteries die. A well-designed robot doesn't crash — it recovers gracefully.
Types of Failures
Robot failures fall into three categories:
1. Expected Failures (Recoverable)
These are failures you plan for:
- Battery runs low → navigate to charger
- Path is blocked → replan route
- Gripper fails to grasp → retry with adjusted force
How to handle: Build recovery into your state machine or BT. Add transitions or fallback nodes.
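One way to sketch these planned recoveries is as an explicit transition table. This is a minimal illustration, not any particular framework's API; the state and event names are made up:

```python
# Expected failures map to planned recovery states.
# State and event names here are hypothetical.
RECOVERY_TRANSITIONS = {
    ("NAVIGATING", "battery_low"):  "GOING_TO_CHARGER",
    ("NAVIGATING", "path_blocked"): "REPLANNING",
    ("GRASPING",   "grasp_failed"): "RETRY_GRASP",
}

def next_state(state, event):
    """Return the planned recovery state for a known failure, else stay put."""
    return RECOVERY_TRANSITIONS.get((state, event), state)
```

Because the table is data, adding a new recovery path is a one-line change rather than new control flow.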
2. Unexpected Failures (Potentially Recoverable)
These are failures you didn't anticipate but can detect:
- Sensor returns garbage data → use backup sensor or last known value
- Motor stalls unexpectedly → retry with lower speed
- Network connection drops → switch to local mode
How to handle: Use watchdogs, timeouts, and health monitors to detect anomalies. Transition to a safe state and attempt recovery.
3. Fatal Failures (Not Recoverable)
These require human intervention:
- Motor controller burns out
- Camera is physically damaged
- Robot tips over and can't right itself
How to handle: Transition to an ERROR state, stop all motion, log diagnostics, and alert a human.
Error States in State Machines
A common pattern is to add an ERROR state that any other state can transition to. The key is the ANY_STATE → ERROR transition: from any state, if a critical failure occurs, the machine moves to ERROR.
In the ERROR state:
- Stop all motors
- Log telemetry
- Publish error status
- Flash an LED or play a sound
- Wait for manual reset
Never try to continue the mission after a fatal error without human approval. A robot that ignores errors can damage itself, its environment, or people nearby.
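The ANY_STATE → ERROR pattern can be sketched as a small class; the state names and the stubbed motor-stop call are hypothetical:

```python
class RobotFSM:
    """Minimal sketch: any state can transition to ERROR on a critical failure."""

    def __init__(self):
        self.state = "IDLE"
        self.log = []

    def critical_failure(self, reason):
        # ANY_STATE -> ERROR: this runs no matter what state we were in.
        self.log.append((self.state, reason))  # log telemetry before losing context
        self.stop_all_motors()
        self.state = "ERROR"

    def stop_all_motors(self):
        pass  # hardware-specific; stubbed for illustration

    def reset(self):
        # Only an explicit manual reset leaves ERROR.
        if self.state == "ERROR":
            self.state = "IDLE"
```

Note that `critical_failure` records the state it was in *before* switching to ERROR; that context is exactly what you need when debugging later.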
Fallback Behaviors in Behavior Trees
Behavior trees handle errors with fallback nodes (another name for selectors). If the first child fails, the selector tries the next child.
How it works:
- Try the main path. If it's blocked → FAILURE.
- Try the alternate path. If it's also blocked → FAILURE.
- Fall back to requesting human assistance.
This pattern ensures the robot always has a fallback, even if the fallback is "ask for help."
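A selector can be sketched in a few lines of Python. The leaf behaviors below are stand-ins (real ones would command the robot); only the selector logic itself is the point:

```python
SUCCESS, FAILURE = "SUCCESS", "FAILURE"

def fallback(*children):
    """Selector: try children in order, return SUCCESS on the first that succeeds."""
    def tick():
        for child in children:
            if child() == SUCCESS:
                return SUCCESS
        return FAILURE
    return tick

# Hypothetical leaf behaviors for illustration:
take_main_path      = lambda: FAILURE   # blocked
take_alternate_path = lambda: FAILURE   # also blocked
ask_for_help        = lambda: SUCCESS   # "succeeds" by alerting a human

navigate = fallback(take_main_path, take_alternate_path, ask_for_help)
```

Ticking `navigate` tries each path in priority order and only reaches `ask_for_help` when everything above it has failed.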
Watchdogs
A watchdog is a timer that expects to be "reset" periodically. If the timer expires (because the main loop froze or a task hung), the watchdog triggers a safety action.
If the main loop hangs (infinite loop, deadlock, blocked I/O), it stops resetting the timer; the watchdog expires and triggers a safe shutdown.
Hardware watchdogs (on embedded systems) can reset the entire processor if software fails to reset the timer.
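A software watchdog can be sketched as follows. In practice `check` would run on a separate thread or timer interrupt; here it is called manually, and the safety action is whatever callback you pass in:

```python
import time

class Watchdog:
    """Software watchdog sketch: fire a safety action if not fed in time."""

    def __init__(self, timeout_s, on_expire):
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self.last_feed = time.monotonic()

    def feed(self):
        # The main loop calls this every iteration to prove it is alive.
        self.last_feed = time.monotonic()

    def check(self):
        # In a real system this runs on a separate thread or timer interrupt.
        if time.monotonic() - self.last_feed > self.timeout_s:
            self.on_expire()
```

Using `time.monotonic()` rather than `time.time()` matters here: the monotonic clock cannot jump backward if the system clock is adjusted.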
Timeouts
A timeout is a maximum time allowed for a task. If the task takes too long, it's considered failed.
If a DockAtCharger action takes more than 5 seconds, a timeout decorator wrapping it returns FAILURE, allowing the BT to try a fallback (e.g., retry docking or request help).
Timeouts prevent the robot from getting stuck waiting for something that will never happen.
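A timeout decorator for a BT can be sketched like this. The sketch assumes the usual tick-based BT convention where a long-running action returns RUNNING until it finishes; the decorator tracks when the child first started running:

```python
import time

SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"

def with_timeout(child, timeout_s):
    """BT timeout decorator sketch: fail the child if it runs too long."""
    start = {"t": None}  # mutable cell so the closure can reset it

    def tick():
        if start["t"] is None:
            start["t"] = time.monotonic()  # child just (re)started
        if time.monotonic() - start["t"] > timeout_s:
            start["t"] = None
            return FAILURE        # let the tree try a fallback
        status = child()
        if status != RUNNING:
            start["t"] = None     # child finished; reset for next activation
        return status

    return tick
```

Resetting the start time when the child finishes (or times out) means the decorator behaves correctly if the same subtree is activated again later.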
Graceful Degradation
Graceful degradation means the robot continues operating (at reduced capability) when a non-critical component fails.
Examples:
- Camera fails → switch to LiDAR-only navigation (slower but still works)
- One wheel encoder fails → estimate position using the other three wheels (less accurate but usable)
- Wi-Fi drops → cache commands locally and resync when connection returns
The key is to identify critical vs non-critical components:
| Critical (mission abort if lost) | Non-critical (degrade gracefully) |
|---|---|
| Motor controllers | Camera (if LiDAR is available) |
| Battery | Wi-Fi (if operating autonomously) |
| Emergency stop button | Speaker (for status messages) |
Design your robot so that losing a non-critical component triggers a degraded mode (e.g., "NAVIGATING_DEGRADED") rather than an ERROR state.
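The critical/non-critical split from the table above can be encoded as a mode-selection function. Component names are hypothetical; the point is that only critical losses reach ERROR:

```python
def select_mode(health):
    """Pick an operating mode from a component-health dict (hypothetical names)."""
    critical = ["motors", "battery", "estop"]
    if not all(health.get(c, False) for c in critical):
        return "ERROR"  # a critical component is down: abort the mission
    # Non-critical losses degrade instead of aborting.
    if not health.get("camera", True) and health.get("lidar", True):
        return "NAVIGATING_DEGRADED"  # LiDAR-only navigation
    return "NAVIGATING"
```

Running this check every control cycle lets the robot move in and out of degraded modes as components fail and recover.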
Retry Logic
Sometimes the best recovery is to try again. A retry decorator in a BT re-executes its child whenever it fails, up to a maximum number of attempts. Wrapping GraspObject in a retry decorator with a limit of 3 retries the grasp up to 3 times; if all attempts fail, the decorator returns FAILURE, allowing the BT to try a fallback (e.g., "ask a human to place the object in the gripper").
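A retry decorator can be sketched as below. For simplicity this version loops synchronously inside one call; a real BT would spread the attempts across ticks. The grasp action is a stand-in that happens to succeed on its third attempt:

```python
SUCCESS, FAILURE = "SUCCESS", "FAILURE"

def retry(child, max_attempts=3):
    """BT retry decorator sketch: re-run a failing child up to max_attempts."""
    def tick():
        for _ in range(max_attempts):
            if child() == SUCCESS:
                return SUCCESS
        return FAILURE
    return tick

# Hypothetical grasp action that succeeds on the third attempt:
attempts = {"n": 0}
def grasp_object():
    attempts["n"] += 1
    return SUCCESS if attempts["n"] >= 3 else FAILURE
```

Wrapping `grasp_object` in `retry(..., max_attempts=3)` absorbs the first two failures and reports SUCCESS to the parent node.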
When to retry:
- Transient failures (network glitch, momentary sensor noise)
- Operations with random elements (grasping, docking)
When NOT to retry:
- Fatal hardware failures (retrying a burned-out motor won't help)
- Deterministic failures (if the path is blocked, retrying navigation without replanning is pointless)
Add exponential backoff to retries: wait 1 second, then 2 seconds, then 4 seconds. This prevents hammering a failing component and gives transient issues time to resolve.
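A retry loop with exponential backoff can be sketched as follows. The `sleep` parameter is injectable purely so the delay schedule can be inspected or tested without actually waiting:

```python
import time

def retry_with_backoff(action, max_attempts=3, base_s=1.0, sleep=time.sleep):
    """Retry an action, doubling the wait after each failure: 1 s, 2 s, 4 s, ..."""
    for attempt in range(max_attempts):
        if action():
            return True
        if attempt < max_attempts - 1:
            sleep(base_s * (2 ** attempt))  # wait before the next attempt
    return False
```

Note there is no sleep after the final failure; at that point the caller should move on to a fallback rather than wait.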
Logging and Diagnostics
When an error occurs, log everything:
- Current state / active BT node
- Sensor readings
- Recent transitions
- Stack trace (if applicable)
This data is critical for debugging. Without it, you're guessing why the robot failed.
Many robotics frameworks use a ring buffer — a fixed-size log that overwrites old entries. This ensures you always have the last N seconds of data before the crash, without filling the disk.
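In Python, a ring buffer falls out of `collections.deque` with a `maxlen`; this sketch wraps it as a tiny diagnostics log:

```python
from collections import deque

class RingLog:
    """Fixed-size log sketch: keeps only the most recent entries."""

    def __init__(self, maxlen=1000):
        self.entries = deque(maxlen=maxlen)  # oldest entries drop off automatically

    def log(self, entry):
        self.entries.append(entry)

    def dump(self):
        # Called from the ERROR state to capture the moments before a failure.
        return list(self.entries)
```

Because `deque` discards the oldest entry on overflow, the log is bounded in memory no matter how long the robot runs.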
What's Next?
You've learned the foundations of robot decision-making: reactive vs deliberative behavior, state machines, behavior trees, and error handling. These tools form the "brain" of your robot — the logic that decides what to do next.
In the next module, we'll explore mapping and localization — how robots know where they are and build maps of their environment.