Error Handling and Recovery
Robots operate in the real world, where things go wrong. Sensors fail, wheels slip, obstacles appear, batteries die. A well-designed robot doesn't crash — it recovers gracefully.
Types of Failures
Robot failures fall into three categories:
1. Expected Failures (Recoverable)
These are failures you plan for:
- Battery runs low → navigate to charger
- Path is blocked → replan route
- Gripper fails to grasp → retry with adjusted force
How to handle: Build recovery into your state machine or BT. Add transitions or fallback nodes.
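One way to sketch these planned recoveries is as an explicit transition table. This is a minimal illustration, not any particular framework's API; the state and event names are made up:

```python
# Expected failures map to planned recovery states.
# State and event names here are hypothetical.
RECOVERY_TRANSITIONS = {
    ("NAVIGATING", "battery_low"):  "GOING_TO_CHARGER",
    ("NAVIGATING", "path_blocked"): "REPLANNING",
    ("GRASPING",   "grasp_failed"): "RETRY_GRASP",
}

def next_state(state, event):
    """Return the planned recovery state for a known failure, else stay put."""
    return RECOVERY_TRANSITIONS.get((state, event), state)
```

Because the table is data, adding a new recovery path is a one-line change rather than new control flow.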
2. Unexpected Failures (Potentially Recoverable)
These are failures you didn't anticipate but can detect:
- Sensor returns garbage data → use backup sensor or last known value
- Motor stalls unexpectedly → retry with lower speed
- Network connection drops → switch to local mode
How to handle: Use watchdogs, timeouts, and health monitors to detect anomalies. Transition to a safe state and attempt recovery.
3. Fatal Failures (Not Recoverable)
These require human intervention:
- Motor controller burns out
- Camera is physically damaged
- Robot tips over and can't right itself
How to handle: Transition to an ERROR state, stop all motion, log diagnostics, and alert a human.
Error States in State Machines
A common pattern is to add an ERROR state that any other state can transition to. The key is the ANY_STATE → ERROR transition: from any state, if a critical failure occurs, the machine moves to ERROR.
In the ERROR state:
- Stop all motors
- Log telemetry
- Publish error status
- Flash an LED or play a sound
- Wait for manual reset
Never try to continue the mission after a fatal error without human approval. A robot that ignores errors can damage itself, its environment, or people nearby.
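The ANY_STATE → ERROR pattern can be sketched as a small class; the state names and the stubbed motor-stop call are hypothetical:

```python
class RobotFSM:
    """Minimal sketch: any state can transition to ERROR on a critical failure."""

    def __init__(self):
        self.state = "IDLE"
        self.log = []

    def critical_failure(self, reason):
        # ANY_STATE -> ERROR: this runs no matter what state we were in.
        self.log.append((self.state, reason))  # log telemetry before losing context
        self.stop_all_motors()
        self.state = "ERROR"

    def stop_all_motors(self):
        pass  # hardware-specific; stubbed for illustration

    def reset(self):
        # Only an explicit manual reset leaves ERROR.
        if self.state == "ERROR":
            self.state = "IDLE"
```

Note that `critical_failure` records the state it was in *before* switching to ERROR; that context is exactly what you need when debugging later.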
Fallback Behaviors in Behavior Trees
Behavior trees handle errors with fallback nodes (another name for selectors). If the first child fails, the selector tries the next child.
How it works:
- Try the main path. If it's blocked → FAILURE.
- Try the alternate path. If it's also blocked → FAILURE.
- Fall back to requesting human assistance.
This pattern ensures the robot always has a fallback, even if the fallback is "ask for help."
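A selector can be sketched in a few lines of Python. The leaf behaviors below are stand-ins (real ones would command the robot); only the selector logic itself is the point:

```python
SUCCESS, FAILURE = "SUCCESS", "FAILURE"

def fallback(*children):
    """Selector: try children in order, return SUCCESS on the first that succeeds."""
    def tick():
        for child in children:
            if child() == SUCCESS:
                return SUCCESS
        return FAILURE
    return tick

# Hypothetical leaf behaviors for illustration:
take_main_path      = lambda: FAILURE   # blocked
take_alternate_path = lambda: FAILURE   # also blocked
ask_for_help        = lambda: SUCCESS   # "succeeds" by alerting a human

navigate = fallback(take_main_path, take_alternate_path, ask_for_help)
```

Ticking `navigate` tries each path in priority order and only reaches `ask_for_help` when everything above it has failed.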
Watchdogs
A watchdog is a timer that expects to be "reset" periodically. If the timer expires (because the main loop froze or a task hung), the watchdog triggers a safety action.
If the main loop hangs (infinite loop, deadlock, blocked I/O), it stops resetting the timer; the watchdog expires and triggers a safe shutdown.
Hardware watchdogs (on embedded systems) can reset the entire processor if software fails to reset the timer.
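A software watchdog can be sketched as follows. In practice `check` would run on a separate thread or timer interrupt; here it is called manually, and the safety action is whatever callback you pass in:

```python
import time

class Watchdog:
    """Software watchdog sketch: fire a safety action if not fed in time."""

    def __init__(self, timeout_s, on_expire):
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self.last_feed = time.monotonic()

    def feed(self):
        # The main loop calls this every iteration to prove it is alive.
        self.last_feed = time.monotonic()

    def check(self):
        # In a real system this runs on a separate thread or timer interrupt.
        if time.monotonic() - self.last_feed > self.timeout_s:
            self.on_expire()
```

Using `time.monotonic()` rather than `time.time()` matters here: the monotonic clock cannot jump backward if the system clock is adjusted.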
Timeouts
A timeout is a maximum time allowed for a task. If the task takes too long, it's considered failed.
If a DockAtCharger action takes more than 5 seconds, a timeout decorator wrapping it returns FAILURE, allowing the BT to try a fallback (e.g., retry docking or request help).
Timeouts prevent the robot from getting stuck waiting for something that will never happen.
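A timeout decorator for a BT can be sketched like this. The sketch assumes the usual tick-based BT convention where a long-running action returns RUNNING until it finishes; the decorator tracks when the child first started running:

```python
import time

SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"

def with_timeout(child, timeout_s):
    """BT timeout decorator sketch: fail the child if it runs too long."""
    start = {"t": None}  # mutable cell so the closure can reset it

    def tick():
        if start["t"] is None:
            start["t"] = time.monotonic()  # child just (re)started
        if time.monotonic() - start["t"] > timeout_s:
            start["t"] = None
            return FAILURE        # let the tree try a fallback
        status = child()
        if status != RUNNING:
            start["t"] = None     # child finished; reset for next activation
        return status

    return tick
```

Resetting the start time when the child finishes (or times out) means the decorator behaves correctly if the same subtree is activated again later.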
Graceful Degradation
Graceful degradation means the robot continues operating (at reduced capability) when a non-critical component fails.
Examples:
- Camera fails → switch to LiDAR-only navigation (slower but still works)
- One wheel encoder fails → estimate position using the other three wheels (less accurate but usable)
- Wi-Fi drops → cache commands locally and resync when connection returns
The key is to identify critical vs non-critical components:
| Critical (mission abort if lost) | Non-critical (degrade gracefully) |
|---|---|
| Motor controllers | Camera (if LiDAR is available) |
| Battery | Wi-Fi (if operating autonomously) |
| Emergency stop button | Speaker (for status messages) |
Design your robot so that losing a non-critical component triggers a degraded mode (e.g., "NAVIGATING_DEGRADED") rather than an ERROR state.
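The critical/non-critical split from the table above can be encoded as a mode-selection function. Component names are hypothetical; the point is that only critical losses reach ERROR:

```python
def select_mode(health):
    """Pick an operating mode from a component-health dict (hypothetical names)."""
    critical = ["motors", "battery", "estop"]
    if not all(health.get(c, False) for c in critical):
        return "ERROR"  # a critical component is down: abort the mission
    # Non-critical losses degrade instead of aborting.
    if not health.get("camera", True) and health.get("lidar", True):
        return "NAVIGATING_DEGRADED"  # LiDAR-only navigation
    return "NAVIGATING"
```

Running this check every control cycle lets the robot move in and out of degraded modes as components fail and recover.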
Retry Logic
Sometimes the best recovery is to try again. A retry decorator in a BT re-executes its child whenever it fails, up to a maximum number of attempts. Wrapping GraspObject in a retry decorator with a limit of 3 retries the grasp up to 3 times; if all attempts fail, the decorator returns FAILURE, allowing the BT to try a fallback (e.g., "ask a human to place the object in the gripper").
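A retry decorator can be sketched as below. For simplicity this version loops synchronously inside one call; a real BT would spread the attempts across ticks. The grasp action is a stand-in that happens to succeed on its third attempt:

```python
SUCCESS, FAILURE = "SUCCESS", "FAILURE"

def retry(child, max_attempts=3):
    """BT retry decorator sketch: re-run a failing child up to max_attempts."""
    def tick():
        for _ in range(max_attempts):
            if child() == SUCCESS:
                return SUCCESS
        return FAILURE
    return tick

# Hypothetical grasp action that succeeds on the third attempt:
attempts = {"n": 0}
def grasp_object():
    attempts["n"] += 1
    return SUCCESS if attempts["n"] >= 3 else FAILURE
```

Wrapping `grasp_object` in `retry(..., max_attempts=3)` absorbs the first two failures and reports SUCCESS to the parent node.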
When to retry:
- Transient failures (network glitch, momentary sensor noise)
- Operations with random elements (grasping, docking)
When NOT to retry:
- Fatal hardware failures (retrying a burned-out motor won't help)
- Deterministic failures (if the path is blocked, retrying navigation without replanning is pointless)
Add exponential backoff to retries: wait 1 second, then 2 seconds, then 4 seconds. This prevents hammering a failing component and gives transient issues time to resolve.
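A retry loop with exponential backoff can be sketched as follows. The `sleep` parameter is injectable purely so the delay schedule can be inspected or tested without actually waiting:

```python
import time

def retry_with_backoff(action, max_attempts=3, base_s=1.0, sleep=time.sleep):
    """Retry an action, doubling the wait after each failure: 1 s, 2 s, 4 s, ..."""
    for attempt in range(max_attempts):
        if action():
            return True
        if attempt < max_attempts - 1:
            sleep(base_s * (2 ** attempt))  # wait before the next attempt
    return False
```

Note there is no sleep after the final failure; at that point the caller should move on to a fallback rather than wait.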
Logging and Diagnostics
When an error occurs, log everything:
- Current state / active BT node
- Sensor readings
- Recent transitions
- Stack trace (if applicable)
This data is critical for debugging. Without it, you're guessing why the robot failed.
Many robotics frameworks use a ring buffer — a fixed-size log that overwrites old entries. This ensures you always have the last N seconds of data before the crash, without filling the disk.
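In Python, a ring buffer falls out of `collections.deque` with a `maxlen`; this sketch wraps it as a tiny diagnostics log:

```python
from collections import deque

class RingLog:
    """Fixed-size log sketch: keeps only the most recent entries."""

    def __init__(self, maxlen=1000):
        self.entries = deque(maxlen=maxlen)  # oldest entries drop off automatically

    def log(self, entry):
        self.entries.append(entry)

    def dump(self):
        # Called from the ERROR state to capture the moments before a failure.
        return list(self.entries)
```

Because `deque` discards the oldest entry on overflow, the log is bounded in memory no matter how long the robot runs.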
What's Next?
You've learned the foundations of robot decision-making: reactive vs deliberative behavior, state machines, behavior trees, and error handling. These tools form the "brain" of your robot — the logic that decides what to do next.
In the next module, we'll explore mapping and localization — how robots know where they are and build maps of their environment.