Failure classification
When a node fails, Fabro classifies the failure into one of six categories. These classes drive retry decisions, circuit breaker logic, and edge routing.

| Class | Description | Examples |
|---|---|---|
| transient_infra | Temporary infrastructure problem — likely to resolve on retry | Rate limits, timeouts, network errors, 5xx responses |
| deterministic | Permanent failure — retrying won’t help | Authentication errors, bad configuration, invalid requests |
| budget_exhausted | Resource limit reached | Context length exceeded, token/turn limits, quota exhausted |
| compilation_loop | Reserved for loop detection | — |
| canceled | User or system cancellation | Cancel signal, abort |
| structural | Reserved for scope enforcement | Write scope violations |
failure_class is written to context after each stage, so you can route on it in edge conditions.
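For illustration, routing on failure_class might look like this minimal sketch. The edge_matches helper and the key=value condition form are assumptions modeled on the condition="outcome=fail" syntax used elsewhere in these docs, not Fabro's actual evaluator:

```python
# Illustrative sketch: matching an edge condition against the run context.
# The "key=value" condition form mirrors condition="outcome=fail";
# edge_matches is a hypothetical helper, not Fabro's evaluator.

def edge_matches(condition: str, context: dict) -> bool:
    """Return True if a "key=value" condition holds against the context."""
    key, _, expected = condition.partition("=")
    return str(context.get(key.strip())) == expected.strip()

context = {"failure_class": "transient_infra"}
edge_matches("failure_class=transient_infra", context)  # True: take the retry path
edge_matches("failure_class=deterministic", context)    # False: fall through
```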
Retry layers
Fabro retries failures at two levels: LLM retries handle transient API errors inside a single model call, and node retries re-execute the entire node handler when the first level isn’t enough. These layers are independent — a node retry re-runs the full handler, which gets its own fresh set of LLM retries.

LLM retries
Every LLM call (within an agent session or a one-shot prompt node) has a built-in retry loop for transient API errors. This is invisible to the workflow — it happens inside the model call itself.

| Setting | Default |
|---|---|
| Max retries | 3 |
| Initial delay | 1 second |
| Backoff multiplier | 2x |
| Max delay | 60 seconds |
| Jitter | 0.5x–1.5x random factor |
If the provider returns a Retry-After header, Fabro respects it — unless the delay exceeds 60 seconds, in which case the call fails rather than blocking the run.
Node retries
When a node handler fails (after LLM retries are exhausted), the engine can retry the entire node. This is controlled by retry policies.

Retry policies
Set a retry policy on a node with the retry_policy attribute:
| Policy | Max attempts | Initial delay | Backoff | Typical delays |
|---|---|---|---|---|
| none | 1 | — | — | No retries |
| standard | 5 | 200ms | 2x exponential | 200ms, 400ms, 800ms, 1.6s |
| aggressive | 5 | 500ms | 2x exponential | 500ms, 1s, 2s, 4s |
| linear | 3 | 500ms | 1x (constant) | 500ms, 500ms |
| patient | 3 | 2s | 3x exponential | 2s, 6s |
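The table's delay schedules can be reproduced with a small helper. The POLICIES mapping and the delays function are illustrative, with values transcribed from the table:

```python
# Named retry policies as (max_attempts, initial_delay_s, multiplier).
# Values transcribed from the policy table; the helper is illustrative.

POLICIES = {
    "none": (1, None, None),
    "standard": (5, 0.2, 2.0),
    "aggressive": (5, 0.5, 2.0),
    "linear": (3, 0.5, 1.0),
    "patient": (3, 2.0, 3.0),
}

def delays(policy: str) -> list[float]:
    """Backoff delays between attempts; N attempts yield N-1 delays."""
    attempts, initial, mult = POLICIES[policy]
    if initial is None:
        return []
    return [initial * mult ** i for i in range(attempts - 1)]

delays("standard")  # [0.2, 0.4, 0.8, 1.6]
```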
Setting retries without a policy
You can also set just the retry count using max_retries.
Resolution order
The engine resolves retry configuration in this order:

- Node attribute retry_policy — named preset
- Node attribute max_retries — count only, default backoff
- Graph attribute default_max_retry — applies to all nodes without explicit config (default: 3)
What gets retried
Not all errors trigger a node retry. The handler’s should_retry check must return true — generally, only errors classified as transient are retried. Deterministic errors (auth failures, bad config) fail immediately without consuming retry attempts.
When a handler returns a Retry status instead of Fail, retries always proceed (if attempts remain). If retries are exhausted and the node has allow_partial=true, the outcome is promoted to PartialSuccess instead of failing.
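A minimal sketch of this gate, assuming a should_retry helper keyed on failure class and handler status (the names and the RETRYABLE set are hypothetical):

```python
# Sketch of the should_retry gate: transient failures are retried,
# deterministic ones fail immediately, and an explicit Retry status
# always proceeds. Hypothetical helper, not Fabro's implementation.

RETRYABLE = {"transient_infra"}

def should_retry(failure_class: str, status: str) -> bool:
    if status == "retry":  # handler returned Retry instead of Fail
        return True
    return failure_class in RETRYABLE

should_retry("deterministic", "fail")    # False: fails immediately
should_retry("transient_infra", "fail")  # True: consumes a retry attempt
```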
Model fallbacks
When a model provider fails with a transient error or quota exhaustion, Fabro can automatically switch to a different provider. Configure fallback chains in your run configuration (run.toml).
What triggers failover
Failover is a superset of LLM retry eligibility:

| Error type | LLM retry | Provider failover |
|---|---|---|
| Rate limit | Yes | Yes |
| Server error (5xx) | Yes | Yes |
| Timeout / network | Yes | Yes |
| Quota exceeded | No | Yes |
| Authentication (401) | No | No |
| Invalid request (400) | No | No |
| Context length (413) | No | No |
| Content filter | No | No |
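The table reduces to two sets, with failover eligibility a strict superset of LLM-retry eligibility. The error-type names below are illustrative:

```python
# Sketch of failover eligibility: everything LLM-retryable fails over,
# plus quota exhaustion (which is not worth retrying on the same provider).
# Error-type names are illustrative.

LLM_RETRYABLE = {"rate_limit", "server_error", "timeout", "network"}
FAILOVER_ELIGIBLE = LLM_RETRYABLE | {"quota_exceeded"}

def can_fail_over(error_type: str) -> bool:
    return error_type in FAILOVER_ELIGIBLE

can_fail_over("quota_exceeded")  # True: failover only, no LLM retry
can_fail_over("authentication")  # False: deterministic, fails outright
```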
Loop detection
Fabro has two independent mechanisms for detecting stuck loops: node visit limits that catch workflow-level cycles, and tool call pattern detection that catches agent-level repetition.

Node visit limits
The max_node_visits graph attribute sets the maximum number of times any single node can execute before the run is terminated.
| Context | Default |
|---|---|
| Normal runs | Disabled (unlimited) |
| Dry runs (--dry-run) | 10 |
| Explicit max_node_visits | The configured value |
Per-node overrides
You can set max_visits on individual nodes to override the graph-level limit for that node. max_visits takes precedence over max_node_visits (and the dry-run default of 10). This is useful when specific nodes — like a fix-and-verify loop — should have a tighter limit than the rest of the workflow.
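Under these rules, the effective limit for a node can be sketched as follows (the visit_limit helper is hypothetical):

```python
# Sketch of visit-limit resolution: node-level max_visits wins over the
# graph-level max_node_visits, which wins over the dry-run default of 10.
# Normal runs default to unlimited (None). Hypothetical helper.

def visit_limit(max_visits=None, max_node_visits=None, dry_run=False):
    if max_visits is not None:
        return max_visits
    if max_node_visits is not None:
        return max_node_visits
    return 10 if dry_run else None  # None = unlimited

visit_limit(max_visits=3, max_node_visits=50)  # 3: per-node override wins
visit_limit(dry_run=True)                      # 10: dry-run default
```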
Tool call loop detection
Inside an agent session, Fabro monitors the last 10 assistant turns for repeating tool call patterns. It detects patterns of length 1 (same call repeated), 2 (A-B-A-B), or 3 (A-B-C-A-B-C). Every complete group in the window must match for detection to trigger. When a loop is detected, Fabro injects a steering message into the conversation:

WARNING: Loop detected. You appear to be repeating the same tool calls. Please try a different approach or ask for clarification.

This gives the agent a chance to break out of the loop without failing the node.
Failure signatures and circuit breakers
Failure signatures are a deduplication mechanism that prevents the same failure from recurring indefinitely across loop iterations. They are particularly important for workflows with retry loops (implement → verify → fix → verify → …).

How signatures work
After each failed node, Fabro constructs a failure signature — a normalized fingerprint combining the node ID, failure class, and error message. The message is normalized by replacing hex sequences with <hex>, replacing digits with <n>, and truncating to 240 characters. This groups failures with the same root cause even when details like line numbers or timestamps vary.
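A sketch of the normalization, assuming regex-based replacement (the exact patterns are assumptions; the docs name only the transformations):

```python
import re

# Sketch of failure-signature normalization: hex runs -> <hex>, digit
# runs -> <n>, truncate to 240 chars, then combine with node ID and
# failure class. The regexes are assumptions.

def failure_signature(node_id: str, failure_class: str, message: str) -> str:
    msg = re.sub(r"0x[0-9a-fA-F]+|\b[0-9a-f]{8,}\b", "<hex>", message)
    msg = re.sub(r"\d+", "<n>", msg)
    return f"{node_id}:{failure_class}:{msg[:240]}"

# Messages differing only in line numbers collapse to one signature:
failure_signature("verify", "deterministic", "assert failed at line 42")
failure_signature("verify", "deterministic", "assert failed at line 97")
```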
Circuit breaker
Fabro tracks signature counts across the run. When the same signature repeats 3 times (configurable via loop_restart_signature_limit), the run is terminated. Only deterministic and structural failures are tracked — transient failures are excluded because they may genuinely resolve on retry.
Failure signature counts are never reset on success. This is intentional — it prevents cycles like “implement succeeds → verify fails → fix → implement succeeds → verify fails” from running indefinitely.
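The breaker can be sketched as a per-signature counter that is never decremented (the class and method names are hypothetical):

```python
from collections import Counter

# Sketch of the circuit breaker: signature counts accumulate across the
# run (never reset on success) and trip at the signature limit (3 by
# default, per loop_restart_signature_limit). Only deterministic and
# structural failures are counted. Hypothetical class.

SIGNATURE_LIMIT = 3
TRACKED = {"deterministic", "structural"}

class CircuitBreaker:
    def __init__(self, limit: int = SIGNATURE_LIMIT):
        self.limit = limit
        self.counts = Counter()

    def record(self, signature: str, failure_class: str) -> bool:
        """Record a failure; return True if the breaker trips."""
        if failure_class not in TRACKED:
            return False  # transient failures are excluded
        self.counts[signature] += 1
        return self.counts[signature] >= self.limit
```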
Loop restart edges
Edges marked with loop_restart=true trigger a special restart of the workflow from the target node. These have an additional guard: only transient_infra failures may cross a loop_restart edge. If the failure class is anything else, the run is terminated. Loop restarts also maintain a separate counter (restart_failure_signatures) that enforces the same signature limit.
Goal gates
Goal gates are quality checkpoints that are enforced when the workflow reaches an exit node. A node marked with goal_gate=true must have completed with success or partial_success — otherwise the run cannot finish.

When a goal gate is unsatisfied, the engine looks for a retry target in this order:

- Failed node’s retry_target attribute
- Failed node’s fallback_retry_target attribute
- Graph-level retry_target attribute
- Graph-level fallback_retry_target attribute
Stall watchdog
Fabro runs a background watchdog that monitors event activity. If no events are emitted for longer than the stall timeout, the run is canceled. This catches cases where a handler hangs indefinitely without producing errors.

| Setting | Default |
|---|---|
| stall_timeout | 1800 seconds (30 minutes) |
| Set to 0 | Disables the watchdog |
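Such a watchdog can be sketched with a background thread. This is illustrative, not Fabro's implementation; a tiny timeout is used so the behavior is observable:

```python
import threading
import time

# Sketch of a stall watchdog: cancel the run if no event arrives within
# the stall timeout. Fabro's default is 1800s; the polling interval and
# class shape here are assumptions.

class StallWatchdog:
    def __init__(self, timeout: float, on_stall):
        self.timeout = timeout
        self.on_stall = on_stall
        self._last_event = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)

    def start(self):
        self._thread.start()

    def notify_event(self):
        """Called whenever the run emits an event."""
        self._last_event = time.monotonic()

    def stop(self):
        self._stop.set()

    def _watch(self):
        # Poll at a fraction of the timeout; fire on_stall once if stalled.
        while not self._stop.wait(self.timeout / 10):
            if time.monotonic() - self._last_event > self.timeout:
                self.on_stall()
                return
```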
When failures become fatal
A node failure does not automatically terminate the run. Fabro follows this escalation path:

- LLM retries — transient API errors are retried inside the model call (up to 3 retries)
- Provider failover — if configured, switch to a fallback provider
- Node retries — re-execute the entire handler (per the retry policy)
- Edge routing — if the node ultimately fails, look for an outgoing edge that matches (e.g., condition="outcome=fail")
- Retry target — if no matching edge exists, check retry_target/fallback_retry_target on the node and graph
- Run failure — if none of the above produces a path forward, the run terminates
Separately, some conditions terminate the run immediately:

- Node visit limit exceeded — a node has been visited too many times
- Circuit breaker tripped — the same failure signature has repeated too many times
- Loop restart blocked — a non-transient failure tried to cross a loop_restart edge
- Goal gate failure with no retry target — a required gate was unsatisfied at the exit node
- Stall timeout — no events for too long
- Cancellation — user or system cancel signal