CHAPTER 11 · Advanced · ~50 min

Error Handling, Retry and Logging

Production-grade error handling: Error Trigger, retry strategies, alerts and logging.

In this chapter

The difference between a workflow that runs in test and one that runs reliably 24/7 in production is error handling. In the real world, APIs sometimes return 500, the network drops, OpenAI rate-limits you, and the webhook source sends its data a second late. A solid workflow doesn't go silent or crash; it retries, branches, or at least lets you know. In this chapter you'll learn n8n's four core mechanisms: node-level Retry On Fail and Continue On Fail, the workflow-level On Error setting, the central Error Trigger workflow, and audit logs plus alerts for observability.

Topics

On Error workflow setting
Error Trigger: central error workflow
Retry On Fail: delays and max tries
Continue On Fail: allowing partial success
Alert flows: Slack/Telegram error notifications
Logging and audit trail

Three-layer defence: Node, Workflow, Account

n8n lets you catch errors in three places. (1) Node level: 'Retry On Fail' retries the same node 3-5 times; 'Continue On Fail' moves on to the next step even if the node fails. (2) Workflow level: in workflow settings pick an 'Error Workflow' — if any node fails, that workflow is triggered. (3) Account level: define a default error workflow for all workflows. The right strategy: Retry on critical nodes, Continue + IF branch on nodes that talk to external services, and an Error Workflow set per workflow.

HTTP Request (Retry: 3x) → IF (success?) → Continue / Error Trigger

Retry On Fail: picking the right parameters

Every node has a 'Retry On Fail' option under Settings. Three values matter. Max Tries (3-5): how many attempts to make. Wait Between Tries (ms): the delay between attempts. Instead of a constant 1000 ms, consider exponential backoff: 1s, 2s, 4s, 8s. n8n's built-in wait is constant, but you can implement backoff with a Wait + Code node. Which errors to retry: not all of them, only transient ones (429 rate limit, 502/503/504 gateway and availability errors, timeouts). Retrying 400/401/404 is pointless because the same request will fail the same way; branch with IF and route those to the Error Trigger.
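The backoff schedule and the retry decision described above can be sketched as two small helpers. This is an illustrative Python sketch, not n8n's own implementation: the function names, the `jitter_ms` option and the exact set of retryable codes are assumptions taken from the list in the text, and inside n8n the same logic would live in a Code node feeding a Wait node.

```python
import random

# Status codes treated as transient, following the list in the text.
RETRYABLE_STATUS = {429, 502, 503, 504}


def is_retryable(status_code=None, timed_out=False):
    """Return True only for transient failures that are worth another attempt."""
    return timed_out or status_code in RETRYABLE_STATUS


def backoff_delays_ms(max_tries, base_ms=1000, factor=2, jitter_ms=0):
    """Delays to wait before each retry: base, base*factor, base*factor^2, ...

    max_tries counts the first attempt too, so there are max_tries - 1 waits.
    jitter_ms (hypothetical option) adds up to that many random milliseconds
    so that many failing executions do not retry at the same instant.
    """
    delays = []
    for attempt in range(max_tries - 1):
        delay = base_ms * factor ** attempt
        if jitter_ms:
            delay += random.randint(0, jitter_ms)
        delays.append(delay)
    return delays


print(backoff_delays_ms(5))  # [1000, 2000, 4000, 8000]
print(is_retryable(503))     # True
print(is_retryable(404))     # False
```

With Max Tries set to 5 this reproduces the 1s, 2s, 4s, 8s schedule; a 400, 401 or 404 is never retried and should go to the IF branch instead.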

Continue On Fail: allowing partial success

In a Loop of 100 items, don't let item 73 stop the whole workflow. 'Continue On Fail' lets the flow move on to the next item even when one fails. An important nuance: with 'Continue' enabled, the failing node's output looks like { error: '...' }. Without an IF in the next step that checks whether an error is present, you'll process the failed record as if it had succeeded. Practical pattern: HTTP Request (Continue On Fail) → IF ($json.error ? error-branch : success-branch) → success rows to the DB, failures to a 'failed_records' table.

Loop → HTTP Request (Continue) → IF (error?) → DB Insert / Failed Records Log
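The IF step in the pattern above amounts to splitting items on the presence of an `error` field. A minimal Python sketch of that split follows; the item shape (an `id` plus either `status` or `error`) is a made-up example, and real node output will carry different fields.

```python
def split_by_error(items):
    """Separate items whose JSON carries an 'error' field from clean ones."""
    succeeded, failed = [], []
    for item in items:
        if item.get("error"):
            failed.append(item)
        else:
            succeeded.append(item)
    return succeeded, failed


# Hypothetical output of an HTTP Request node run with Continue On Fail.
results = [
    {"id": 72, "status": "ok"},
    {"id": 73, "error": "502 Bad Gateway"},
    {"id": 74, "status": "ok"},
]
ok_rows, failed_rows = split_by_error(results)
print(len(ok_rows), len(failed_rows))  # 2 1
```

The `ok_rows` list corresponds to the DB Insert branch and `failed_rows` to the 'failed_records' table, so item 73 is recorded as a failure rather than silently stored as a success.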

Error Trigger: a central error workflow

n8n's most powerful error mechanism is the Error Trigger node. Create a new workflow, place 'Error Trigger' as the first node, and save it. Then, in the settings of your other workflows, select this new one as the 'Error Workflow.' From then on, if any of their nodes fails, the Error Trigger fires and gives you: execution.id, execution.url (a direct link to the Editor view of the failure), workflow.name, the name of the failing node (execution.lastNodeExecuted), the error message (execution.error.message), and a timestamp. One error workflow can collect failures from all workflows in one place.

Error Trigger → Set (format message) → IF (severity) → Slack / Telegram / PagerDuty
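What the 'Set (format message)' step produces can be sketched in Python as below. The payload field names (`execution.lastNodeExecuted`, `execution.error.message`, and so on) follow the example payload in n8n's Error Trigger documentation as best I recall it, so treat them as an assumption and check them against the output of your own n8n version. The workflow name and URL in the sample are invented.

```python
from datetime import datetime, timezone


def format_alert(payload, now=None):
    """Build a one-message alert text from an Error Trigger style payload."""
    execution = payload.get("execution", {})
    workflow = payload.get("workflow", {})
    error = execution.get("error", {})
    now = now or datetime.now(timezone.utc)
    return "\n".join([
        f"Workflow: {workflow.get('name', 'unknown')}",
        f"Node: {execution.get('lastNodeExecuted', 'unknown')}",
        f"Error: {error.get('message', 'no message')}",
        f"Execution: {execution.get('url', 'n/a')}",
        f"Time: {now.isoformat()}",
    ])


# Hypothetical payload for illustration only.
sample = {
    "execution": {
        "id": "231",
        "url": "https://n8n.example.com/execution/231",
        "error": {"message": "Request failed with status code 503"},
        "lastNodeExecuted": "HTTP Request",
    },
    "workflow": {"id": "1", "name": "Order Sync"},
}
print(format_alert(sample))
```

The fallbacks ('unknown', 'n/a') keep the alert readable even when a field is missing, which is safer than letting the error workflow itself fail.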

Alert flow: which channel, when, how much?

Pick the alert channel by severity. Low (e.g. a single webhook 4xx): just log it, don't notify. Medium (5+ errors per hour): send a summary to Slack/Telegram. High (DB down, AI Agent fully broken): reach someone directly by phone, via PagerDuty or a Telegram voice call. To prevent alert spam, deduplicate: emit a given error at most once per 5 minutes (hash + Set + Redis/Postgres TTL). Every alert message must include the workflow name, the node name, the Editor URL (n8n.example.com/workflow/execution/<id>), a timestamp, and a sample error message. An empty 'workflow failed' alert helps nobody.
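The severity rules and the five-minute dedupe can be sketched together. In this Python sketch an in-memory dict stands in for the Redis/Postgres TTL store, and the class name, the `critical` flag and the choice of hashing workflow + node + message are all assumptions made for illustration.

```python
import hashlib
import time


def severity(errors_last_hour, critical=False):
    """Map the rules from the text to a level: low, medium or high."""
    if critical:
        return "high"
    return "medium" if errors_last_hour >= 5 else "low"


class AlertDeduper:
    """Allow a given error through at most once per ttl_seconds.

    The dict below is a stand-in for a Redis key with a TTL or a Postgres row.
    """

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._last_sent = {}

    @staticmethod
    def key(workflow, node, message):
        raw = f"{workflow}|{node}|{message}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def should_send(self, workflow, node, message):
        k = self.key(workflow, node, message)
        now = self.clock()
        last = self._last_sent.get(k)
        if last is not None and now - last < self.ttl_seconds:
            return False  # same error was already sent inside the window
        self._last_sent[k] = now
        return True


fake_now = [0.0]  # controllable clock so the example is deterministic
deduper = AlertDeduper(ttl_seconds=300, clock=lambda: fake_now[0])
print(deduper.should_send("Order Sync", "HTTP Request", "503"))  # True
fake_now[0] = 120
print(deduper.should_send("Order Sync", "HTTP Request", "503"))  # False
fake_now[0] = 301
print(deduper.should_send("Order Sync", "HTTP Request", "503"))  # True
```

Hashing the workflow, node and message together means two different failures in the same workflow still alert separately, while a flapping node produces one message per window.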

Logging and audit trail: what to write, where

n8n's own Executions screen is enough for recent history (for example the last 30 days, depending on your retention setting) but not for long-term audit. Practical fix: at the start and end of critical workflows, insert a Postgres/MongoDB write → { workflow, execution_id, started_at, ended_at, status, input_hash, output_summary }. Have the error workflow write a 'failed' row to the same table too. Then 'how many runs in the last 24h and how many failed' is a single SQL query. For inputs likely to contain PII, log field names and hashes, not raw data. If compliance requires it, store inputs/outputs in encrypted S3 and reference them from the table.
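The audit row and the 'runs versus failures' query can be sketched as follows. This Python example uses an in-memory SQLite database purely as a stand-in for Postgres; the table name `audit_log`, the sample rows and the fixed cutoff timestamp are assumptions, and the column names are the ones listed in the text.

```python
import hashlib
import json
import sqlite3


def input_hash(payload):
    """Stable SHA-256 of the input, so raw (possibly PII) data is never stored."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")  # stand-in for the Postgres audit database
conn.execute(
    """CREATE TABLE audit_log (
        workflow TEXT, execution_id TEXT, started_at TEXT, ended_at TEXT,
        status TEXT, input_hash TEXT, output_summary TEXT)"""
)

# Hypothetical rows: one written by the workflow itself, one by the error workflow.
rows = [
    ("Order Sync", "101", "2024-01-01T10:00:00", "2024-01-01T10:00:02",
     "success", input_hash({"order": 1}), "1 row written"),
    ("Order Sync", "102", "2024-01-01T11:00:00", "2024-01-01T11:00:01",
     "failed", input_hash({"order": 2}), "CRM returned 503"),
]
conn.executemany("INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# "How many runs in the last 24h and how many failed" as one query.
# The cutoff is passed in; in production it would be now minus 24 hours.
total, failed = conn.execute(
    """SELECT COUNT(*),
              COALESCE(SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END), 0)
       FROM audit_log
       WHERE started_at >= ?""",
    ("2024-01-01T00:00:00",),
).fetchone()
print(total, failed)  # 2 1
```

Serialising with sorted keys before hashing makes the hash independent of key order, so the same logical input always maps to the same `input_hash`.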

Dead Letter Queue pattern

Some errors can be fixed later, manually or automatically (e.g. the CRM was down; the user record should still be sent once it is back). Set up a Dead Letter Queue (DLQ) table: a Postgres 'failed_jobs' table with workflow_name, payload (JSON), error_message, retry_count, next_retry_at, status. On an Error Trigger event, write a row. A separate Schedule Trigger workflow runs every 10 minutes, picks the rows where next_retry_at has passed and retry_count < 5, replays them, and increments retry_count on failure. This pattern makes n8n resilient without a message queue.

Schedule (10 min) → Postgres SELECT failed_jobs → Loop → Replay Workflow → Update Status

Pre-production checklist: is your workflow ready?

Don't ship a workflow without ticking these 8 boxes.

(1) Is Retry On Fail enabled on every external API call?
(2) Do you know which errors retry and which branch through IF?
(3) Has an Error Workflow been assigned in Settings?
(4) Does the Error Workflow include channel + execution URL + workflow + node info?
(5) On Loops: Continue On Fail + failed-records table in place?
(6) Are timeouts and max payload sizes defined on Webhooks?
(7) Are the credentials really production (not test API keys)?
(8) Does the Executions retention setting (e.g. 30 days) plus your audit log retention meet your compliance needs?

Pass this list once per workflow before flipping it Active.

This chapter's workflow (n8n editor view)

Error Trigger → Set → Slack
