Incidents

An incident is what Tallwatch opens when a monitor’s regions agree it’s down. It’s the unit your alerts, your status page, and your on-call rotation all hang off. This page covers its lifecycle and the cases where Tallwatch deliberately stays quiet.

Lifecycle

An incident moves through three states:

Open

Tallwatch decides the monitor is down once more than one region agrees, and opens the incident. The alert goes to the channels on the monitor’s escalation policy. A monitor has at most one open incident at a time.

Acknowledged

Someone acknowledges the incident from its detail page, signalling that a human is on it. Acknowledging is for coordination; it doesn’t change whether the monitor is up or down. It fires an incident.acknowledged event to the monitor’s channels (a PagerDuty acknowledge, a Slack, Discord, or Teams update, a webhook), so the rest of the team knows someone has it.

Resolved

The incident closes, either automatically or by hand (see Automated vs manual resolution below). The notifier sends the resolved alert.

Every transition is recorded as an event on the incident, so its detail page reads as a timeline: opened, who acknowledged, the regions that voted it down, which channels were notified, when it resolved.
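The lifecycle above can be sketched as a small state machine that appends an event on every transition. This is an illustrative model only, not Tallwatch’s implementation: the `Incident` class, the `ALLOWED` table, and the event field names are all hypothetical, and the sketch assumes an incident may resolve with or without being acknowledged first.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Assumed transition table: resolution is allowed from either earlier state,
# and a resolved incident is terminal.
ALLOWED = {
    State.OPEN: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: set(),
}


@dataclass
class Incident:
    monitor_id: str
    state: State = State.OPEN
    timeline: list = field(default_factory=list)

    def __post_init__(self):
        # Opening is itself the first event on the timeline.
        self._record("opened", actor="consensus")

    def _record(self, kind, actor):
        self.timeline.append(
            {"kind": kind, "actor": actor, "at": datetime.now(timezone.utc)}
        )

    def transition(self, new_state, actor):
        # Reject anything the table does not allow, e.g. resolved -> open.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot move from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self._record(new_state.value, actor)


incident = Incident("mon-1")
incident.transition(State.ACKNOWLEDGED, actor="alice")
incident.transition(State.RESOLVED, actor="consensus")
```

Because every state change goes through `transition`, the `timeline` list always reads in chronological order, which is the property the detail page relies on.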

The incident detail page

Beyond the lifecycle, the detail page is where you work an incident:

A timeline of events and comments. Post Markdown comments to discuss the incident, and @mention a teammate to pull them in. The system’s events (opened, acknowledged, per-region down and recovery, dispatches, resolved) interleave with your comments in one chronological feed.
A metadata table with the monitor, its type, the target, and the failing check details.
Per-region evidence. The regions that voted the monitor down, with the error class and status code each saw, plus a derived likely cause.
Replay. Re-run the monitor’s check on demand from the Tallwatch server to see its current result. A replay is a one-off diagnostic, not a regional vote, and isn’t recorded against the monitor.
Rename and remove. Give the incident a custom title, or delete it once it’s resolved. Only resolved incidents can be deleted, and the delete is a soft delete.
Alert delivery. Each dispatch records the exact message that was sent, so you can open a dispatch and see what landed in Slack, a webhook, or an inbox, on both success and failure.

Automated vs manual resolution

Automated. Consensus is the source of truth for a monitor’s state. Once enough regions agree the check is healthy again, the incident resolves on its own and the resolved alert fires. No timer to configure; recovery is just consensus running in reverse.

Manual. If you’ve fixed the problem and don’t want to wait for the next checks to confirm, resolve the incident from its page. To stop consensus from immediately reopening it on the same batch of still-failing probes, a manual resolve sets a five-minute cooldown on the monitor. During that window consensus won’t open a fresh incident, which gives recovery time to show up in the data.

Acknowledging and resolving aren’t dashboard-only. With a write-scoped API key you can ack or resolve an incident from a script, from CI, or by asking an AI agent through the MCP server. Both actions take an optional note that lands on the timeline.
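The cooldown after a manual resolve can be pictured as a simple time gate in front of incident creation. The sketch below is a minimal illustration under stated assumptions: the function name and its parameters are hypothetical, and treating the window as closed at exactly five minutes is a guess at the boundary, not documented behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The five-minute window described above.
COOLDOWN = timedelta(minutes=5)


def can_open_incident(
    now: datetime,
    manual_resolved_at: Optional[datetime],
    cooldown: timedelta = COOLDOWN,
) -> bool:
    """Return True if consensus may open a fresh incident at `now`.

    `manual_resolved_at` is the time of the most recent manual resolve on
    the monitor, or None if there has not been one.
    """
    if manual_resolved_at is None:
        return True
    # Assumed boundary: the gate reopens once the full cooldown has elapsed.
    return now - manual_resolved_at >= cooldown


resolved_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
still_cooling = can_open_incident(resolved_at + timedelta(minutes=2), resolved_at)
after_window = can_open_incident(resolved_at + timedelta(minutes=6), resolved_at)
```

Here `still_cooling` is `False` and `after_window` is `True`: failing probes seen two minutes after the manual resolve are ignored, while failures that persist past the window open a new incident.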

Post-mortems

After an incident resolves, you can keep a post-mortem on its detail page: a Markdown write-up of what happened and what you changed. Write it yourself, or click Generate with AI to draft one from the incident’s own evidence (the regions that failed, the probe transitions, the timeline) using Google Gemini, then edit it before you save. Either way it’s attached to the internal incident record, separate from anything you publish to a status page, so you can keep the candid internal account and the public update distinct.

AI generation needs a Gemini key configured on the backend. On a deployment without one, the Generate with AI action returns an error, but you can still write the post-mortem by hand.

When alerts are suppressed

Sometimes the right behavior is silence. Two cases suppress alerts on purpose:

Maintenance windows. Inside a scheduled maintenance window, consensus doesn’t open incidents at all, so nothing fires.
Dependency down. If a monitor lists a dependency that’s already down, its incident still opens but the alert is suppressed and recorded as suppressed. You get paged about the upstream cause, not every downstream symptom.

In both cases the suppression is visible in the record, so nothing is hidden. The alert was held back on purpose, and you can see that it was.
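The two suppression cases differ in where they cut in: a maintenance window prevents the incident from opening, while a down dependency lets the incident open but holds the alert. A minimal sketch of that decision follows. The `Outcome` names and the `decide` function are hypothetical, chosen only to make the ordering explicit.

```python
from enum import Enum


class Outcome(Enum):
    NO_INCIDENT = "no_incident"                  # maintenance window: nothing opens
    INCIDENT_ALERT_SUPPRESSED = "suppressed"     # incident opens, alert held back
    INCIDENT_ALERT_SENT = "sent"                 # incident opens, alert goes out


def decide(in_maintenance_window: bool, dependency_down: bool) -> Outcome:
    """Decide what happens when consensus judges a monitor to be down."""
    # Maintenance is checked first: with no incident, there is no alert to suppress.
    if in_maintenance_window:
        return Outcome.NO_INCIDENT
    # The upstream dependency is already down, so record the alert as suppressed.
    if dependency_down:
        return Outcome.INCIDENT_ALERT_SUPPRESSED
    return Outcome.INCIDENT_ALERT_SENT


outcome = decide(in_maintenance_window=False, dependency_down=True)
```

Returning an explicit outcome, rather than silently skipping the alert, mirrors the point above: the suppression itself is something the record can show.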
