
Operations

What an operator / SOC analyst needs to run the platform day‑to‑day: console workflows, mitigation modes, runbooks for common incidents, on‑call escalation.

1. Console at a glance

Dashboard: KPI bar (active incidents, detect→mitigate p95, false positives 24 h, sensor coverage), incident trend (24 h / 7 d / 30 d), active incidents table, network segments grid, sensor status panel. Polls every 5 s.
Incidents: Active / All / Archive tabs, full search & severity/status filters, click → modal with attacker/victim/MITRE/tier/rules + Quarantine / Acknowledge / Release / Export Forensics / Add to Allow‑list actions.
Segments: per‑VLAN host & binding counts, ARP rate vs. baseline, active incidents.
Sensors: frames/24 h, p95 latency, VLAN coverage, online/degraded/offline.
Binding Database: IP↔MAC↔Port↔VLAN truth table; protected / baseline / learned.
Policies: per‑segment mitigation mode (MONITOR / GUARDED / ENFORCE).
Audit Log: append‑only mitigation history with timestamp, operator, action, result.
Users: admin only; create / delete / change role.
Settings: global automation flags, retention, SIEM/email integration.

Default demo accounts (rotate before going live): admin/ARPG_ADMIN_PASSWORD, analyst/analyst123, responder/responder123, viewer/viewer123.

2. Mitigation modes

Set per segment in Policies or globally via the correlator start flag (MODE=).

Mode	Auto‑acts	Use when
`MONITOR`	nothing (alert‑only)	Bring‑up, low confidence in baseline, audit periods
`GUARDED`	L1 corrective ARP	Default safe production — fixes the segment, doesn't quarantine
`ENFORCE`	L1 + L2 NAC quarantine	Hardened sites with working RADIUS CoA / NAC actuator

Auto‑mitigation needs a deterministic Tier‑1 or Tier‑2 hit. Tier‑3 anomaly or ML‑shadow findings never auto‑act — they raise severity and require an analyst to push Quarantine. TTL auto‑revert returns the segment to its prior state if the threat doesn't recur.
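The gating rule above can be summarised as a small decision function. This is a minimal sketch for illustration, not the controller's actual code: the names `Mode`, `auto_actions`, the action labels, and the integer tier encoding are all assumptions.

```python
from enum import Enum


class Mode(Enum):
    MONITOR = "MONITOR"
    GUARDED = "GUARDED"
    ENFORCE = "ENFORCE"


# Hypothetical action labels; the real controller may name these differently.
L1_CORRECTIVE_ARP = "L1_CORRECTIVE_ARP"
L2_NAC_QUARANTINE = "L2_NAC_QUARANTINE"


def auto_actions(mode: Mode, tier: int) -> list[str]:
    """Return the actions taken without an analyst for a given detection tier.

    Tiers 1 and 2 are the deterministic tiers. Anything else (Tier-3 anomaly,
    ML-shadow findings) never auto-acts, whatever the mode.
    """
    if tier not in (1, 2):
        return []
    if mode is Mode.GUARDED:
        return [L1_CORRECTIVE_ARP]
    if mode is Mode.ENFORCE:
        return [L1_CORRECTIVE_ARP, L2_NAC_QUARANTINE]
    return []  # MONITOR is alert-only
```

For example, `auto_actions(Mode.ENFORCE, 3)` returns an empty list: a Tier-3 finding waits for an analyst even on a hardened segment.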

3. Triage runbook (most common)

Symptom: A CRITICAL incident appears on the Dashboard.

Open the incident modal. Read the trigger rules and the claimed IP/MAC.
Cross‑check against Binding Database: is the attacker MAC known and protected, or is the IP a gateway that's supposed to be at a different MAC?

If the incident is genuine:

In GUARDED/ENFORCE, the controller has already taken L1; confirm in Audit Log that the corrective ARP went out.

Click Quarantine to escalate to L2 if NAC is wired up.
Capture forensics with Export Forensics (downloads a JSON bundle with incident + detections + audit rows).

If it's a false positive:

Click Add to Allow‑list — the attacker's IP↔MAC pair becomes an approved binding.
Update vmac_allowlist directly via SQL if it's an HA / cluster vMAC.
Acknowledge the incident.
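The cross-check in the second step can be pictured as a lookup against the binding table. The sketch below is only an illustration under stated assumptions: the `Binding` tuple, the `cross_check` function, and its return labels are hypothetical, and it ignores port and VLAN for brevity.

```python
from typing import NamedTuple


class Binding(NamedTuple):
    ip: str
    mac: str
    kind: str  # "protected", "baseline", or "learned"


def cross_check(bindings: list[Binding], claimed_ip: str, claimed_mac: str) -> str:
    """Classify a claimed IP/MAC pair against the known bindings."""
    mac = claimed_mac.lower()
    same_ip = [b for b in bindings if b.ip == claimed_ip]
    if not same_ip:
        return "unknown-ip"
    if any(b.mac.lower() == mac for b in same_ip):
        return "matches-binding"  # the claim agrees with the truth table
    # The IP is expected at a different MAC, e.g. a spoofed gateway.
    return f"conflicts-with-{same_ip[0].kind}"
```

A `conflicts-with-protected` result for the gateway IP points towards a genuine incident, while `matches-binding` suggests the false-positive branch.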

4. Runbook: false positive on legitimate gateway ARP

Almost always caused by a stale lab baseline: the gateway entry in bindings must match the gateway's real MAC.

# Find the gateway's real MAC (-wF matches the address literally, so 192.168.10.10 is not picked up)
arp -a | grep -wF 192.168.10.1
# Or from the sensor host:
ip neigh show 192.168.10.1

# Correct the bindings table first, then sync the baseline
# (backend/control/baseline_sync.py reads the bindings table)
psql -h 127.0.0.1 -U postgres -d arpg -c \
  "UPDATE bindings SET mac = '<REAL_MAC>' WHERE ip = '192.168.10.1' AND vlan = 10;"
python3 backend/control/baseline_sync.py

Once the binding is correct, acknowledge the open incident.
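The MAC printed by `arp -a` is not always in the same textual form: some platforms drop leading zeros in each octet, and others separate octets with hyphens. Before pasting a value into the UPDATE above, it may help to normalise it. The helper below is a sketch; `normalize_mac` is a hypothetical name, and it assumes the bindings table stores lowercase, colon-separated, zero-padded MACs.

```python
import re


def normalize_mac(raw: str) -> str:
    """Return a MAC as six zero-padded, lowercase, colon-separated octets."""
    parts = re.split(r"[:\-]", raw.strip())
    if len(parts) != 6:
        raise ValueError(f"not a 6-octet MAC: {raw!r}")
    octets = []
    for part in parts:
        if not re.fullmatch(r"[0-9A-Fa-f]{1,2}", part):
            raise ValueError(f"bad octet {part!r} in {raw!r}")
        octets.append(part.lower().zfill(2))  # "0" becomes "00"
    return ":".join(octets)
```

With this, `0:1A:2b:3:4c:5D` and `00-1A-2B-03-4C-5D` both normalise to the same string, so a comparison against the stored binding is not thrown off by formatting alone.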

5. Runbook: sensor offline

Check Sensors panel — is it offline? When was it last seen?
On the host: systemctl status arpg-sensor (or your make sensor-* foreground).
Common causes:

Caps dropped after rebuild → make caps.
NIC not promiscuous → sudo ip link set ens33 promisc on.
JetStream consumer wedged → restart the correlator (it bootstraps with DeleteConsumer + DeliverNew).
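To confirm the promiscuous-mode cause without guessing, the interface flags can be read from sysfs on Linux. The sketch below assumes the sensor host exposes `/sys/class/net/<iface>/flags` as a hex string and uses `IFF_PROMISC = 0x100` from the Linux headers; the function names are hypothetical, and `ens33` is simply the interface used in the example above.

```python
IFF_PROMISC = 0x100  # value of IFF_PROMISC in <linux/if.h>


def is_promiscuous(flags_text: str) -> bool:
    """Interpret the hex string read from /sys/class/net/<iface>/flags."""
    return bool(int(flags_text.strip(), 16) & IFF_PROMISC)


def read_flags(iface: str) -> str:
    """Read the raw flags string for an interface (Linux only)."""
    with open(f"/sys/class/net/{iface}/flags") as fh:
        return fh.read()


if __name__ == "__main__":
    # Assumes an interface named ens33 exists on this host.
    print("promiscuous:", is_promiscuous(read_flags("ens33")))
```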

6. Runbook: detect→mitigate p95 above SLA

SLA target is <100 ms. The latency probe (make latency-probe) measures end‑to‑end via api/latency. If p95 climbs:

NATS health: make stack should show the container healthy. Restart only if needed.
PostgreSQL connections: pgxpool runs with defaults; check that no long‑running queries are holding the incidents/detections tables, e.g. with SELECT pid, query, state FROM pg_stat_activity.

Correlator backlog: Prometheus metric arpg_correlator_queue on :9110/metrics.
Sensor: arpg-sensor bench 2000000 should report ~0.9 µs/frame on lab hardware. A regression here points at the hot path; check that target/release is current and that caps are still set.
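When comparing numbers from different tools, it helps to be explicit about how p95 is computed. The sketch below uses the nearest-rank method, which is an assumption: the latency probe may compute its percentile differently. The 100 ms target comes from this section and the 500 ms paging threshold from the Escalation section; the function names are hypothetical, and the "sustained" condition is not modelled.

```python
SLA_MS = 100.0   # target from this section
PAGE_MS = 500.0  # paging threshold from the Escalation section


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def classify(samples_ms: list[float]) -> str:
    """Map one window of samples onto the thresholds in this document."""
    value = p95(samples_ms)
    if value > PAGE_MS:
        return "page"
    if value >= SLA_MS:
        return "investigate"
    return "ok"
```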

7. SOC onboarding

Day‑0 checklist for a new analyst:

Account provisioned by an admin (Users page); role = analyst for read+ack/quarantine, responder for binding writes, admin for users/policies/settings.

Walk through the Dashboard, click a sample incident, run Export Forensics to see the bundle.

Run a benign poison from backend/generator/arp_attack.py against a test segment and watch the full triage cycle.

Review SECURITY.md — the safety invariants, what auto‑acts and what doesn't.

7a. Notifications (Telegram)

Real-time alerts to a Telegram chat. Bot token is configured at the API process via the ARPG_TELEGRAM_BOT_TOKEN environment variable; with no token the dispatcher still runs but skips sending.

# At API start-up
ARPG_TELEGRAM_BOT_TOKEN=123456:AA…   \
ARPG_DASHBOARD_URL=https://arpg.example.com \
./arpg-api --addr :8080

Operator onboarding (per analyst):

Open Telegram, find the bot (@yourbotname), and send /start; the bot replies with the chat's chat_id.

Easy path: in the dashboard go to Settings → Notifications → Generate link code, copy the 6-char code, send /link CODE to the bot. The chat is registered automatically against the operator's user account.

Manual path: paste the chat_id from /start into Add channel → Chat ID with severity threshold (default CRITICAL) and optional quiet hours.

Click Test; a test message should arrive immediately. If it doesn't, the row's Last error column tells you why (firewalled egress, bot kicked, etc.).
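The one-time link code in the easy path can be thought of as a short random token that maps back to a dashboard user and is consumed on first use. The sketch below is purely illustrative: the `LinkCodes` class, the uppercase-plus-digits alphabet, and the in-memory storage are assumptions, and the real service may also expire codes after some time.

```python
import secrets
import string
from typing import Optional

# Assumed alphabet; the source only says the code is 6 characters long.
ALPHABET = string.ascii_uppercase + string.digits


class LinkCodes:
    """In-memory registry of pending one-time link codes (hypothetical)."""

    def __init__(self) -> None:
        self._pending: dict[str, str] = {}  # code -> dashboard username

    def issue(self, username: str, length: int = 6) -> str:
        """Create a random code and remember which user asked for it."""
        code = "".join(secrets.choice(ALPHABET) for _ in range(length))
        self._pending[code] = username
        return code

    def redeem(self, code: str) -> Optional[str]:
        """Return the username for a valid code and invalidate it, else None."""
        return self._pending.pop(code.strip().upper(), None)
```

Because `redeem` removes the entry, sending the same /link CODE a second time finds nothing, which is what "one-time" implies.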

Coalescing: per (channel, attacker_mac) we coalesce to one message per 60s to survive incident storms. Status-change events (closed) always pass.

Spam guard tuning: change CoalesceWindow in backend/api/alerts/service.go if 60s is wrong for your traffic.
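The coalescing rule can be sketched as follows. The production logic lives in Go in backend/api/alerts/service.go; this Python version only illustrates the behaviour described above, and the `Coalescer` class and its method names are hypothetical.

```python
class Coalescer:
    """Allow at most one alert per (channel, attacker_mac) per window."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self._last_sent: dict[tuple[str, str], float] = {}

    def should_send(self, channel: str, attacker_mac: str, now: float,
                    status_change: bool = False) -> bool:
        # Status-change events (e.g. closed) always pass.
        if status_change:
            return True
        key = (channel, attacker_mac)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False  # still inside the window: suppress
        self._last_sent[key] = now
        return True
```

Keying on the attacker MAC rather than the incident means a storm of incidents from one host collapses into a single message per window, while a second attacker on the same channel is still reported at once.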

Channel commands (in a chat with the bot):

Command	What it does
`/start`	Reply with this chat's `chat_id`
`/link CODE`	Bind this chat to your dashboard user (one-time code)
`/stop`	Disable all alerts going to this chat
`/whoami`	List channels registered for this chat
`/help`	Command list

Rotation: the bot token lives only in the process env, never in DB or source. To rotate, revoke the token via @BotFather and restart arpg-api with the new one; registered chat_ids keep working unchanged.

8. Observability

Prometheus scrapes the correlator at :9110/metrics and the host node_exporter.
Grafana dashboard uid arpg-overview (provisioning in infra/grafana/). Login admin / arpg_admin.

SIEM export via backend/control/siem_connector.py: CEF + ECS over syslog. Forward to your existing collector.

9. Routine maintenance

Cadence	Task
Daily	Triage active incidents; review FP rate trend
Weekly	Confirm sensor heartbeats; review binding churn; rotate JWT secret if compromised
Monthly	Re‑train ML shadow model (`make ml-train`); sanity‑check eval scores
Per change	Run `make eval` after baseline / rule changes; record measured precision/recall

10. Escalation

Sensor outage > 5 min on a protected segment → page network on‑call.
Detect→mitigate p95 > 500 ms sustained → page platform on‑call.
Auto‑mitigation circuit breaker tripped (more than N actions/minute) → analyst reviews mitigation_audit immediately; do not bypass the breaker.
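A circuit breaker of the kind described in the last item can be sketched as a sliding-window counter. This is an illustration only: the `MitigationBreaker` class is hypothetical, the value of N is left to the caller, and the assumption that the breaker stays tripped until someone explicitly resets it is not confirmed by this document.

```python
from collections import deque


class MitigationBreaker:
    """Trip when more than max_actions are attempted within window_s seconds."""

    def __init__(self, max_actions: int, window_s: float = 60.0) -> None:
        self.max_actions = max_actions
        self.window_s = window_s
        self.tripped = False
        self._times: deque[float] = deque()

    def allow(self, now: float) -> bool:
        """Return True if an automatic action may run at time `now`."""
        if self.tripped:
            return False
        # Drop actions that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_s:
            self._times.popleft()
        if len(self._times) >= self.max_actions:
            self.tripped = True  # assumed to latch until an analyst resets it
            return False
        self._times.append(now)
        return True

    def reset(self) -> None:
        """Called only after the audit rows have been reviewed."""
        self.tripped = False
        self._times.clear()
```

Latching is the conservative choice in this sketch: once tripped, no further automatic action runs, even after the burst has passed, until a person has looked at the audit trail.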