Operations

What an operator / SOC analyst needs to run the platform day‑to‑day: console workflows, mitigation modes, runbooks for common incidents, on‑call escalation.

1. Console at a glance

  • Dashboard — KPI bar (active incidents, detect→mitigate p95, false positives 24 h, sensor coverage), incident trend (24 h / 7 d / 30 d), active incidents table, network segments grid, sensor status panel. Polls every 5 s.

  • Incidents — Active / All / Archive tabs, full search & severity/status filters, click → modal with attacker/victim/MITRE/tier/rules + Quarantine / Acknowledge / Release / Export Forensics / Add to Allow‑list actions.

  • Segments — per‑VLAN host & binding counts, ARP rate vs. baseline, active incidents.
  • Sensors — frames/24 h, p95 latency, VLAN coverage, online/degraded/offline.
  • Binding Database — IP↔MAC↔Port↔VLAN truth table; protected / baseline / learned.
  • Policies — per‑segment mitigation mode (MONITOR / GUARDED / ENFORCE).
  • Audit Log — append‑only mitigation history with timestamp, operator, action, result.
  • Users — admin only; create / delete / change role.
  • Settings — global automation flags, retention, SIEM/email integration.

Default demo accounts (rotate before going live): admin/ARPG_ADMIN_PASSWORD, analyst/analyst123, responder/responder123, viewer/viewer123.

2. Mitigation modes

Set per segment in Policies or globally via the correlator start flag (MODE=).

Mode    | Auto‑acts              | Use when
MONITOR | nothing (alert‑only)   | Bring‑up, low confidence in baseline, audit periods
GUARDED | L1 corrective ARP      | Default safe production — fixes the segment, doesn't quarantine
ENFORCE | L1 + L2 NAC quarantine | Hardened sites with working RADIUS CoA / NAC actuator

Auto‑mitigation needs a deterministic Tier‑1 or Tier‑2 hit. Tier‑3 anomaly or ML‑shadow findings never auto‑act — they raise severity and require an analyst to push Quarantine. TTL auto‑revert returns the segment to its prior state if the threat doesn't recur.
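
The mode table and the tier rule can be read together as one small decision function. The following is a minimal Python sketch of that reading, not the correlator's actual code: the action labels (L1_CORRECTIVE_ARP, L2_NAC_QUARANTINE) and the function name are hypothetical, and TTL auto‑revert is left out.

```python
from enum import Enum


class Mode(Enum):
    MONITOR = "MONITOR"
    GUARDED = "GUARDED"
    ENFORCE = "ENFORCE"


def auto_actions(mode: Mode, tier: int) -> list[str]:
    """Return the actions taken without analyst input for one detection.

    Tier 1 and Tier 2 are deterministic hits. Tier 3 (anomaly / ML shadow)
    never auto-acts, whatever the segment mode is.
    """
    if tier not in (1, 2):
        return []  # analyst must push Quarantine manually
    if mode is Mode.GUARDED:
        return ["L1_CORRECTIVE_ARP"]  # hypothetical action label
    if mode is Mode.ENFORCE:
        return ["L1_CORRECTIVE_ARP", "L2_NAC_QUARANTINE"]  # hypothetical labels
    return []  # MONITOR: alert only


if __name__ == "__main__":
    for mode in Mode:
        for tier in (1, 2, 3):
            print(mode.value, tier, auto_actions(mode, tier))
```

The point of the sketch is that the tier check comes first: a Tier‑3 finding yields no automatic action even on an ENFORCE segment.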

3. Triage runbook (most common)

Symptom: A CRITICAL incident appears on the Dashboard.

  1. Open the incident modal. Read the trigger rules and the claimed IP/MAC.
  2. Cross‑check against Binding Database — is the attacker MAC known and protected, or is the IP a gateway that's supposed to be at a different MAC?
  3. If the incident is genuine:
    • In GUARDED/ENFORCE, the controller has already taken L1 — confirm in Audit Log that the corrective ARP went out.
    • Click Quarantine to escalate to L2 if NAC is wired up.
    • Capture forensics with Export Forensics (downloads JSON bundle with incident + detections + audit rows).
  4. If it's a false positive:
    • Click Add to Allow‑list — the attacker's IP↔MAC pair becomes an approved binding.
    • Update vmac_allowlist directly via SQL if it's an HA / cluster vMAC.
    • Acknowledge the incident.
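
The cross‑check against the Binding Database in the runbook above amounts to a lookup of the claimed IP/MAC pair. A rough Python sketch follows, assuming bindings are available as simple (ip, mac, trust) records. The Binding type, the function name, and the returned labels are all hypothetical and only illustrate the questions an analyst asks.

```python
from typing import NamedTuple


class Binding(NamedTuple):
    ip: str
    mac: str
    trust: str  # "protected", "baseline" or "learned"


def cross_check(claimed_ip: str, claimed_mac: str, bindings: list[Binding]) -> str:
    """Classify an ARP claim against the binding table (labels are illustrative)."""
    mac = claimed_mac.lower()
    for b in bindings:
        if b.ip == claimed_ip:
            if b.mac.lower() == mac:
                return "matches-binding"  # claim agrees with the table: likely a false positive
            return "conflicts-with-" + b.trust  # IP is bound to a different MAC
    if any(b.mac.lower() == mac for b in bindings):
        return "known-mac-new-ip"  # known host claiming an unbound IP
    return "unknown"


if __name__ == "__main__":
    table = [Binding("192.168.10.1", "AA:BB:CC:00:00:01", "protected")]
    print(cross_check("192.168.10.1", "de:ad:be:ef:00:01", table))
```

A "conflicts-with-protected" result on a gateway IP is the case that points at a genuine incident; "matches-binding" suggests the false‑positive branch.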

4. Runbook: false positive on legitimate gateway ARP

Almost always caused by a stale lab baseline: the gateway MAC recorded in the bindings table no longer matches the real gateway MAC. The two must agree.

# Find the real MAC
arp -a | grep -wF 192.168.10.1
# Or from the sensor host:
ip neigh show 192.168.10.1

# Sync baseline (backend/control/baseline_sync.py reads the bindings table; fix the row there first if needed)
psql -h 127.0.0.1 -U postgres -d arpg -c \
  "UPDATE bindings SET mac = '<REAL_MAC>' WHERE ip = '192.168.10.1' AND vlan = 10;"
python3 backend/control/baseline_sync.py

Once the binding is correct, acknowledge the open incident.
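
When this check is scripted rather than done by eye, the observed MAC can be pulled out of the ip neigh output and compared case‑insensitively with the value stored in bindings. The Python sketch below assumes the usual iproute2 line shape ("<ip> dev <if> lladdr <mac> <state>"). Both helper names are hypothetical and are not part of baseline_sync.py.

```python
import re
from typing import Optional

# Matches the link-layer address that follows "lladdr" in iproute2 output.
_LLADDR = re.compile(r"\blladdr\s+((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\b")


def mac_from_ip_neigh(line: str) -> Optional[str]:
    """Extract the MAC from one line of `ip neigh show`, lowercased, or None."""
    match = _LLADDR.search(line)
    return match.group(1).lower() if match else None


def binding_is_stale(db_mac: str, observed_mac: str) -> bool:
    """True when the stored MAC differs from the observed one (case-insensitive)."""
    return db_mac.strip().lower() != observed_mac.strip().lower()


if __name__ == "__main__":
    sample = "192.168.10.1 dev ens33 lladdr 00:50:56:AB:CD:EF REACHABLE"
    print(mac_from_ip_neigh(sample))
```

An entry in FAILED or INCOMPLETE state carries no lladdr field, so the helper returns None instead of guessing.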

5. Runbook: sensor offline

  1. Check Sensors panel — is it offline? When was it last seen?
  2. On the host: systemctl status arpg-sensor (or your make sensor-* foreground).
  3. Common causes:
    • Caps dropped after rebuild → make caps.
    • NIC not promiscuous → sudo ip link set ens33 promisc on.
    • JetStream consumer wedged → restart the correlator (it bootstraps with DeleteConsumer + DeliverNew).

6. Runbook: detect→mitigate p95 above SLA

SLA target is <100 ms. The latency probe (make latency-probe) measures end‑to‑end via api/latency. If p95 climbs:

  1. NATS health: make stack should show the container healthy. Restart only if needed.
  2. PostgreSQL connections: pgxpool defaults; check no long‑running queries are holding the incidents/detections tables. SELECT pid, query, state FROM pg_stat_activity.
  3. Correlator backlog: Prometheus metric arpg_correlator_queue on :9110/metrics.
  4. Sensor: arpg-sensor bench 2000000 should report ~0.9 µs/frame on lab hardware. A regression here points at the hot path; check that target/release is current and that caps are still set.
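
For reference, a p95 figure can be reproduced from raw latency samples with a nearest‑rank percentile. The Python sketch below assumes samples in milliseconds and uses the 100 ms SLA target stated above. How the latency probe itself aggregates is not documented here, so the nearest‑rank method is an assumption.

```python
import math

SLA_P95_MS = 100.0  # SLA target from this runbook: p95 must stay below 100 ms


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breaches_sla(samples_ms: list[float]) -> bool:
    """True when the p95 of the samples is not below the SLA target."""
    return percentile(samples_ms, 95) >= SLA_P95_MS


if __name__ == "__main__":
    samples = [12.0] * 90 + [150.0] * 10  # made-up samples with a slow tail
    print(percentile(samples, 95), breaches_sla(samples))
```

A single slow outlier in 100 samples does not move p95, which is why the metric is used for the SLA rather than the maximum.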

7. SOC onboarding

Day‑0 checklist for a new analyst:

  • Account provisioned by an admin (Users page); role = analyst for read+ack/quarantine, responder for binding writes, admin for users/policies/settings.

  • Walk through the Dashboard, click a sample incident, run Export Forensics to see the bundle.

  • Run a benign poison from backend/generator/arp_attack.py against a test segment and watch the full triage cycle.

  • Review SECURITY.md — the safety invariants, what auto‑acts and what doesn't.

7a. Notifications (Telegram)

Real-time alerts to a Telegram chat. The bot token is supplied to the API process through the ARPG_TELEGRAM_BOT_TOKEN environment variable; without a token the dispatcher still runs but skips sending.

# At API start-up
ARPG_TELEGRAM_BOT_TOKEN=123456:AA…   \
ARPG_DASHBOARD_URL=https://arpg.example.com \
./arpg-api --addr :8080

Operator onboarding (per analyst):

  1. Open Telegram, find the bot (@yourbotname), send /start — the bot replies with the chat's chat_id.
  2. Easy path: in the dashboard go to **Settings → Notifications → Generate link code**, copy the 6-char code, send /link CODE to the bot. The chat is registered automatically against the operator's user account.
  3. Manual path: paste the chat_id from /start into **Add channel → Chat ID** with severity threshold (default CRITICAL) and optional quiet hours.
  4. Click Test — a test message should arrive immediately. If it doesn't, the row's Last error column tells you why (firewalled egress, bot kicked, etc.).

Coalescing: alerts are coalesced to one message per (channel, attacker_mac) per 60 s so that an incident storm does not flood the chat. Status-change events (closed) always pass.

Spam guard tuning: change CoalesceWindow in backend/api/alerts/service.go if 60s is wrong for your traffic.
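
The coalescing rule can be pictured with a small Python sketch. It illustrates the described behaviour only and is not the Go code in backend/api/alerts/service.go. The class and method names are hypothetical, and whether suppressed messages restart the window in the real service is not stated, so this version measures the window from the last message actually sent.

```python
class Coalescer:
    """At most one alert per (channel, attacker_mac) per window; status changes always pass."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self._last_sent: dict[tuple[str, str], float] = {}

    def should_send(self, channel: str, attacker_mac: str, now: float,
                    status_change: bool = False) -> bool:
        if status_change:
            return True  # e.g. an incident being closed is never suppressed
        key = (channel, attacker_mac.lower())
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False  # still inside the window: drop this alert
        self._last_sent[key] = now
        return True


if __name__ == "__main__":
    c = Coalescer()
    print([c.should_send("chat-1", "de:ad:be:ef:00:01", t) for t in (0.0, 10.0, 59.0, 60.0)])
```

Keying on the pair rather than on the MAC alone means a second chat subscribed to the same segment still receives its own first alert.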

Channel commands (in a chat with the bot):

Command    | What it does
/start     | Reply with this chat's chat_id
/link CODE | Bind this chat to your dashboard user (one-time code)
/stop      | Disable all alerts going to this chat
/whoami    | List channels registered for this chat
/help      | Command list

Rotation: the bot token lives only in the process env, never in DB or source. To rotate, revoke the old token via @BotFather and restart arpg-api with the new one; registered chat_ids keep working unchanged.

8. Observability

  • Prometheus scrapes the correlator at :9110/metrics and the host node_exporter.
  • Grafana dashboard uid arpg-overview (provisioning in infra/grafana/). Login admin / arpg_admin.

  • SIEM export via backend/control/siem_connector.py — CEF + ECS over syslog. Forward to your existing collector.
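
To show what a collector should expect on the CEF side, here is a minimal Python sketch that builds one CEF line. The vendor, product and version strings, the signature ID, and the function names are placeholders. The real field mapping lives in siem_connector.py and may differ. Only backslash, pipe and equals escaping is handled.

```python
def _esc_header(value: str) -> str:
    # CEF header fields escape backslash and pipe.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def _esc_ext(value: str) -> str:
    # CEF extension values escape backslash and equals.
    return value.replace("\\", "\\\\").replace("=", "\\=")


def to_cef(signature_id: str, name: str, severity: int, ext: dict[str, str]) -> str:
    """Format one event as CEF:0|vendor|product|version|sig|name|severity|extension."""
    # "ARPG", "arp-guard" and "1.0" are placeholder header values.
    fields = ["ARPG", "arp-guard", "1.0", signature_id, name, str(severity)]
    head = "|".join(["CEF:0"] + [_esc_header(f) for f in fields])
    extension = " ".join(f"{key}={_esc_ext(val)}" for key, val in ext.items())
    return f"{head}|{extension}"


if __name__ == "__main__":
    print(to_cef("T1-GW-SPOOF", "Gateway ARP spoof", 10,
                 {"src": "192.168.10.66", "smac": "de:ad:be:ef:00:01"}))
```

src and smac are standard CEF extension keys for source address and source MAC; any further keys would need to be checked against what the connector actually emits.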

9. Routine maintenance

Cadence    | Task
Daily      | Triage active incidents; review FP rate trend
Weekly     | Confirm sensor heartbeats; review binding churn; rotate JWT secret if compromised
Monthly    | Re‑train ML shadow model (make ml-train); sanity‑check eval scores
Per change | Run make eval after baseline / rule changes; record measured precision/recall
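
For the "record measured precision/recall" step, the two figures follow from the confusion counts of an eval run. The Python sketch below uses the standard definitions. The counts in the example are made up, and the output format of make eval is not assumed.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up counts: 90 true detections, 10 false alarms, 30 missed attacks.
    p, r = precision_recall(tp=90, fp=10, fn=30)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Recording both numbers matters because a rule change that lowers the FP rate can also lower recall without showing up in the daily false‑positive trend.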

10. Escalation

  • Sensor outage > 5 min on a protected segment → page network on‑call.
  • Detect→mitigate p95 > 500 ms sustained → page platform on‑call.
  • Auto‑mitigation circuit breaker tripped (more than N actions/minute) → analyst reviews mitigation_audit immediately; do not bypass the breaker.
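
The circuit breaker condition ("more than N actions/minute") can be pictured as a sliding‑window counter that latches once exceeded. The Python sketch below is an illustration under that assumption, not the platform's breaker. The class name and the explicit reset step are hypothetical, and N stays a parameter because its value is not given here.

```python
from collections import deque


class MitigationBreaker:
    """Trips when more than max_actions are recorded within window_s seconds."""

    def __init__(self, max_actions: int, window_s: float = 60.0) -> None:
        self.max_actions = max_actions
        self.window_s = window_s
        self._stamps: deque[float] = deque()
        self.tripped = False

    def record(self, now: float) -> bool:
        """Record one auto-mitigation attempt; return True if it may proceed."""
        if self.tripped:
            return False  # latched: nothing auto-acts until an analyst resets
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()  # forget actions older than the window
        self._stamps.append(now)
        if len(self._stamps) > self.max_actions:
            self.tripped = True
            return False
        return True

    def reset(self) -> None:
        """Explicit analyst reset after the audit rows have been reviewed."""
        self._stamps.clear()
        self.tripped = False


if __name__ == "__main__":
    breaker = MitigationBreaker(max_actions=3)
    print([breaker.record(t) for t in (0.0, 1.0, 2.0, 3.0, 120.0)])
```

Latching instead of self‑clearing mirrors the rule above: once tripped, the breaker stays open until a person has looked at the audit trail.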