Operations
What an operator / SOC analyst needs to run the platform day‑to‑day: console workflows, mitigation modes, runbooks for common incidents, on‑call escalation.
1. Console at a glance
- Dashboard — KPI bar (active incidents, detect→mitigate p95, false positives 24 h, sensor coverage), incident trend (24 h / 7 d / 30 d), active incidents table, network segments grid, sensor status panel. Polls every 5 s.
- Incidents — Active/All/Archive tabs, full search & severity/status filters, click → modal with attacker/victim/MITRE/tier/rules + Quarantine / Acknowledge / Release / Export Forensics / Add to Allow‑list actions.
- Segments — per‑VLAN host & binding counts, ARP rate vs. baseline, active incidents.
- Sensors — frames/24 h, p95 latency, VLAN coverage, online/degraded/offline.
- Binding Database — IP↔MAC↔Port↔VLAN truth table; protected/baseline/learned.
- Policies — per‑segment mitigation mode (MONITOR/GUARDED/ENFORCE).
- Audit Log — append‑only mitigation history with timestamp, operator, action, result.
- Users — admin only; create / delete / change role.
- Settings — global automation flags, retention, SIEM/email integration.
Default demo accounts (rotate before going live): admin/ARPG_ADMIN_PASSWORD, analyst/analyst123, responder/responder123, viewer/viewer123.
2. Mitigation modes
Set per segment in Policies or globally via the correlator start flag (MODE=).
| Mode | Auto‑acts | Use when |
|---|---|---|
| MONITOR | nothing (alert‑only) | Bring‑up, low confidence in baseline, audit periods |
| GUARDED | L1 corrective ARP | Default safe production — fixes the segment, doesn't quarantine |
| ENFORCE | L1 + L2 NAC quarantine | Hardened sites with working RADIUS CoA / NAC actuator |
Auto‑mitigation needs a deterministic Tier‑1 or Tier‑2 hit. Tier‑3 anomaly or ML‑shadow findings never auto‑act — they raise severity and require an analyst to push Quarantine. TTL auto‑revert returns the segment to its prior state if the threat doesn't recur.
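The mode table and the tier rule above can be read as one small decision function. The sketch below is illustrative only: the names `Mode`, `auto_actions`, and the action strings are assumptions made for this example, not identifiers from the correlator, and TTL auto‑revert is left out.

```python
from enum import Enum


class Mode(Enum):
    MONITOR = "MONITOR"
    GUARDED = "GUARDED"
    ENFORCE = "ENFORCE"


def auto_actions(mode: Mode, tier: int) -> list:
    """Return the actions taken without an analyst for a detection of this tier.

    Action names are hypothetical labels for the L1 / L2 steps in the table.
    """
    # Only deterministic Tier-1 / Tier-2 hits may auto-act; Tier-3 anomaly
    # and ML-shadow findings always wait for an analyst.
    if tier not in (1, 2):
        return []
    if mode is Mode.GUARDED:
        return ["L1_CORRECTIVE_ARP"]
    if mode is Mode.ENFORCE:
        return ["L1_CORRECTIVE_ARP", "L2_NAC_QUARANTINE"]
    return []  # MONITOR is alert-only


if __name__ == "__main__":
    for mode in Mode:
        for tier in (1, 2, 3):
            print(mode.value, tier, auto_actions(mode, tier))
```

Reading the output row by row is a quick way to confirm that a segment's policy does what the operator expects before switching it to ENFORCE.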
3. Triage runbook (most common)
Symptom: A CRITICAL incident appears on the Dashboard.
- Open the incident modal. Read the trigger rules and the claimed IP/MAC.
- Cross‑check against Binding Database — is the attacker MAC known and protected, or is the IP a gateway that's supposed to be at a different MAC?
- If the incident is genuine:
  - In GUARDED/ENFORCE, the controller has already taken L1 — confirm in Audit Log that the corrective ARP went out.
  - Click Quarantine to escalate to L2 if NAC is wired up.
  - Capture forensics with Export Forensics (downloads JSON bundle with incident + detections + audit rows).
- If it's a false positive:
  - Click Add to Allow‑list — the attacker's IP↔MAC pair becomes an approved binding.
  - Update vmac_allowlist directly via SQL if it's an HA / cluster vMAC.
  - Acknowledge the incident.
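The Binding Database cross‑check in the runbook can be expressed as a lookup over binding rows. This is a rough sketch under stated assumptions: the `Binding` fields mirror the IP↔MAC↔VLAN columns and the protected/baseline/learned states mentioned earlier, while the function name and the four result labels are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    ip: str
    mac: str
    vlan: int
    state: str  # "protected", "baseline" or "learned"


def cross_check(bindings, claimed_ip, claimed_mac, vlan):
    """Classify an incident's claimed IP/MAC against known bindings."""
    claimed_mac = claimed_mac.lower()
    same_vlan = [b for b in bindings if b.vlan == vlan]
    ip_rows = [b for b in same_vlan if b.ip == claimed_ip]

    # The claim agrees with the truth table: likely a false positive.
    if any(b.mac.lower() == claimed_mac for b in ip_rows):
        return "matches-binding"
    # The IP is pinned to a different MAC (for example a gateway).
    if any(b.state == "protected" for b in ip_rows):
        return "conflicts-with-protected"
    # The MAC is a known host, but it is claiming someone else's IP.
    if any(b.mac.lower() == claimed_mac for b in same_vlan):
        return "known-mac-wrong-ip"
    return "unknown-mac"
```

A "conflicts-with-protected" result is the case that most strongly suggests a genuine incident; "matches-binding" points toward the false‑positive branch of the runbook.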
4. Runbook: false positive on legitimate gateway ARP
Almost always caused by a stale lab baseline. The real gateway MAC must match what's in bindings.
# Find the real MAC
arp -a | grep 192.168.10.1
# Or from the sensor host:
ip neigh show 192.168.10.1
# Sync baseline (control/baseline_sync.py reads bindings table; edit there first if needed)
psql -h 127.0.0.1 -U postgres -d arpg -c \
"UPDATE bindings SET mac = '<REAL_MAC>' WHERE ip = '192.168.10.1' AND vlan = 10;"
python3 backend/control/baseline_sync.py
Once the binding is correct, acknowledge the open incident.
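To avoid hand‑copying the MAC into the UPDATE, the two steps above could be scripted. The sketch below assumes the usual `ip neigh show` line shape (`<ip> dev <if> lladdr <mac> <STATE>`) and builds a parameterized statement with psycopg‑style `%s` placeholders; both helper names are hypothetical and nothing here is part of baseline_sync.py.

```python
import re
from typing import Optional

# Matches the link-layer address field in one line of `ip neigh show` output.
MAC_RE = re.compile(r"\blladdr\s+((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\b")


def mac_from_ip_neigh(line: str) -> Optional[str]:
    """Return the lowercase MAC from an `ip neigh` line, or None if absent."""
    match = MAC_RE.search(line)
    return match.group(1).lower() if match else None


def binding_update(ip: str, mac: str, vlan: int):
    """Build the UPDATE from the runbook as (sql, params) for a DB driver."""
    sql = "UPDATE bindings SET mac = %s WHERE ip = %s AND vlan = %s;"
    return sql, (mac, ip, vlan)


if __name__ == "__main__":
    sample = "192.168.10.1 dev ens33 lladdr 00:50:56:AB:CD:EF REACHABLE"
    mac = mac_from_ip_neigh(sample)
    if mac is not None:
        print(binding_update("192.168.10.1", mac, 10))
```

An entry in FAILED or INCOMPLETE state carries no `lladdr` field, so the helper returns None rather than a stale address.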
5. Runbook: sensor offline
- Check Sensors panel — is it offline? When was it last seen?
- On the host: systemctl status arpg-sensor (or your make sensor-* foreground).
- Common causes:
  - Caps dropped after rebuild → make caps.
  - NIC not promiscuous → sudo ip link set ens33 promisc on.
  - JetStream consumer wedged → restart the correlator (it bootstraps with DeleteConsumer + DeliverNew).
6. Runbook: detect→mitigate p95 above SLA
SLA target is <100 ms. The latency probe (make latency-probe) measures end‑to‑end via api/latency. If p95 climbs:
- NATS health: make stack should show the container healthy. Restart only if needed.
- PostgreSQL connections: pgxpool defaults; check no long‑running queries are holding the incidents/detections tables. SELECT pid, query, state FROM pg_stat_activity.
- Correlator backlog: Prometheus metric arpg_correlator_queue on :9110/metrics.
- Sensor: arpg-sensor bench 2000000 should report ~0.9 µs/frame on lab hardware. A regression here points at the hot path; check that target/release is current and that caps are still set.
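For a quick offline check of collected latency samples against the <100 ms target, a p95 can be computed by hand. This sketch uses the nearest‑rank definition, which is an assumption: the document does not say how the latency probe computes its percentile, and the function names are illustrative.

```python
SLA_MS = 100.0  # target from this runbook: p95 must stay below 100 ms


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # 1-based rank = ceil(0.95 * n), done in integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def breaches_sla(samples_ms, sla_ms=SLA_MS):
    """True when the p95 is at or above the SLA target."""
    return p95(samples_ms) >= sla_ms
```

With 100 samples this picks the 95th smallest value, so up to five slow outliers do not move the result.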
7. SOC onboarding
Day‑0 checklist for a new analyst:
- Account provisioned by an admin (Users page); role = analyst for read+ack/quarantine, responder for binding writes, admin for users/policies/settings.
- Walk through the Dashboard, click a sample incident, run Export Forensics to see the bundle.
- Run a benign poison from backend/generator/arp_attack.py against a test segment and watch the full triage cycle.
- Review SECURITY.md — the safety invariants, what auto‑acts and what doesn't.
7a. Notifications (Telegram)
Real-time alerts to a Telegram chat. Bot token is configured at the API process via the ARPG_TELEGRAM_BOT_TOKEN environment variable; with no token the dispatcher still runs but skips sending.
# At API start-up
ARPG_TELEGRAM_BOT_TOKEN=123456:AA… \
ARPG_DASHBOARD_URL=https://arpg.example.com \
./arpg-api --addr :8080
Operator onboarding (per analyst):
- Open Telegram, find the bot (@yourbotname), send /start — the bot replies with the chat's chat_id.
- Easy path: in the dashboard go to **Settings → Notifications → Generate link code**, copy the 6-char code, send /link CODE to the bot. The chat is registered automatically against the operator's user account.
- Manual path: paste the chat_id from /start into **Add channel → Chat ID** with severity threshold (default CRITICAL) and optional quiet hours.
- Click Test — a test message should arrive immediately. If it doesn't, the row's Last error column tells you why (firewalled egress, bot kicked, etc.).
Coalescing: per (channel, attacker_mac) we coalesce to one message per 60s to survive incident storms. Status-change events (closed) always pass.
Spam guard tuning: change CoalesceWindow in backend/api/alerts/service.go if 60s is wrong for your traffic.
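The coalescing rule can be pictured as a per‑key timestamp check. The real logic lives in the Go service named above; the Python below is only a sketch of the described behaviour, with a hypothetical `Coalescer` class and the assumption that status‑change messages do not reset the window.

```python
class Coalescer:
    """Allow one alert per (channel, attacker_mac) per window."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._last_sent = {}  # (channel, attacker_mac) -> time of last send

    def should_send(self, channel, attacker_mac, now, status_change=False):
        # Status-change events (for example "closed") always pass.
        if status_change:
            return True
        key = (channel, attacker_mac)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False  # still inside the window: suppress
        self._last_sent[key] = now
        return True
```

Passing `now` in explicitly (rather than reading the clock inside the method) keeps the behaviour easy to check when experimenting with a different window.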
Channel commands (in a chat with the bot):
| Command | What it does |
|---|---|
| /start | Reply with this chat's chat_id |
| /link CODE | Bind this chat to your dashboard user (one-time code) |
| /stop | Disable all alerts going to this chat |
| /whoami | List channels registered for this chat |
| /help | Command list |
Rotation: the bot token lives only in the process env, never in DB or source. To rotate, revoke the old token via @BotFather and restart arpg-api with the new token; registered chat_ids keep working unchanged.
8. Observability
- Prometheus scrapes the correlator at :9110/metrics and the host node_exporter.
- Grafana dashboard uid arpg-overview (provisioning in infra/grafana/). Login admin / arpg_admin.
- SIEM export via backend/control/siem_connector.py — CEF + ECS over syslog. Forward to your existing collector.
9. Routine maintenance
| Cadence | Task |
|---|---|
| Daily | Triage active incidents; review FP rate trend |
| Weekly | Confirm sensor heartbeats; review binding churn; rotate JWT secret if compromised |
| Monthly | Re‑train ML shadow model (make ml-train); sanity‑check eval scores |
| Per change | Run make eval after baseline / rule changes; record measured precision/recall |
10. Escalation
- Sensor outage > 5 min on a protected segment → page network on‑call.
- Detect→mitigate p95 > 500 ms sustained → page platform on‑call.
- Auto‑mitigation circuit breaker tripped (more than N actions/minute) → analyst
reviews mitigation_audit immediately; do not bypass the breaker.
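The breaker rule ("more than N actions/minute") can be modelled as a sliding‑window counter that latches once tripped. This is a sketch under assumptions: the class name, the explicit `reset` step, and the latch‑until‑reviewed behaviour are illustrative choices, not a description of the controller's code.

```python
from collections import deque


class MitigationBreaker:
    """Trip when more than max_actions are attempted within window_s."""

    def __init__(self, max_actions, window_s=60.0):
        self.max_actions = max_actions
        self.window_s = window_s
        self._stamps = deque()  # times of allowed actions inside the window
        self.tripped = False

    def allow(self, now):
        if self.tripped:
            return False  # stays open until an analyst resets it
        # Drop actions that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_actions:
            self.tripped = True  # this would be action N+1 in the window
            return False
        self._stamps.append(now)
        return True

    def reset(self):
        """Re-arm after mitigation_audit has been reviewed."""
        self._stamps.clear()
        self.tripped = False
```

Latching rather than auto‑recovering matches the instruction not to bypass the breaker: automation resumes only after a deliberate reset.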