Saturday, November 22, 2025

Building Trustable Telemetry: A Secure, Scalable Audit‑Governance Architecture for Distributed Validation

## 15‑Minute Demo Script — Docker Compose (ready-to-run)


Prep (2 minutes)

- Confirm prerequisites: Docker and docker-compose installed.

- Place these files in an empty directory:

  - architecture_telemetry_audit.png (architecture diagram, shown in the intro)

  - demo_telemetry_audit_docker_compose.sh

- Make script executable:

  chmod +x demo_telemetry_audit_docker_compose.sh


Start (1 minute)

- Run:

  ./demo_telemetry_audit_docker_compose.sh

- Wait ~20–30s for services to start.


Minute 0:30 — Intro (1 minute)

- Show architecture diagram (PNG) and state demo goals:

  - Simulate validation node heartbeats → registry

  - Telemetry → Kafka

  - Heartbeat Monitor → Prometheus metrics (Online/Degraded/Offline)

  - Show failure, alerting state, and recovery

  - Explain where immutable audit & evidence bundling fit in the full system


Minute 1:30 — Verify baseline (2 minutes)

- Open registry dump:

  http://localhost:5000/dump

  Expected: JSON containing node-demo-1 with a recent last-seen timestamp

- Open Prometheus UI:

  http://localhost:9090

  Query metrics:

  nodes_online  → expected 1

  nodes_degraded → 0

  nodes_offline → 0

- Run Kafka consumer to show telemetry:

  docker run --rm --network host confluentinc/cp-kafka:latest bash -c "kafka-console-consumer --bootstrap-server localhost:9092 --topic telemetry --from-beginning --max-messages 3"

  Expected: 3 JSON telemetry messages


Talking point: telemetry and heartbeat flows are decoupled — heartbeat to registry, telemetry to Kafka.


Minute 3:30 — Simulate degraded/offline (4 minutes)

- Pause producer (simulates agent stall):

  Identify producer container:

  docker ps --filter "ancestor=python:3.11-slim"

  Pause it:

  docker pause <container-id>

- Explain thresholds: HBMon uses 10s (Degraded) and 25s (Offline) in the demo.

- Wait ~15–30s and refresh Prometheus queries:

  nodes_degraded should rise above 0, then nodes_offline, as the 10s and 25s thresholds elapse.

- Show registry dump: the last-seen timestamp is stale, older than the degraded threshold.

- Talking points:

  - How HBMon determines state from last-seen

  - Alerts would be routed via Alertmanager → PagerDuty in production

  - Audit trail: HBMon writes signed state changes to immutable audit (simulated by noting registry dump and timestamps)


Minute 7:30 — Triage & remediation (4 minutes)

- Unpause producer to simulate resolution:

  docker unpause <container-id>

- Check registry dump: a fresh last-seen timestamp appears within seconds.

- In Prometheus, nodes_online returns to 1, degraded/offline to 0.

- Demonstrate a quick agent restart alternative:

  docker-compose stop producer && docker-compose start producer

  Verify telemetry resumes via Kafka consumer (run consumer for a few messages).

- Talking points:

  - Automated remediation options (soft restart, reschedule)

  - Evidence capture: on recovery, produce an evidence bundle containing heartbeats, logs, and digests (explain where this is implemented in the full system)
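A minimal sketch of such an evidence bundle, assuming a SHA-256 digest over a canonical JSON body. The function and field names are hypothetical; the full system would also pull TSDB/ES extracts and sign the digest before export.

```python
import hashlib
import json

def build_evidence_bundle(node_id: str, heartbeats: list, logs: list) -> dict:
    """Assemble an evidence bundle whose content digest makes tampering detectable.

    The body is serialized canonically (sorted keys, fixed separators) so the
    same evidence always produces the same digest.
    """
    body = {"node": node_id, "heartbeats": heartbeats, "logs": logs}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return {"body": body, "sha256": hashlib.sha256(canonical).hexdigest()}
```

Any consumer can recompute the digest from the body and compare; a mismatch means the bundle was altered after capture.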


Minute 11:30 — Forensics & audit (2 minutes)

- Show registry dump timestamps and explain immutability plan:

  - In production: append-only store + S3 WORM + digest anchoring

- Show where evidence builder would read TSDB/ES/registry and create export.
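One common way to get append-only semantics ahead of S3 WORM and digest anchoring is a hash chain, where each record's digest covers its predecessor's. A minimal sketch under that assumption (names hypothetical, not the demo's code):

```python
import hashlib
import json

def append_record(chain: list, payload: dict) -> dict:
    """Append an audit record whose digest covers the previous record's digest,
    so rewriting any earlier record breaks every digest after it."""
    prev = chain[-1]["digest"] if chain else "0" * 64
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True).encode()
    record = {"prev": prev, "payload": payload,
              "digest": hashlib.sha256(body).hexdigest()}
    chain.append(record)
    return record

def verify_chain(chain: list) -> bool:
    """Recompute every digest from genesis; any tampering surfaces as a mismatch."""
    prev = "0" * 64
    for rec in chain:
        body = json.dumps({"prev": prev, "payload": rec["payload"]},
                          sort_keys=True).encode()
        if rec["prev"] != prev or rec["digest"] != hashlib.sha256(body).hexdigest():
            return False
        prev = rec["digest"]
    return True
```

Anchoring then only needs to publish the latest digest externally (e.g. to object storage with a retention lock) to make the whole history verifiable.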


Minute 13:30 — Wrap up & next steps (1.5 minutes)

- Summarize what the demo proved:

  - Heartbeat detection, state transitions, telemetry ingestion, Kafka consumption, recovery flow

  - Where anomaly detection, automated remediation, and audit bundling integrate

- Offer next options:

  - Tune HBMon thresholds to mirror production timing

  - Swap lightweight stubs for real components (Flink, Elastic, S3 WORM)

  - Add Alertmanager → PagerDuty integration and automated runbooks


Commands cheat‑sheet (copy/paste)

- Start demo:

  ./demo_telemetry_audit_docker_compose.sh

- Registry dump:

  curl http://localhost:5000/dump

- Prometheus UI:

  http://localhost:9090 (query metrics: nodes_online, nodes_degraded, nodes_offline)

- Pause producer:

  docker pause <container-id>

- Unpause producer:

  docker unpause <container-id>

- Kafka consumer:

  docker run --rm --network host confluentinc/cp-kafka:latest bash -c "kafka-console-consumer --bootstrap-server localhost:9092 --topic telemetry --from-beginning --max-messages 5"

- Tear down:

  docker-compose down -v


End.


Copyright (c) 2025 Cory Miller
