Telemetry and Forensics: What Logs to Capture to Speed Up Outage Diagnosis (CDN, DNS, Cloud)

2026-02-23

A practical checklist for edge, origin, and DNS logs — retention, correlation, and playbooks to cut MTTR and accelerate outage forensics.

Stop guessing — capture the right logs to find root cause in minutes, not days

When a CDN edge goes dark, DNS responses spike with SERVFAIL, or an origin starts returning 5xx, every minute of diagnosis costs your customers, your SLA, and your reputation. In 2026, outages are rarely single-point failures: they span edge, origin, DNS, and cloud control planes. This article gives a practical, actionable checklist of which logs to capture, how long to keep them, and exactly how to correlate them so you can accelerate post-incident analysis and cut mean time to resolution (MTTR).

Executive summary — what to capture and why

Capture structured logs from three domains first: edge logs (CDN and POP), origin access logs (web servers, load balancers, app servers), and DNS query logs (recursors and authoritative servers). Supplement those with cloud control-plane events, network flow/BGP logs, and security telemetry. Keep time-synchronized, request-correlated data and preserve raw logs during incidents. Retain hot indexes for 30–90 days depending on SLOs, with cold storage for 1–2 years when compliance or forensic value requires it.

Checklist: Essential log sources (edge, origin, DNS, cloud)

Below is a prioritized checklist. Each item includes the minimal fields you must capture and why they matter for forensics.

1) Edge logs (CDN / POP)

Why: Edge logs show what the CDN served, cache effectiveness, TLS/handshake issues, and regional failures.

  • Minimal fields: timestamp (ISO8601 + ms), request_id/trace_id, client_ip, client_geo (POP), host/hostname, url/path, method, status_code, bytes_sent, cache_status (HIT/MISS/REVALIDATED/BYPASS), origin_response_time, edge_response_time, edge_server_id/pop_id, TLS_version/cipher, user_agent, referer, set-cookie headers (masked), response_headers (cache-control), upstream_origin_id.
  • Why each matters: cache_status + origin_response_time isolates whether an outage was cache-related or origin-related. POP and edge_server_id identify regional or POP-wide failures.
  • Format: structured JSON per request. Avoid text-only logs.
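As an illustrative sketch (field names follow the checklist above; all values are made up), a single structured edge log line might look like this when serialized:

```python
import json

# Illustrative edge log record using the minimal fields listed above.
# Values are hypothetical; this example shows an origin timeout surfacing
# as a 502 with a cache MISS at a single POP.
edge_log = {
    "timestamp": "2026-02-23T14:05:12.347Z",   # ISO8601 with ms, UTC
    "request_id": "9f1c2e4a-7b3d-4a21-9c55-0e8d6f1a2b3c",
    "client_ip": "203.0.113.42",
    "pop_id": "fra-03",
    "edge_server_id": "edge-1187",
    "host": "www.example.com",
    "path": "/api/v1/items",
    "method": "GET",
    "status_code": 502,
    "bytes_sent": 0,
    "cache_status": "MISS",
    "origin_response_time": 30000,  # ms; a timeout in this example
    "edge_response_time": 30012,    # ms
    "tls_version": "TLSv1.3",
    "upstream_origin_id": "origin-eu-1",
}

# One compact JSON object per request, one line per record.
line = json.dumps(edge_log, separators=(",", ":"))
```

Emitting one JSON object per line keeps the records trivially splittable by any log shipper while preserving the join keys (request_id, pop_id) intact.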

2) Origin access logs (LB, web/app servers, object storage)

Why: Origins show backend behavior — application errors, slow queries, and resource saturation.

  • Minimal fields: timestamp, request_id (propagated from edge), client_ip (edge or client), method, path, status_code, backend_processing_time, db_query_time (if available), upstream_service_id, pod/container id, host, resource utilization snapshot (CPU/memory at request time if possible), error/stacktrace when 5xx.
  • Why: If requests arrive to origin but time out or error, you can correlate with edge miss patterns. Resource snapshots help prove contention or OOM crashes.

3) DNS query logs (recursive resolvers & authoritative)

Why: DNS is often a silent cause of global outages; query logs show client queries, response codes, and resolver behavior.

  • Minimal fields: timestamp, transaction_id, client_ip (resolver subnet), qname (queried name), qtype, response_code (NOERROR/NXDOMAIN/SERVFAIL), authoritative_ip, answer_ttl, response_ips, EDNS client subnet (ECS) info, EDNS flags, DNSSEC validation status, zone_version, server_id.
  • Why: Patterns like SERVFAIL spikes, large volumes of NXDOMAIN, or a change in authoritative IPs reveal misconfigurations, propagation gaps, or DDoS targeting DNS infrastructure.
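A minimal sketch of the SERVFAIL-spike pattern described above: bucket responses per minute and authoritative IP, then flag buckets over a threshold (record shapes and the threshold are assumptions for illustration).

```python
from collections import Counter

def servfail_spikes(dns_logs, threshold):
    """Count SERVFAIL responses per (minute, authoritative_ip) bucket
    and return the buckets whose count meets the threshold."""
    counts = Counter()
    for rec in dns_logs:
        if rec["response_code"] == "SERVFAIL":
            minute = rec["timestamp"][:16]  # "YYYY-MM-DDTHH:MM"
            counts[(minute, rec["authoritative_ip"])] += 1
    return {k: v for k, v in counts.items() if v >= threshold}

logs = [
    {"timestamp": "2026-02-23T14:05:01Z", "response_code": "SERVFAIL", "authoritative_ip": "198.51.100.7"},
    {"timestamp": "2026-02-23T14:05:40Z", "response_code": "SERVFAIL", "authoritative_ip": "198.51.100.7"},
    {"timestamp": "2026-02-23T14:05:55Z", "response_code": "NOERROR",  "authoritative_ip": "198.51.100.7"},
]
spikes = servfail_spikes(logs, threshold=2)
```

The same grouping works for NXDOMAIN floods; only the response_code filter changes.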

4) Cloud provider control-plane and audit logs

Why: Many incidents stem from automated config changes, API errors, or region-specific control plane issues.

  • Minimal fields: timestamp, api_call, principal (IAM user/role/service), request_parameters, response_status, request_id, resource_arn, region, changelog/stack update id.
  • Why: Audit logs let you see who or what changed networking, DNS records, ingress routes, firewall rules, or autoscaling policies immediately before an outage.

5) Network telemetry (VPC flow logs, NetFlow, BGP)

Why: Network-level drops, flapping BGP routes, or peering failures often surface only in network flows.

  • Minimal fields: timestamp, src_ip, dst_ip, src_port, dst_port, protocol, bytes, packets, action (ACCEPT/DROP), interface_id, route_table_id, bgp_prefix, next_hop, as_path (if available).
  • Why: Correlating netflows with edge errors shows whether packets were dropped or never reached the origin.

6) Security telemetry (WAF, IDS/IPS, auth logs)

Why: Security engines can unintentionally block legitimate traffic or reveal attack patterns that look like outages.

  • Minimal fields: rule_id, action (BLOCK/ALLOW), request_id, src_ip, matched_payload (masked), timestamp, severity.
  • Why: A new WAF rule or false-positive can create large 403/406 spikes at the edge.

Retention strategy — balance speed, cost, and compliance

Your retention must align with incident response needs, compliance, and cost. Use a tiered model: hot, warm, cold, and archive.

  • Hot (0–30 days): Full structured logs with indices for fast search. This covers immediate post-incident analysis and initial RCA. Keep high-cardinality fields indexed (request_id, status_code, host).
  • Warm (30–90 days): Reduced indices, compressed columnar store (e.g., ClickHouse / OLAP), optimized for aggregation queries and timelines.
  • Cold (90–365 days / 1 year): Parquet/ORC files in object storage (S3/compatible) with partitioning by date and host. Useful for trend analysis and compliance-driven forensics.
  • Archive (1–7 years): Lowest-cost storage with retrieval SLA (compliance or legal requirement). Encrypt and store immutable snapshots for critical systems.

Retention policy decision factors

  1. SLO/SLA and business impact — high-value services deserve longer hot retention.
  2. Compliance — GDPR/data residency, financial or healthcare rules may require 1–7 years.
  3. Storage cost vs forensic value — store full raw logs for 30 days; compress and keep summarized logs beyond that.
  4. Legal hold — snapshot and preserve logs if litigation or regulatory inquiries are expected.

Correlation strategies — tie everything to a single truth

Correlation is the multiplier that turns logs into answers. Without consistent keys and time, you’ll spend hours matching traces manually.

1) Propagate a unique request identifier

Implement and enforce a global request_id/trace_id propagated across edge, origin, and internal services. Use OpenTelemetry or a lightweight header like X-Request-ID and ensure CDNs forward it.

  • Every log entry related to a request must include this ID.
  • When not present, add synthetic IDs at the edge and inject into downstream requests.
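The synthetic-ID rule above can be sketched as a tiny edge-side helper (the header name X-Request-ID follows the text; the function itself is a hypothetical example, not any particular CDN's API):

```python
import uuid

def ensure_request_id(headers):
    """Return a copy of the request headers with an X-Request-ID,
    generating a synthetic UUID at the edge when the client (or an
    upstream hop) did not send one."""
    out = dict(headers)
    if not out.get("X-Request-ID"):
        out["X-Request-ID"] = str(uuid.uuid4())
    return out

h = ensure_request_id({"Host": "www.example.com"})
```

Run this once at the outermost hop, then forward the header unchanged so every downstream log entry carries the same ID.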

2) Time is your friend — sync clocks and normalize zones

All systems must use NTP/Chrony and log timestamps in UTC with millisecond resolution. Normalize on ingestion if some sources are legacy.
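A sketch of the ingestion-time normalization step, assuming legacy sources emit ISO8601-parseable strings and that naive timestamps are treated as UTC (an assumption you would confirm per source):

```python
from datetime import datetime, timezone

def normalize_ts(raw):
    """Parse a timestamp from a source system and re-emit it as
    UTC ISO8601 with millisecond resolution."""
    dt = datetime.fromisoformat(raw)
    if dt.tzinfo is None:            # assumption: naive legacy stamps are UTC
        dt = dt.replace(tzinfo=timezone.utc)
    dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"

ts = normalize_ts("2026-02-23 14:05:12.347891+01:00")
```

Doing this once in the pipeline means every downstream join and timeline query can assume a single timestamp format.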

3) Enrich logs at ingestion

Enrich incoming logs with metadata: service name, deployment version, environment, region, and SLO tier. This enables fast grouping and filtering.
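A minimal enrichment sketch: merge ingestion-time metadata into each record without clobbering fields the producer already set (field names mirror the list above; the function is illustrative):

```python
def enrich(record, meta):
    """Attach ingestion-time metadata (service, version, environment,
    region, SLO tier) to a log record. Producer-set fields win on
    conflict, so enrichment never overwrites original data."""
    return {**meta, **record}

enriched = enrich(
    {"request_id": "abc", "status_code": 503},
    {"service": "checkout", "env": "prod", "region": "eu-central-1", "slo_tier": "critical"},
)
```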

4) Use structured logs and schema evolution

Prefer JSON/Protobuf/Avro logging. Define required fields and use schema registries so parsing is reliable during high-volume incidents.

5) Index smart — not everything, but the right things

Index request_id, timestamp, status_code, host, and cache_status. Avoid indexing free text fields like user_agent at index time; parse at query time or store as keyword subsets.

6) Cross-domain joins and lookups

Design queries and dashboards that join edge logs, origin logs, and DNS logs on request_id and timestamps. When request_id is missing for DNS queries, join by client_ip + time window and host name.
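The fallback join for DNS (no request_id) can be sketched as a client_ip + name match within a time window; epoch-second timestamps and a naive nested loop are simplifications for illustration — in production this would be a windowed join in your query engine:

```python
def join_dns_to_edge(dns_logs, edge_logs, window_s=5):
    """Match DNS queries to edge requests by (client_ip, name) when no
    request_id exists, accepting edge requests that arrive within
    window_s seconds after the DNS query."""
    matches = []
    for d in dns_logs:
        for e in edge_logs:
            if (d["client_ip"] == e["client_ip"]
                    and d["qname"] == e["host"]
                    and 0 <= e["ts"] - d["ts"] <= window_s):
                matches.append((d, e))
    return matches

dns = [{"ts": 100, "client_ip": "203.0.113.42", "qname": "www.example.com"}]
edge = [
    {"ts": 103, "client_ip": "203.0.113.42", "host": "www.example.com", "status_code": 502},
    {"ts": 120, "client_ip": "203.0.113.42", "host": "www.example.com", "status_code": 200},
]
pairs = join_dns_to_edge(dns, edge)
```

Note the caveat from the DNS section: the client_ip in resolver logs is often the resolver or subnet, not the end client, so treat these joins as probabilistic evidence.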

Forensic playbook: step-by-step during an outage

Having logs is only half the job — you must execute a forensic playbook. Below is a prioritized checklist to run during an outage.

Immediate containment (first 10 minutes)

  1. Freeze retention/rollover policies: prevent automated log deletion during investigation.
  2. Snapshot relevant indices and S3 prefixes for the last 72 hours (edge, DNS, control-plane).
  3. Activate a dedicated incident index with increased sampling and debug logs from core services.

Rapid triage (10–30 minutes)

  1. Check DNS query logs for SERVFAIL/NXDOMAIN spikes and authoritative changes in the last 15 minutes.
  2. Inspect edge logs for cache_status spikes (large MISS rates) and regional POP failures.
  3. Verify whether origin logs show increased latency or 5xx errors correlated to edge misses.
  4. Review cloud control-plane events for recent API-driven changes (route updates, firewall, autoscaling).

Deep forensics (30–180 minutes)

  1. Assemble a timeline: map DNS queries → edge receives request → edge calls origin → origin responses. Use request_id and timestamps.
  2. Query network flow logs for packet drops or interface errors between edge POP IPs and origin IPs.
  3. Pull WAF and IDS logs for correlated rule hits that may have blocked traffic.
  4. Collect process, container, and pod logs for fatal errors and OOMs. If necessary, capture live debugging snapshots.
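Step 1 above (assembling the timeline) reduces to a merge-and-sort across sources keyed on request_id; a sketch, assuming every source emits the normalized UTC timestamps described earlier:

```python
def build_timeline(request_id, *sources):
    """Collect every record carrying the given request_id from each
    log source and order them into a single timeline."""
    events = [r for src in sources for r in src if r.get("request_id") == request_id]
    return sorted(events, key=lambda r: r["timestamp"])

edge = [{"request_id": "r1", "timestamp": "2026-02-23T14:05:12.100Z", "source": "edge"}]
origin = [
    {"request_id": "r1", "timestamp": "2026-02-23T14:05:12.350Z", "source": "origin"},
    {"request_id": "r2", "timestamp": "2026-02-23T14:05:11.000Z", "source": "origin"},
]
timeline = build_timeline("r1", edge, origin)
```

Because the normalized timestamps sort lexicographically, no datetime parsing is needed at query time.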

Post-incident (same day to 30 days)

  1. Preserve raw logs and export forensic artifacts to a secure archive (immutable storage).
  2. Create a detailed RCA with timeline, affected customers, evidence snippets (queries), and recommended mitigations.
  3. Automate lessons learned: add new monitors, alert thresholds, and automated runbooks triggered by the patterns found.

Example: In the Jan 2026 multi-provider outage events, teams that had DNS query logs and edge-origin correlation recovered faster — they could see whether traffic was failing at recursive resolvers, being dropped at POPs, or timing out at origins.

Practical examples: fields, queries, and joining strategies

Below are concrete examples you can apply to your ELK/ClickHouse/BigQuery or SIEM stack.

Key fields to index and join on

  • request_id / trace_id — primary join key (propagate everywhere).
  • timestamp — normalized to UTC, ms resolution.
  • client_ip and effective_client_ip (after X-Forwarded-For parsing).
  • host / qname.
  • cache_status and status_code.
  • pop_id / edge_server_id and origin_server_id.

Sample forensic query patterns

Examples assume structured JSON logs in a query engine.

  • Find requests with edge errors but no origin entries: filter edge where status_code>=500 and request_id not in origin logs in same 5s window.
  • DNS SERVFAIL spike: count by minute of response_code==SERVFAIL grouped by authoritative_ip and client_subnet.
  • Cache miss wave: count cache_status==MISS by pop_id; correlate with origin_response_time percentiles to see if origin became slow at same time.
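The first pattern above — edge errors with no matching origin entry — can be sketched in plain Python over parsed records (epoch-second timestamps for brevity; in ClickHouse or BigQuery this would be an anti-join with a time predicate):

```python
def edge_errors_without_origin(edge_logs, origin_logs, window_s=5):
    """Return edge requests with status >= 500 whose request_id never
    appears in the origin logs within window_s seconds — a signature
    of traffic that died between edge and origin."""
    origin_by_id = {}
    for o in origin_logs:
        origin_by_id.setdefault(o["request_id"], []).append(o["ts"])
    orphans = []
    for e in edge_logs:
        if e["status_code"] < 500:
            continue
        hits = origin_by_id.get(e["request_id"], [])
        if not any(abs(t - e["ts"]) <= window_s for t in hits):
            orphans.append(e)
    return orphans

edge = [
    {"request_id": "r1", "ts": 100, "status_code": 502},
    {"request_id": "r2", "ts": 101, "status_code": 504},
    {"request_id": "r3", "ts": 102, "status_code": 200},
]
origin = [{"request_id": "r1", "ts": 101}]
orphans = edge_errors_without_origin(edge, origin)
```

A non-empty result here points at the network path or the load balancer rather than the application.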

Sizing and cost estimates (rule-of-thumb for 2026)

Estimate log volume before designing retention. These are conservative starting points; measure and iterate.

  • Edge logs: 200–1,000 bytes per request (JSON). For 100K RPS at peak, that's roughly 1.7–8.6 TB/day raw. Index only critical fields.
  • Origin logs: 200–400 bytes per request. For 10–20K RPS, roughly 0.2–0.7 TB/day.
  • DNS query logs: 100–250 bytes per query. High-volume authoritative servers might log 1–10M qps; plan for tens to hundreds of TB/day when full logging is enabled.
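The arithmetic behind these estimates is simple enough to keep as a helper and re-run with your own measured record sizes:

```python
def tb_per_day(rps, bytes_per_record):
    """Raw daily log volume in TB for a given request rate (requests
    per second) and average record size (bytes): rps * bytes * 86,400
    seconds, converted to terabytes."""
    return rps * bytes_per_record * 86_400 / 1e12

edge_low  = tb_per_day(100_000, 200)    # ~1.7 TB/day at 200 B/request
edge_high = tb_per_day(100_000, 1_000)  # ~8.6 TB/day at 1,000 B/request
```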

Cost controls: sample low-priority logs, compress to columnar formats, and use lifecycle policies to move to cold/archival storage.

What changed in 2026 — trends that shape telemetry

In 2026, the telemetry landscape changed: sovereign clouds, multi-CDN strategies, and AI-assisted forensics require updated thinking.

1) Multi-CDN and cross-provider tracing

Multi-CDN increases resilience but complicates correlation. Standardize request_id propagation across providers and enrich logs with provider tags (Cloudflare ray id, Akamai request id, AWS CloudFront x-amz-cf-id) so you can attribute failures quickly.

2) Sovereign clouds and data residency

With AWS European Sovereign Cloud and similar launches in 2025–26, ensure your retention policies respect data residency. Where logs cannot leave a region, deploy regional collectors and run cross-region analytics on aggregated metadata only.

3) eBPF and high-fidelity network observability

Use eBPF-based probes for low-overhead packet and socket-level telemetry inside hosts and edge appliances. This fills gaps when application logs are missing.

4) AI-assisted anomaly detection and RCA

In 2026, mature teams augment dashboards with AI models that surface correlated anomalies (e.g., simultaneous DNS SERVFAIL and cache MISS spikes). Use these as assistants, not oracles—validate before actioning.

5) Immutable logging & tamper-evidence

For forensics and compliance, write critical logs to immutable tiers or append-only ledgers. This protects RCA artifacts in disputes.

Common pitfalls and how to avoid them

  • No request_id propagation: Make this non-negotiable. Without it, correlation becomes human-intensive.
  • Unsynced clocks: Enforce NTP and timestamp normalization during ingestion.
  • Over-indexing: Index only what you query frequently; store the rest compressed.
  • Not freezing retention during incidents: Automate retention freeze and snapshot creation at incident start.
  • Missing DNS telemetry: DNS is often overlooked; add query logging at both recursive and authoritative layers.

Quick checklist you can copy

  • Edge: enable structured JSON logs; include request_id, cache_status, pop_id, origin_response_time.
  • Origin: forward request_id from edge; include backend timings and error stacktraces.
  • DNS: enable query logs on authoritative and recursive servers; include response codes and ECS where used.
  • Cloud: ingest control-plane audit logs and tag changes to networking and DNS.
  • Network: enable VPC flow logs and BGP peering logs for edge–origin circuits.
  • Retention: hot 0–30d, warm 30–90d, cold 90–365d, archive 1–7y (adjust for compliance).
  • Correlation: enforce global request_id, UTC timestamps, structured JSON, and enrichment at ingestion.
  • During incident: freeze retention, snapshot indices, enable debug logs, assemble timeline.

Final takeaways

Outages in 2026 span providers and layers. The teams that diagnose and resolve outages quickly are those that planned telemetry with correlation in mind: structured logs, request IDs, time sync, and tiered retention. Make DNS query logs, edge logs, and origin logs first-class citizens of your observability strategy. Combine these with cloud audit and network telemetry to get a complete forensic picture.

Call to action

Want a practical audit of your telemetry posture? Host-server.cloud runs a 60-minute Telemetry Forensics Workshop for DevOps and SRE teams: we map your current logs, propose retention tiers optimized for cost and compliance, and deliver a concrete plan to reduce MTTR. Book a workshop or download our incident-ready logging checklist to start closing the gap between alerts and answers.
