Why Observability Is Non-Negotiable for Network Engineering at Scale

Forget Dashboards. Build Observability Pipelines That Detect, Explain, and Prevent

In a globally scaled network, with hundreds of thousands of peering points, millions of routing updates per hour, and petabits of data traversing virtualized infrastructure across regions, visibility isn’t a luxury.

It’s the air you breathe. It’s your lifeline.

When systems fail in these environments, they don’t fail politely. They fail quietly, asymmetrically, and at speeds that exceed human reaction time.

In traditional networks, we were trained to log in, trace the problem manually, and fix it:

  • SSH into a router,

  • Run a few show commands,

  • Correlate syslogs or SNMP traps,

  • Find the culprit.

That model worked when the problem space was local and the topology was small. However, in globally distributed networks, this paradigm breaks down. A single route redistribution misfire, a misconfigured VPC route table, or a leaky eBGP advertisement can ripple across transit gateways, edge locations, and private backbones, all without immediate symptoms.

By the time someone “notices,” you’re already in damage control mode.

Observability at Scale Is Not Monitoring. It’s Systems Engineering

Observability at this level isn’t a dashboard, a log viewer, or a packet sniffer. It’s a core engineering discipline that must be treated as a first-class design requirement, just like capacity planning or protocol selection.

And here’s a hard-won truth I’ve learned through years of experience:

You can’t automate or scale what you can’t first observe, measure, and understand.

This mindset shift is foundational. It goes beyond catching issues when they happen. It involves engineering the network in a way that makes issues observable, measurable, and eventually self-resolving.

Let’s walk through how we’ve built this thinking into high-scale production environments.

1. Flow Logs Are Just the Beginning

Flow logs are valuable: they tell you who talked to whom, on which port, and at what time. But at hyperscale, flow logs alone are insufficient.

They must be supplemented and correlated with structured, high-fidelity telemetry from multiple dimensions.

What does that mean in practice?

  • Interface stats and drop counters (across access, aggregation, core, edge, and virtual interfaces).

  • BGP neighbor health, session flaps, route churn velocity, and RIB/FIB discrepancies.

  • Queue depth telemetry, buffer utilization, and ECN/RED counters.

We go further with:

  • Custom metadata tagging: Every flow is annotated with tags like:

    • region=eu-west-1

    • owner=cloud-security

    • traffic_class=gold

    • service=api-gateway

  • Correlation engines that link telemetry from compute, storage, and application layers, providing a multi-layer narrative of what happened.
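
To make the tagging concrete, here is a minimal sketch of flow-record enrichment in Python. The record fields, the prefix-keyed tag registry, and the enrich() helper are illustrative assumptions rather than a specific vendor schema; in production, the tags would typically come from IPAM, a CMDB, or cloud tagging APIs.

```python
import ipaddress
from dataclasses import dataclass, field

@dataclass
class FlowRecord:
    src_ip: str
    dst_ip: str
    dst_port: int
    byte_count: int
    tags: dict = field(default_factory=dict)

# Hypothetical ownership/classification registry, keyed by destination prefix.
TAG_REGISTRY = {
    "10.20.0.0/16": {
        "region": "eu-west-1",
        "owner": "cloud-security",
        "traffic_class": "gold",
        "service": "api-gateway",
    },
}

def enrich(flow: FlowRecord) -> FlowRecord:
    """Annotate a raw flow record with contextual metadata tags."""
    dst = ipaddress.ip_address(flow.dst_ip)
    for prefix, tags in TAG_REGISTRY.items():
        if dst in ipaddress.ip_network(prefix):
            flow.tags.update(tags)
            break
    return flow

print(enrich(FlowRecord("172.31.4.9", "10.20.15.2", 443, 18432)).tags)
# {'region': 'eu-west-1', 'owner': 'cloud-security', 'traffic_class': 'gold', 'service': 'api-gateway'}
```

Once every record carries tags like these, downstream correlation engines can join network flows with compute, storage, and application telemetry on the same keys.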

At this scale, the goal is no longer just to “see the failure.”

The goal is to explain it, localize it, and predict its blast radius.

2. Metrics Without Context Are Just Noise

Collecting metrics is easy. Modern agents and exporters will give you thousands. We've been doing that for ages.

The problem is interpretation. Without domain-aware context, these metrics are just background noise, and it becomes hard to tell what is actually happening. At scale, you drown in them.

Our approach:

  • Golden Signals: You must curate and elevate the metrics that matter:

    • Throughput

    • Packet loss

    • Latency and jitter

    • Retransmissions and flow resets

  • SLIs and SLOs per tenant/segment:

    • Define explicit Service Level Indicators for critical traffic classes (e.g., control plane vs. bulk data).

    • Attach Service Level Objectives to teams or business outcomes, not just links.

  • Intent vs. Reality Checks:

    • Build systems that compare the current network state against the declared design intent.

    • Alert not when "interface down" happens, but when policy violations, unintended paths, or failure to meet routing design goals are detected.
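
As a concrete illustration of the intent-vs-reality idea, here is a minimal sketch that compares a declared design document against an observed-state snapshot and emits policy-level alerts. The schema, device names, and thresholds are hypothetical; the point is that alerts fire on divergence from the design, not on raw events like an interface going down.

```python
# Hypothetical declared intent for one edge router.
DECLARED_INTENT = {
    "edge-rtr-1": {
        "ebgp_peers": {"203.0.113.1", "203.0.113.5"},
        "max_prefixes_advertised": 500,           # routing design goal
        "allowed_next_hop_asns": {64512, 64513},  # policy boundary
    }
}

def check_intent(device: str, observed: dict) -> list[str]:
    """Return intent/policy violations rather than raw 'interface down' events."""
    intent = DECLARED_INTENT[device]
    violations = []
    missing = intent["ebgp_peers"] - observed["ebgp_peers"]
    if missing:
        violations.append(f"{device}: missing eBGP sessions {sorted(missing)}")
    if observed["prefixes_advertised"] > intent["max_prefixes_advertised"]:
        violations.append(f"{device}: advertising more prefixes than designed")
    bad_asns = observed["next_hop_asns"] - intent["allowed_next_hop_asns"]
    if bad_asns:
        violations.append(f"{device}: traffic egressing via unintended ASNs {sorted(bad_asns)}")
    return violations

# Example: one peer is down and an unintended egress path has appeared.
observed_state = {
    "ebgp_peers": {"203.0.113.1"},
    "prefixes_advertised": 612,
    "next_hop_asns": {64512, 65001},
}
for alert in check_intent("edge-rtr-1", observed_state):
    print("INTENT-DRIFT:", alert)
```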

Good observability doesn't just tell you when something fails. It tells you when the network diverges from what it was supposed to be doing.

3. Distributed Tracing for the Network

Application engineers have long used distributed tracing (e.g., OpenTelemetry, AWS X-Ray) to visualize request paths.

We have brought this concept to the network.

What network tracing enables:

  • Full flow tracing: Reconstruct a packet’s journey across:

    • VPCs

    • Transit Gateways

    • NAT gateways

    • Overlay tunnels

    • Cross-region replication pipelines

  • Temporal context: We stitch together telemetry over time to:

    • Rebuild events leading up to failure

    • Visualize silent degradations (e.g., consistent RTT spike in one AZ)

  • Cross-domain correlation:

    • A dropped packet isn’t just a “network issue.”

      It may relate to burst behavior in a distributed database, container migration, or storage IO limits.

With distributed flow tracing, we’re no longer stuck asking “Where did this go wrong?”

Instead, we ask “How long did it take for this flow to cross each segment, and where did it slow down?”
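
Here is a minimal sketch of what that can look like with the OpenTelemetry Python SDK: one parent span per flow and one child span per segment, with per-segment latency attached as attributes. The span names, resource IDs, and RTT values are illustrative; in a real deployment the spans would be reconstructed from collected telemetry rather than emitted inline like this.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for demonstration purposes.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("network.flow.tracer")

# One parent span per flow, one child span per segment the packet crosses.
with tracer.start_as_current_span("flow 10.20.15.2:443") as flow_span:
    flow_span.set_attribute("traffic_class", "gold")
    with tracer.start_as_current_span("segment.vpc") as span:
        span.set_attribute("vpc_id", "vpc-0abc")   # hypothetical IDs
        span.set_attribute("rtt_ms", 0.4)
    with tracer.start_as_current_span("segment.transit_gateway") as span:
        span.set_attribute("tgw_id", "tgw-0def")
        span.set_attribute("rtt_ms", 1.9)
    with tracer.start_as_current_span("segment.nat_gateway") as span:
        span.set_attribute("rtt_ms", 7.3)          # where the flow slowed down
```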

4. Observability Pipelines, Not Just Dashboards

Dashboards are helpful. However, at cloud scale, they are not the end goal.

They're one tiny UI view within a much larger data flow.

We treat observability as a streaming pipeline: an architecture for ingest, enrichment, analysis, and action.

The pipeline:

  1. Ingest:

    • Telemetry (e.g., gRPC, OpenConfig)

    • Flow logs (NetFlow/IPFIX/vFlow)

    • Control plane state (e.g., BGP, IS-IS, OSPF, Label Switching)

  2. Enrich:

    • Add contextual metadata: device role, service, zone, owning team, customer tier

  3. Analyze:

    • Real-time anomaly detection: sudden drop spikes, path churn, health flaps

    • Pattern recognition: known bad flows, flapping routes, hot paths

  4. Act:

    • Trigger auto-remediation playbooks through proper workflow automation pipelines

    • Open structured tickets with enriched context

    • Update SLO dashboards and send intent-drift alerts
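
Below is a minimal sketch of those four stages as a streaming pipeline built from Python generators. The record fields, thresholds, and the remediation hook are hypothetical stand-ins for real collectors, enrichment sources, and workflow-automation systems.

```python
def ingest():
    """Stand-in for a telemetry/flow-log consumer (gRPC, IPFIX, etc.)."""
    yield {"device": "core-sw-3", "interface": "et-0/0/12", "drops_per_s": 4200}
    yield {"device": "edge-rtr-1", "interface": "ae1", "drops_per_s": 3}

def enrich(records):
    """Attach contextual metadata: device role and owning team."""
    roles = {"core-sw-3": ("core", "netops"), "edge-rtr-1": ("edge", "peering")}
    for rec in records:
        rec["role"], rec["owning_team"] = roles.get(rec["device"], ("unknown", "unknown"))
        yield rec

def analyze(records, drop_threshold=1000):
    """Flag anomalies such as sudden drop spikes."""
    for rec in records:
        if rec["drops_per_s"] > drop_threshold:
            rec["anomaly"] = "drop_spike"
            yield rec

def act(anomalies):
    """Trigger a playbook and open a structured, context-rich ticket."""
    for event in anomalies:
        print(f"REMEDIATE: {event['device']} ({event['role']}) -> page {event['owning_team']}")

act(analyze(enrich(ingest())))
```

Each stage is independently replaceable: the ingest generator could be swapped for a Kafka consumer or a gNMI subscription without touching enrichment or analysis.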

Good observability pipelines serve both people and machines.

They empower engineers and fuel the automation systems that make downtime disappear before it impacts users.

5. Security Observability (NetSecOps and Zero Trust Signals)

Why it matters:

As networks shift toward zero-trust architectures and distributed, identity-based access models, traditional perimeter-based monitoring becomes obsolete. Observability must now include security intelligence at every hop and layer.
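
As a purely illustrative sketch of one such signal, assume every flow record carries (or should carry) a workload identity: the check below flags unattributed flows and identity-to-identity pairs that have no explicit allow decision. The field names and the policy source are assumptions, not a specific product's model.

```python
# Hypothetical identity-to-identity allow list (in practice: a policy engine).
ALLOWED = {("svc-payments", "svc-ledger", 5432)}

def zero_trust_signal(flow: dict):
    """Return a security-observability signal for a flow, or None if clean."""
    src_id, dst_id = flow.get("src_identity"), flow.get("dst_identity")
    if not src_id or not dst_id:
        return "unattributed flow (missing workload identity)"
    if (src_id, dst_id, flow["dst_port"]) not in ALLOWED:
        return f"policy bypass: {src_id} -> {dst_id}:{flow['dst_port']}"
    return None

print(zero_trust_signal({"src_identity": "svc-payments",
                         "dst_identity": "svc-ledger", "dst_port": 5432}))  # None
print(zero_trust_signal({"dst_identity": "svc-ledger", "dst_port": 5432}))  # unattributed flow
```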
