Engineering · February 25, 2026 · 9 min read

Observability Is Not Monitoring: Lessons from Running 48 Global Regions

At our scale — 14 billion daily events, 48 regions, 500TB of logs per day — the difference between monitoring and observability is the difference between knowing something broke and understanding why. Here is what we learned building an observability platform that actually works.

Kenji Tanaka

VP of Platform Engineering, Novastraxis

Five years ago, when Novastraxis operated in 12 regions with a few hundred services, monitoring was sufficient. We had dashboards. We had alerts. We had a playbook for every scenario we had seen before. And then our platform grew to 48 regions, 2,400+ customers, and 14 billion daily events — and monitoring stopped being enough.

The problem with monitoring is that it answers questions you have already thought to ask. Is CPU above 80%? Is the error rate above 1%? Is the p99 latency above 200ms? These are useful questions. But in a distributed system with thousands of services, the failure modes you have not anticipated vastly outnumber the ones you have. Monitoring tells you that something is wrong. Observability gives you the tools to understand why — even when you are encountering a failure mode for the first time.

This article is a practitioner's account of how we built our observability platform at Novastraxis. I am going to share the architectural decisions, the trade-offs, the mistakes, and the hard-won lessons from operating at a scale where the naive approaches to metrics, traces, and logs simply do not work. If you are running distributed systems at any scale, the patterns here should be applicable to your environment.

The Difference Between Monitoring and Observability (and Why It Matters)

Monitoring is a subset of observability, not a synonym for it. Monitoring is the practice of collecting, aggregating, and alerting on predefined metrics and conditions. It works well for known failure modes — the scenarios you have encountered before and can write threshold-based rules for. If your database connections exceed the pool maximum, an alert fires. If disk utilization crosses 90%, a page goes out.

Observability is the property of a system that allows you to understand its internal state from its external outputs. A system is observable when you can ask arbitrary questions about its behavior and get meaningful answers without deploying new instrumentation. The difference is profound: monitoring requires you to predict what will go wrong and build checks in advance. Observability allows you to investigate problems you did not anticipate.

In our first year at scale, we had over 40 incidents where our monitoring dashboards showed green while customers experienced degraded service. The metrics we were tracking — CPU, memory, error rates — were all within normal ranges. But the actual problem was a subtle interaction between three services in a specific region that caused a cascading retry storm during a particular traffic pattern. No predefined metric captured that scenario. Only when we could trace individual requests across all three services and correlate their timing with the customer-reported symptoms could we identify and resolve the root cause. That experience taught us that monitoring alone was structurally inadequate for our operational needs.

Three Pillars Aren't Enough — You Need Correlation

The conventional wisdom is that observability rests on three pillars: metrics, traces, and logs. This is technically correct but practically insufficient. Having all three signal types is necessary, but the real power comes from correlating across them — being able to jump from a metric anomaly to the specific traces that generated it, and from those traces to the specific log entries within each span.

Without correlation, your three pillars are three separate tools that each provide a partial view. An engineer investigating an incident has to manually cross-reference timestamps between their metrics dashboard, their trace viewer, and their log search. At our scale, this manual correlation adds 15-20 minutes to every investigation — time that compounds across hundreds of incidents per month.

Our observability platform implements what we call unified signal correlation. Every metric emission includes the trace ID and span ID of the operation that generated it. Every log entry includes the same identifiers. This means an engineer can click on a metric data point and immediately see the individual traces that contributed to that metric value, then drill into any trace to see the log entries for each span. The investigation path is metric to trace to log, and the transitions are seamless.
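To make the mechanics concrete, here is a minimal Python sketch of what stamping a shared trace context onto every signal looks like. The structures and field names are illustrative, not our production schema:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class SpanContext:
    """Identifiers for the operation currently executing. Generated
    locally here for illustration; in a real system they would come
    from the tracing library's active span."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

def emit_metric(ctx: SpanContext, name: str, value: float) -> dict:
    # Every metric point carries the trace/span that produced it, so a
    # dashboard can link a data point back to the concrete traces behind it.
    return {"type": "metric", "name": name, "value": value,
            "ts": time.time(), "trace_id": ctx.trace_id, "span_id": ctx.span_id}

def emit_log(ctx: SpanContext, level: str, message: str) -> dict:
    # Log entries carry the same identifiers, closing the
    # metric -> trace -> log investigation path.
    return {"type": "log", "level": level, "message": message,
            "ts": time.time(), "trace_id": ctx.trace_id, "span_id": ctx.span_id}

ctx = SpanContext()
metric = emit_metric(ctx, "checkout.latency_ms", 187.0)
log = emit_log(ctx, "INFO", "checkout completed")
assert metric["trace_id"] == log["trace_id"]  # correlated by construction
```

Because both signals share identifiers by construction, the "click a metric point, see its traces" jump is a lookup, not a timestamp-matching exercise.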

We also add a fourth signal type that the three-pillar model misses: events. System events — deployments, configuration changes, scaling operations, certificate rotations, security incidents — provide essential context for understanding why system behavior changed at a particular point in time. When a latency spike correlates with a deployment event, the root cause is often immediately obvious. Without events in the correlation model, engineers waste time investigating symptoms that a deployment timeline would have explained in seconds.

How We Instrument 14 Billion Daily Events Without Performance Overhead

When your instrumentation processes 14 billion events per day, the instrumentation itself becomes a performance concern. If each event adds even 10 microseconds of overhead, you are spending 39 hours of compute time per day on observability alone. At our scale, instrumentation overhead must be effectively zero — meaning it must be below the noise floor of normal performance variation.

We achieve this through three architectural decisions. First, all instrumentation is asynchronous and non-blocking. Metric emissions, trace span completions, and log entries are written to lock-free ring buffers that are consumed by background threads. The application thread never blocks on an observability operation. If the ring buffer fills because the background consumer is slow, entries are dropped rather than causing backpressure — we prefer losing some observability data over impacting application performance.
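Here is a toy illustration of that drop-rather-than-block contract, using a bounded queue in place of a real lock-free ring buffer:

```python
import queue

class DropOnFullEmitter:
    """Sketch of the drop-rather-than-block policy described above.
    A real implementation would use a lock-free ring buffer; a bounded
    queue stands in here to show the behavioral contract: the
    application thread never blocks, and overflow drops data."""
    def __init__(self, capacity: int = 4):
        self._buf = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, entry) -> bool:
        try:
            self._buf.put_nowait(entry)   # never blocks the caller
            return True
        except queue.Full:
            self.dropped += 1             # prefer losing some telemetry
            return False                  # over adding backpressure

    def drain(self) -> list:
        # Background consumer: pull everything currently buffered.
        out = []
        while True:
            try:
                out.append(self._buf.get_nowait())
            except queue.Empty:
                return out

emitter = DropOnFullEmitter(capacity=2)
results = [emitter.emit(i) for i in range(3)]  # third emit overflows
assert results == [True, True, False] and emitter.dropped == 1
assert emitter.drain() == [0, 1]
```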

Second, we use compile-time instrumentation wherever possible. Our service framework generates tracing instrumentation at build time based on service interface definitions, eliminating the runtime cost of dynamic instrumentation. The generated code is optimized for each specific service's call patterns, avoiding the overhead of generic instrumentation libraries that must handle arbitrary use cases.

Third, sampling decisions are made at the edge — at the earliest point in the request lifecycle — and propagated through the entire call chain. This means downstream services know immediately whether a request is being sampled, allowing them to skip the overhead of detailed instrumentation for non-sampled requests. The measured overhead of our instrumentation is 0.3% CPU per service instance for sampled requests and effectively unmeasurable for non-sampled requests.
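The edge decision can be sketched in a few lines of Python. The header name below is hypothetical; real systems typically carry this as the sampled flag of a W3C traceparent header:

```python
import random

SAMPLE_RATE = 0.01  # 1% of normal traffic, the rate quoted later in this article

def edge_sampling_decision(rate: float = SAMPLE_RATE) -> dict:
    """Made once, at the edge, when the request first enters the system."""
    sampled = random.random() < rate
    # Propagated downstream so every service can skip detailed
    # instrumentation for non-sampled requests.
    return {"x-sampling-decision": "1" if sampled else "0"}

def downstream_should_instrument(headers: dict) -> bool:
    # Downstream services never re-decide; they honor the edge decision.
    return headers.get("x-sampling-decision") == "1"

headers = edge_sampling_decision(rate=1.0)   # force-sample for illustration
assert downstream_should_instrument(headers) is True
```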

Trace-Driven Development: Using Traces as the Primary Debugging Tool

Two years ago, when an engineer at Novastraxis needed to debug a production issue, they would start with logs. Search for error messages. Correlate timestamps. Reconstruct the request path manually. It worked, but it was slow — our average investigation time was 23 minutes per incident.

We made a deliberate cultural shift to trace-driven development: traces are now the primary debugging artifact, not logs. When an issue occurs, engineers start by finding the relevant trace — either from a metric anomaly, an error alert, or a customer-reported request ID. The trace shows the complete request path across all services, with timing for each span, error annotations, and direct links to the log entries within each span. From a single trace, an engineer can see exactly where latency accumulated, which service returned an error, and what the downstream impact was.

The impact on investigation time was dramatic. Our average time to identify root cause dropped from 23 minutes to 7 minutes after adopting trace-driven debugging as the default workflow. For cross-service issues — where the symptom appears in one service but the cause is in another — the improvement was even more pronounced, dropping from an average of 45 minutes to 12 minutes. Traces make cross-service causality visible in a way that log-based investigation simply cannot.

What made trace-driven debugging work for us:

  • 100% trace coverage for all inter-service communication — no gaps in the trace graph
  • Rich span annotations including database queries, cache hit/miss ratios, and feature flag states
  • Direct links from trace spans to the corresponding log entries, eliminating timestamp-based log correlation
  • Customer-facing request IDs that map directly to trace IDs, enabling support-to-engineering handoff in seconds

Log Aggregation at 500TB/Day: Architectural Decisions

Our platform generates approximately 500 terabytes of log data per day. At that volume, every architectural decision has massive cost and performance implications. The naive approach — ship everything to a centralized store and query it on demand — would cost roughly $2.3 million per month in storage alone and deliver query performance measured in minutes, not seconds.

Our log architecture uses a tiered storage model with aggressive early-stage processing. Logs are first processed by edge aggregators running in each region, which perform structured extraction, enrichment (adding trace IDs, service metadata, and region tags), and initial filtering. Logs matching real-time alert rules are forwarded immediately to the alert processing pipeline. All logs are then written to a regional hot storage tier (7-day retention) for fast queries, and asynchronously replicated to a global cold storage tier (365-day retention) for compliance and long-term analysis.

The single most impactful cost optimization was moving from full-text indexing to columnar storage with selective indexing. Instead of indexing every field in every log entry, we index only the fields that engineers actually query — trace ID, service name, log level, error code, and customer ID. All other fields are stored in columnar format and available for ad-hoc queries, but without the storage overhead of maintaining a full-text index. This reduced our hot storage costs by 62% while maintaining sub-second query performance for the most common investigation patterns.

Log sampling is another critical lever. Not every log entry needs to be retained at full fidelity. Debug and info-level logs for healthy services are sampled at 10% by default, while error and warning logs are always retained at 100%. When a service enters an unhealthy state — detected by our SLO monitoring — sampling automatically increases to 100% for all log levels, ensuring complete visibility during incidents. This adaptive sampling reduces our log volume by approximately 40% during normal operations without sacrificing visibility when it matters most.
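The adaptive policy fits in a few lines. The rates below are the ones quoted above; the function signature is illustrative, not our actual pipeline API:

```python
import random

# Baseline rates from this section: debug/info sampled at 10% for
# healthy services, warnings and errors always retained.
BASE_RATES = {"DEBUG": 0.10, "INFO": 0.10, "WARNING": 1.0, "ERROR": 1.0}

def should_retain(level: str, service_healthy: bool, rng=random.random) -> bool:
    # When SLO monitoring marks a service unhealthy, sampling rises to
    # 100% for every level so incidents are fully visible.
    if not service_healthy:
        return True
    return rng() < BASE_RATES.get(level, 1.0)

assert should_retain("ERROR", service_healthy=True)
assert should_retain("DEBUG", service_healthy=False)
```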

SLO-Based Alerting vs Threshold-Based Alerting

Threshold-based alerting is the default approach for most organizations, and it has a fundamental flaw: the thresholds are arbitrary. When you set a CPU alert at 80%, what does 80% actually mean for your users? Nothing — it is a number you picked because it seemed reasonable. The actual relationship between CPU utilization and user experience is nonlinear and varies by service. Some services run happily at 95% CPU. Others start degrading at 40% because they are latency-sensitive and context switching kills their performance.

We replaced threshold-based alerting with SLO-based alerting three years ago, and the impact on alert quality was transformative. Instead of alerting on resource metrics, we alert on error budget burn rate. Each service has a defined Service Level Objective — for example, 99.95% of requests complete successfully in under 200ms over a 30-day window. The error budget is the allowed failure rate: 0.05% of requests, or roughly 21.6 minutes of total downtime per month.

Alerts fire when the error budget burn rate indicates that the service will exhaust its error budget before the end of the window. A fast burn — consuming error budget at 14x the sustainable rate — triggers an immediate page. A slow burn — consuming at 3x the sustainable rate — triggers a ticket for investigation during business hours. This approach automatically adjusts alert sensitivity based on the severity of the impact: a brief spike in errors that consumes 0.1% of the error budget does not page anyone, while a sustained degradation that threatens the SLO triggers escalation.
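Here is the arithmetic behind those thresholds as a runnable sketch, using the SLO figures from this section:

```python
SLO = 0.9995              # 99.95% success over a 30-day window
ERROR_BUDGET = 1 - SLO    # 0.05% of requests may fail

# The budget equals roughly 21.6 minutes of downtime per month:
assert round(ERROR_BUDGET * 30 * 24 * 60, 1) == 21.6

def burn_rate(errors: int, total: int) -> float:
    """How fast the error budget is being consumed, relative to the rate
    that would exactly exhaust it at the end of the window."""
    observed_error_rate = errors / total
    return observed_error_rate / ERROR_BUDGET

def alert_action(rate: float) -> str:
    # Thresholds from this section: 14x pages immediately, 3x files a ticket.
    if rate >= 14:
        return "page"
    if rate >= 3:
        return "ticket"
    return "none"

# A 1% error rate against a 0.05% budget is a 20x burn: page.
assert alert_action(burn_rate(errors=100, total=10_000)) == "page"
# A 0.1% error rate is a 2x burn: within tolerance, nobody is woken up.
assert alert_action(burn_rate(errors=10, total=10_000)) == "none"
```

In practice each burn rate is evaluated over its own lookback window (short for fast burn, long for slow burn), which this sketch omits for brevity.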

The result: our alert volume dropped by 73% after migrating to SLO-based alerting, while our incident detection rate actually improved. We were paging less and catching more real problems. The false positive rate dropped from 34% to under 5%, which had a measurable impact on on-call quality of life and engineer retention. Alert fatigue is not just an operational problem — it is a people problem, and SLO-based alerting addresses both.

The Cost of Observability: Our Approach to Sampling and Retention

Observability at scale is expensive. Without careful cost management, observability costs can easily reach 15-25% of total infrastructure spend. Our target is to keep observability costs below 8% of total infrastructure cost, and we have consistently hit that target through a combination of intelligent sampling, tiered retention, and aggressive data lifecycle management.

Our sampling strategy is based on a simple principle: retain 100% of the data you are likely to need and sample everything else. In practice, this means 100% retention for all error traces, all traces exceeding latency thresholds, all traces for specific high-value customers, and all traces during active incidents. For normal traffic from healthy services, we sample traces at 1% — enough to compute accurate aggregate statistics and detect emerging patterns, but far less expensive than full retention.

The key insight is that sampling decisions must be made at the head of the trace, not the tail. Head-based sampling decides whether to sample a trace before the request is processed, ensuring that all spans in a sampled trace are retained. Tail-based sampling waits until the request completes and decides based on the outcome (error, high latency, etc.), which provides better signal but requires temporarily buffering all trace data. We use a hybrid approach: head-based sampling for volume control with tail-based promotion for traces that turn out to be interesting. A trace that was not head-sampled but encounters an error is promoted to full retention after completion.
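A minimal sketch of the hybrid decision, assuming the 1% head rate and the 200ms latency target mentioned elsewhere in this article:

```python
import random

HEAD_RATE = 0.01   # 1% head-based sampling for healthy traffic
SLOW_MS = 200      # latency threshold; matches the SLO target above

def head_decision(rng=random.random) -> bool:
    """Made before the request is processed; propagated downstream."""
    return rng() < HEAD_RATE

def final_retention(head_sampled: bool, outcome: dict) -> bool:
    """Tail-based promotion: a trace that was not head-sampled is still
    retained when it turns out to be interesting (error or slow)."""
    if head_sampled:
        return True
    return bool(outcome.get("error")) or outcome.get("latency_ms", 0) > SLOW_MS

# Not head-sampled, but it errored: promoted to full retention.
assert final_retention(False, {"error": True, "latency_ms": 45})
# Not head-sampled, healthy and fast: discarded after completion.
assert not final_retention(False, {"error": False, "latency_ms": 45})
```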

Our retention tiers:

  • Hot tier (7 days): Full-fidelity metrics, sampled traces and logs, sub-second query performance for active investigation
  • Warm tier (30 days): Downsampled metrics (1-minute granularity), error and high-value traces, filtered logs
  • Cold tier (365 days): Hourly metric aggregates, incident-correlated traces only, compliance-required log entries
  • Archive tier (7 years): Daily metric summaries and audit logs only, for regulatory compliance requirements
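For illustration, the same tiers expressed as a small policy table; the field names and lookup function are hypothetical, not our actual schema:

```python
# Retention tiers from the list above, as a policy table. Days are
# approximate (7 years taken as 2555 days).
RETENTION_TIERS = [
    {"name": "hot",     "days": 7,    "metrics": "full",        "granularity": "raw"},
    {"name": "warm",    "days": 30,   "metrics": "downsampled", "granularity": "1m"},
    {"name": "cold",    "days": 365,  "metrics": "aggregate",   "granularity": "1h"},
    {"name": "archive", "days": 2555, "metrics": "summary",     "granularity": "1d"},
]

def tier_for_age(age_days: int):
    """Return the first tier whose window still covers data this old."""
    for tier in RETENTION_TIERS:
        if age_days <= tier["days"]:
            return tier["name"]
    return None  # past all retention windows

assert tier_for_age(3) == "hot"
assert tier_for_age(90) == "cold"
```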

What We Got Wrong (and How We Fixed It)

Building observability at this scale involved significant mistakes. Here are the three most expensive ones and what we learned from them.

Mistake 1: Treating observability as an infrastructure concern, not a product concern

For the first two years, our observability platform was owned by the infrastructure team. They built excellent infrastructure — reliable pipelines, fast storage, good tooling. But the instrumentation quality across services was inconsistent because application teams treated observability as someone else's problem. We fixed this by making observability a product requirement: every service must meet defined instrumentation standards as a condition of production readiness. Observability is not infrastructure. It is a product capability that every team owns for their service.

Mistake 2: Over-indexing on collection and under-investing in query experience

We built a pipeline that could ingest and store massive volumes of telemetry data. But the query experience was painful — engineers had to write complex queries in a custom query language to extract useful information. As a result, only the most experienced engineers could effectively use the observability platform. The fix was investing heavily in pre-built investigation workflows: templated queries, automated correlation, and guided investigation paths that walk an engineer from symptom to root cause without requiring query expertise. Usage increased 4x after we made the query experience accessible.

Mistake 3: Not accounting for the observability pipeline as a critical dependency

In 2024, we had an incident where our observability pipeline itself experienced a partial outage — during a customer-facing incident that we were actively investigating. The engineers trying to debug the customer issue lost visibility into the system at the worst possible moment. We now treat our observability infrastructure with the same reliability requirements as our customer-facing services: redundant ingestion paths, independent monitoring of the monitoring system, and graceful degradation that preserves core investigation capability even when parts of the pipeline are impaired.

The Ongoing Journey

Observability is not a project you complete. It is a capability you continuously improve as your system grows and evolves. Every new service, every new deployment pattern, every new failure mode teaches you something about what you need to observe and how. The platform we operate today is unrecognizable compared to what we had three years ago, and I fully expect it to be unrecognizable again three years from now.

The most important lesson from our journey is that observability is a cultural practice, not just a technical capability. The best observability tooling in the world is useless if engineers do not use it, and engineers will not use it if it is difficult, slow, or disconnected from their workflow. Invest as much in the user experience of your observability platform as you do in its data pipeline. Make investigation fast. Make correlation automatic. Make the path from alert to root cause as short as possible. That is what turns observability from a cost center into a competitive advantage.

For technical details on the infrastructure that powers our observability platform, see our Core Infrastructure page. For our approach to correlating security events with operational telemetry, read our cloud-native security architecture deep dive. And for platform-level observability capabilities available to our customers, explore the observability platform documentation.
