Data Architecture · March 5, 2026 · 10 min read

Data Mesh at Scale: How We Govern 14 Petabytes Without a Central Data Team

Three years ago, we dismantled our 47-person central data engineering team and bet the company on domain ownership. Here is what happened, what broke, and why we would absolutely do it again.

Marcus Chen

Chief Technology Officer, Novastraxis

In early 2023, Novastraxis had a data problem that was invisible from the outside but slowly strangling us internally. We had a central data engineering team of 47 people responsible for ingesting, transforming, and serving data for the entire company. On paper, it looked efficient — a single team with a unified data platform, shared tooling, and centralized governance. In practice, it was a bottleneck that was costing us millions in delayed decisions and missed opportunities.

The data team had a backlog of 340 open requests. Average turnaround for a new data pipeline was 14 weeks. Product teams that needed analytics for a launch would either wait — sometimes missing their market window — or build shadow pipelines that nobody documented, tested, or maintained. Our data quality was declining because the central team was spread too thin to implement proper validation. And our best data engineers were burning out from context-switching between eight different business domains they could not possibly understand deeply enough to build great data products.

Something had to change. After evaluating several organizational models, we committed to data mesh — a decentralized approach where domain teams own their data end-to-end, supported by a self-serve platform and governed by federated computational policies. Three years later, we manage 14 petabytes of data across 23 domain teams with no central data engineering function. This is the story of how we got here, the decisions we made along the way, and the hard lessons we learned.

The Problem with Centralized Data Teams

Centralized data teams are the default organizational pattern for data engineering, and for good reason — they consolidate expertise, reduce tooling sprawl, and provide a single point of accountability for data quality. At small to medium scale, they work well. But as organizations grow, the centralized model develops three fatal flaws.

First, the knowledge bottleneck. A central data engineer serving the security analytics domain needs to understand threat models, detection logic, SIEM integrations, and compliance requirements. The same person might also serve the customer success domain, which requires understanding subscription models, churn indicators, NPS methodology, and CRM data structures. No one person can hold deep context across all these domains simultaneously. The result is pipelines that are technically correct but semantically wrong — they move data from point A to point B without understanding what the data means or whether it is fit for purpose.

Second, the prioritization problem. When every team funnels requests through a single data team, every request competes for the same pool of engineering capacity. The security team needs a new threat intelligence pipeline. The product team needs real-time feature usage analytics. The finance team needs updated revenue attribution models. Who wins? Usually whoever shouts loudest or has the most senior executive sponsor. This is not resource allocation — it is organizational politics masquerading as engineering planning.

Third, the ownership vacuum. When a central team builds a pipeline for a domain team, who owns the data quality? The domain team says it is the data team's responsibility — they built it, after all. The data team says they built what was specified and the domain team should validate the outputs. The result is data that nobody truly owns, and data without clear ownership inevitably degrades.

Domain Ownership: How We Restructured

We divided our data landscape into 23 domains aligned with business capabilities, not organizational hierarchy. The security analytics domain owns all data products related to threat detection and incident response. The customer domain owns customer profiles, interaction histories, and churn predictions. The platform telemetry domain owns infrastructure metrics, logs, and performance data.

Each domain has a designated data product owner — a senior engineer within the domain team who is accountable for the quality, availability, and discoverability of that domain's data products. This is not an additional role bolted onto their existing job. We explicitly allocated 40-60% of their time to data product ownership, and we adjusted their performance evaluations to reflect this responsibility. Without that explicit allocation, data ownership becomes an unfunded mandate that nobody prioritizes.

The 47 engineers from the former central data team were redistributed to domain teams based on their deepest domain expertise and preferences. Twenty-three became embedded data engineers within domain teams. Fourteen moved to the platform team to build the self-serve data infrastructure (more on that below). Six transitioned to a small federated governance team. Four left the company — not because they disagreed with the change, but because their skills were most valuable in centralized environments and they preferred that model. We supported their transitions and maintained good relationships.

The transition period was genuinely difficult. For the first four months, data pipeline reliability actually decreased as domain teams ramped up on tooling and workflows that were previously handled by specialists. We saw a 23% increase in pipeline failures during months two and three. But by month six, pipeline reliability had recovered to pre-transition levels, and by month nine, it surpassed them — because domain engineers understood their data deeply enough to build more robust validation and error handling.

Self-Serve Data Infrastructure: The Platform That Makes Mesh Work

Data mesh without a self-serve platform is just decentralized chaos. If you distribute ownership without providing the tooling that makes data engineering accessible to non-specialists, you will end up with 23 teams building 23 different data platforms, each with different standards, different toolchains, and different definitions of data quality. That is worse than the centralized model, not better.

Our self-serve data infrastructure — which we now offer as a product through our Data Mesh Engine — provides domain teams with everything they need to build, deploy, and operate data products without deep platform expertise. The platform handles ingestion connectors for over 200 data sources, from databases and APIs to event streams and file stores. It provides a declarative pipeline framework where teams define transformations in SQL or Python and the platform handles orchestration, scaling, and failure recovery. It includes a data catalog that automatically indexes every data product with its schema, lineage, quality metrics, and ownership information.

The critical design decision was the abstraction level. Too low, and domain engineers spend all their time wrestling with infrastructure. Too high, and power users feel constrained. We landed on what we call the contract layer — domain teams define what their data products look like (schema, quality guarantees, freshness SLAs) and the platform figures out how to deliver them. If a team wants to override the default implementation for performance or cost reasons, they can drop down to a lower abstraction level, but they accept responsibility for the operational complexity that comes with it. In practice, about 80% of data products are built using the high-level contract layer, and 20% use custom implementations for performance-critical use cases.

Federated Computational Governance

Governance is the word that makes data mesh skeptics nervous, and honestly, it should. Without strong governance, data mesh becomes a collection of disconnected data silos with inconsistent quality and no interoperability. The trick is implementing governance that is strict enough to ensure consistency but lightweight enough that it does not recreate the bottleneck you just eliminated.

Our answer is computational governance — governance policies that are codified as automated checks rather than manual review processes. When a domain team publishes a new data product, it must pass a battery of automated validations before it becomes discoverable in the catalog. These validations check schema compliance with our global type system (every field must map to a registered semantic type), documentation completeness (every data product must have a description, owner, SLA, and example queries), quality gate compliance (freshness, completeness, and accuracy metrics must meet minimum thresholds), privacy classification (every field must be tagged with a data sensitivity level), and lineage declaration (upstream dependencies must be explicitly declared).

These checks run automatically in our CI/CD pipeline. There is no governance review board. There is no approval committee. If your data product passes the automated checks, it ships. If it does not, you fix it and try again. The governance team's role is not to review data products — it is to maintain and evolve the automated validation rules. They write code, not approval emails.
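As an illustration of what "governance as code" means in practice, here is a simplified validator in the spirit of those CI checks. The rule names, required fields, and sensitivity levels are invented for this example, not our actual rule set:

```python
# Sketch of computational governance checks: a product ships only if
# validate_product returns no violations. Rule details are illustrative.
REQUIRED_DOCS = {"description", "owner", "sla", "example_queries"}
SENSITIVITY_LEVELS = {"public", "internal", "confidential", "restricted"}

def validate_product(product: dict) -> list:
    """Return a list of governance violations; an empty list means the product ships."""
    errors = []
    missing = REQUIRED_DOCS - product.get("docs", {}).keys()
    if missing:
        errors.append(f"missing documentation: {sorted(missing)}")
    for name, field in product.get("fields", {}).items():
        if field.get("sensitivity") not in SENSITIVITY_LEVELS:
            errors.append(f"field {name!r} lacks a valid sensitivity tag")
        if "semantic_type" not in field:
            errors.append(f"field {name!r} not mapped to a registered semantic type")
    if not product.get("upstream_declared", False):
        errors.append("upstream dependencies not declared")
    return errors
```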

The federated part means that domain teams participate in governance decisions. Each domain has a representative on the governance council that meets monthly to propose, discuss, and vote on changes to the global governance rules. This prevents the governance team from becoming a de facto central authority that dictates standards without understanding domain-specific requirements. Any rule that receives three or more domain objections is sent back for revision.

Data Contracts and SLAs: The Glue That Holds It Together

Data contracts are the single most important operational mechanism in our data mesh. A data contract is a formal agreement between a data product producer and its consumers that specifies exactly what the consumer can expect: the schema (field names, types, and semantics), the freshness SLA (how recent the data will be), the completeness guarantee (acceptable rates of null or missing values), the availability target (uptime commitment), and breaking change policy (how much advance notice consumers receive before schema changes).
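To show the shape of such a contract, here is an illustrative example covering those elements. The product, field names, and values are invented for this post, not a real Novastraxis contract:

```yaml
# Illustrative data contract (fields and values are hypothetical)
product: customer.churn_predictions
version: 2.1.0
owner: customer-domain-team
schema:
  fields:
    - name: customer_id
      type: string
      semantic_type: customer_identifier
      sensitivity: internal
    - name: churn_probability
      type: float
      semantic_type: probability_score
      sensitivity: internal
freshness_sla: 6h          # data no older than six hours
completeness:
  max_null_rate: 0.01      # at most 1% missing values per field
availability: 99.9%
breaking_change_policy:
  deprecation_window_days: 30
```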

Data contracts are defined in a declarative YAML format and versioned alongside the data product code. When a producer wants to make a breaking change — dropping a field, changing a type, altering semantics — the contract system automatically identifies all consumers who depend on the affected elements and generates a migration timeline. Breaking changes require a 30-day deprecation window during which both the old and new schemas are supported simultaneously. This is not optional. The platform enforces it.
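The consumer-impact logic behind that enforcement can be sketched in a few lines. This is a simplified stand-in for the contract system, assuming schemas are flattened to field-name-to-type maps and that we know which fields each consumer reads:

```python
# Sketch of breaking-change detection between two contract versions.
def breaking_changes(old_schema: dict, new_schema: dict) -> set:
    """Return field names that were dropped or changed type."""
    broken = set()
    for name, ftype in old_schema.items():
        if name not in new_schema or new_schema[name] != ftype:
            broken.add(name)
    return broken

def affected_consumers(broken_fields: set, consumers: dict) -> set:
    """consumers maps consumer name -> set of fields it reads.
    Return the consumers that depend on a broken field."""
    return {name for name, fields in consumers.items() if fields & broken_fields}
```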

SLA enforcement is equally automated. Every data product has continuous monitoring that tracks freshness, completeness, and accuracy against its contract commitments. When a product violates its SLA, the system generates an incident automatically, routes it to the owning domain team, and tracks time-to-resolution. We report SLA compliance metrics monthly to the engineering leadership team, and persistent violations become a performance conversation with the domain's engineering director.

Getting data contracts right took us longer than anything else in the data mesh journey. Our first attempt was too permissive — contracts were optional and nobody wrote them. Our second attempt was too rigid — contracts required specifying so many details that teams spent more time writing contracts than building data products. The current version strikes a balance: mandatory fields cover the critical interoperability concerns, and optional fields allow teams to communicate additional guarantees when their consumers need them.

Observability for Data Quality

You cannot govern what you cannot observe. Data quality in a mesh architecture requires the same rigor as service reliability in a microservices architecture — comprehensive monitoring, alerting, and incident management. We built our data observability layer on three pillars.

Statistical Profiling

Every data product is continuously profiled to track distributions, cardinality, null rates, and pattern frequencies. When a metric deviates more than three standard deviations from its historical baseline, an anomaly alert fires. This catches data quality issues that schema validation alone cannot detect — like a pipeline that is technically producing valid records but the distribution of values has shifted dramatically due to an upstream bug. We process over 2.3 billion quality checks daily across our 14 petabytes of managed data.
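The three-standard-deviation rule is simple enough to show directly. A minimal sketch, assuming the baseline is a window of historical values for a single metric such as a null rate:

```python
# Sketch of the three-sigma anomaly rule: flag a metric value that
# deviates more than n_sigma standard deviations from its baseline.
import statistics

def is_anomalous(history: list, current: float, n_sigma: float = 3.0) -> bool:
    """Compare the current value against the mean/stdev of its history."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean  # any deviation from a constant baseline
    return abs(current - mean) > n_sigma * stdev
```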

Lineage-Aware Impact Analysis

When a quality issue is detected in a data product, the observability layer traces the impact through the lineage graph to identify all downstream consumers affected. If a raw event stream develops a freshness issue, the system automatically determines which derived data products, dashboards, and ML models depend on it and notifies their owners. This transforms data quality incidents from mysterious downstream failures into traceable, root-caused issues.
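Under the hood this is a graph traversal. A minimal sketch, assuming the lineage graph is stored as a mapping from each product to its direct downstream consumers:

```python
# Sketch of lineage-aware impact analysis: breadth-first traversal of
# the downstream dependency graph from the product with the issue.
from collections import deque

def downstream_impact(lineage: dict, source: str) -> set:
    """Return every product transitively downstream of `source`."""
    affected, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, ()):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected
```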

SLA Dashboards and Scorecards

Every domain team has a real-time dashboard showing their data products' SLA compliance. We also publish a monthly scorecard that ranks domains by data quality and reliability. This creates healthy competition and organizational visibility into data health. Domains that consistently score above 99.5% SLA compliance are recognized in our engineering all-hands, and those that fall below 95% receive targeted support from the platform team.

Lessons Learned After Three Years

Three years into our data mesh journey, here are the lessons I would share with any CTO considering this path. These are not theoretical observations — they are hard-won insights from operating data mesh at scale across a Fortune 500 enterprise.

The platform is more important than the organizational model

You can have perfect domain boundaries and ideal ownership structures, but without a self-serve platform that makes data engineering accessible, domain teams will drown. We spent 60% of our initial investment on platform tooling and 40% on organizational change. In hindsight, that ratio was about right. Organizations that flip it — spending primarily on reorgs and consulting while underinvesting in platform — struggle to make mesh work operationally.

Data contracts are non-negotiable

Without formal data contracts, domain ownership becomes domain isolation. Contracts are the mechanism that turns independent data products into an interoperable data ecosystem. We tried operating without mandatory contracts for the first three months and the result was chaos — breaking changes without notice, undefined SLAs, and consumers who had no idea what guarantees they could rely on.

Plan for the transition dip

Data quality and pipeline reliability will get worse before they get better. This is not a failure — it is the natural consequence of redistributing expertise. Prepare your leadership team for a 4-6 month transition period where metrics decline. Set expectations proactively, define what acceptable degradation looks like, and have rollback criteria documented (but do not actually roll back unless things go truly sideways).

Invest in data literacy across the engineering org

When every domain team is responsible for its own data products, every engineer needs a baseline understanding of data engineering principles. We developed an internal data engineering certification program — a 40-hour curriculum covering pipeline design, data modeling, quality assurance, and our platform toolchain. Eighty percent of our product engineers have completed it, and the investment has paid for itself many times over in reduced support tickets and higher-quality data products.

Do not underestimate the cultural shift

Data mesh is at least as much an organizational transformation as a technical one. Engineers who have always viewed data as someone else's problem need to internalize that they are now data product owners. This requires changing incentive structures, updating job descriptions, and making data product quality a first-class evaluation criterion. We adjusted our engineering ladder to include data stewardship expectations at every level from senior engineer upward.

The Results: Data Mesh by the Numbers

Three years in, the measurable outcomes have exceeded our projections. Here is where we stand compared to the centralized model we replaced.

14 weeks → 9 days: average time from data request to production pipeline

340 → 12: open data pipeline requests in backlog

94.2% → 99.7%: average data product SLA compliance

23 domains: independently operating data domains across the org

847 products: published data products in the federated catalog

4.2x: increase in cross-domain data product consumption

Would We Do It Again?

Absolutely, without hesitation. Data mesh is not perfect. There are days when I miss the simplicity of telling one team to build one pipeline. There are challenges with cross-domain data products that require coordination between multiple teams. There are ongoing debates about where to draw domain boundaries as the business evolves.

But the fundamental premise — that the people who understand the data best should own it — has proven correct at every scale we have tested. Our domain teams build better data products because they understand the business context. They respond faster because they do not have to compete for shared resources. And they take ownership because the quality of their data products is directly reflected in their team's metrics and their personal evaluations.

If you are considering data mesh for your organization, my advice is straightforward: invest in the platform first, set clear governance guardrails before you decentralize, establish data contracts as a foundational requirement, and prepare your organization for a transition period that will feel like a step backward before it becomes a leap forward. The destination is worth the journey.
