
Event-driven Enterprise Data Platform

A greenfield enterprise data platform: streaming-first, domain-owned, built on GCP. Designed for sub-minute data freshness, federated governance, and GDPR compliance at ingestion.

LAST_UPDATED: 2025-06

This blueprint is derived from production experience building a greenfield enterprise data platform at a large retail organisation. Operational data was locked inside SAP and Oracle and served to analytics teams via brittle overnight ETL jobs; it was 24 to 48 hours stale by the time anyone could query it. The goal was a streaming-first, domain-owned platform delivering sub-minute-fresh data to BI, operational tooling, and AI/ML pipelines, with governance and GDPR compliance built into the ingestion path rather than retrofitted later.

The patterns here are not novel. LinkedIn built Kafka specifically to solve this class of problem at scale and published the architecture in 2011. Netflix, Uber, and Spotify each converged on the same structural decisions independently: immutable event log as the source of truth, domain teams owning their data products, and a thin platform team providing the infrastructure and standards rather than owning the data itself. Zhamak Dehghani formalised the domain-ownership model as Data Mesh in 2019, and since then Zalando, HelloFresh, JPMorgan, and Intuit have published their own implementations following the same principles.

Every decision recorded here reflects a real production trade-off. Where one approach was chosen over another, the reasoning is documented. Where something fails in practice, it is called out. None of this is vendor documentation.

Platform at a glance

Logical architecture: sources to consumers, with failure paths.

——→ primary data flow
··→ failure path to dead letter queue (DLQ)

The actual problem

The specific failures that drove this architecture. Not generic "data silos", but the real operational pain that forced a greenfield rebuild:

  • 01 24–48 hour analytics lag. All data access to SAP and Oracle was via scheduled DB dump exports into a flat file store. No streaming, no CDC. Inventory data was always stale by the time the BI team got it.
  • 02 Two departments reporting different revenue figures. Finance and operations had each built their own ETL logic from the same Oracle source, with different join conditions and fiscal calendar interpretations. Neither was wrong; both were inconsistent.
  • 03 Central data team as a bottleneck. Eight engineers owned all ingestion and transformation logic for 40+ product teams. Every new data source required a ticket, a sprint, and a four-week queue.
  • 04 Silent pipeline failures. ETL jobs failed without alerting. Business users discovered stale data by noticing that yesterday's sales figures were identical to last week's. MTTR was measured in days.
  • 05 GDPR as a fire drill. Subject access requests took 3 days because no one had a reliable map of which database columns contained personal data. Sensitive fields were scattered across 200+ tables with no classification metadata.

Why event-driven

The shift to event-driven ingestion was not a technology preference. It was driven by a specific latency requirement: inventory and order status needed to be available in BI within five minutes of a change in the operational system. Scheduled batch jobs cannot reliably achieve that without over-provisioning, polling at intervals that hammer source databases, and building error-prone watermark logic.

The secondary benefit was forcing explicit domain contracts. When you tell a team "you need to publish an Avro schema and a named event for every business change you want to share," you surface implicit domain knowledge that was previously embedded in SQL join conditions. That process is uncomfortable and slow initially. It pays off when the data mesh scales beyond five domains.

For legacy systems (SAP, Oracle) where publishing application events is impractical, Change Data Capture (CDC) via database log tailing fills the gap. CDC gives you near-real-time row-level changes without modifying application code. The trade-off is that CDC events reflect internal database structure, not business semantics. A transformation step is still needed to produce meaningful domain events from raw CDC rows.
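The transformation step can be sketched in a few lines. This is an illustrative pure-Python example, not the platform's actual code: the Debezium-style envelope fields (`op`, `after`, `ts_ms`) and the `ORDER_STATUS`/`OrderShipped` mapping are invented to show the shape of turning a row-level CDC change into a named business event.

```python
# Hypothetical sketch: mapping a Debezium-style CDC envelope onto a
# business-level domain event. All field names are illustrative.
from typing import Optional


def cdc_to_domain_event(cdc_record: dict) -> Optional[dict]:
    """Translate a raw row-level change into a named business event.

    Returns None for changes with no business meaning (e.g. a
    housekeeping job touching technical columns).
    """
    if cdc_record.get("op") != "u":            # only row updates here
        return None
    after = cdc_record["after"]
    if after.get("ORDER_STATUS") == "SHIPPED":
        return {
            "event_type": "OrderShipped",
            "order_id": after["ORDER_ID"],
            "occurred_at": cdc_record["ts_ms"],
        }
    return None                                 # not a domain event
```

The point of the sketch is the filter: most CDC rows never become domain events, and the mapping logic is where the business semantics missing from the database log get reintroduced.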

GCP mapping
Pub/Sub is the primary message bus. For CDC from Oracle, MySQL, and Postgres, Google Datastream replicates directly into BigQuery or GCS without managing a Debezium cluster. Kafka/Confluent is the documented alternative for multi-cloud requirements or teams with existing Kafka investment.

Why data mesh, and where it is genuinely difficult

A centralised data team cannot sustainably own data quality for 40 independent product domains. The data mesh model was chosen not as an architectural ideal but as the only operationally viable option at this scale: push data product ownership to the teams with the deepest domain knowledge, and give the platform team a different mandate, building the tooling and standards that make self-service safe.

The trade-off is real and should not be minimised. Mesh requires significantly more organisational maturity than a centralised lake. Domain teams need to understand SLOs, schema versioning, and access control models. They will resist it initially. The platform team needs to provide tooling that makes compliance easier than non-compliance, not governance via process documents.

If your organisation has fewer than five data-producing domains or fewer than 15 people on data engineering, start with a centralised lake. Introduce mesh incrementally when the central team becomes the demonstrable bottleneck. Premature mesh is worse than no mesh.

Risk
The most common mesh failure mode: federated ownership without federated standards. Domains produce data products in incompatible formats: different date conventions, different customer ID schemes, different null handling. The result is a "data swamp" with distributed ownership instead of a centralised one. Standards must be enforced programmatically, not by asking teams to read a wiki.
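Programmatic enforcement can be as simple as a validation function run in CI or at publish time. The rules below (a customer ID prefix, ISO 8601 dates, required fields) are invented examples of the kind of cross-domain conventions a platform team might mandate; the mechanism, not the specific rules, is the point.

```python
# Illustrative sketch of enforcing cross-domain standards in code rather
# than in a wiki. The specific rules here are hypothetical examples.
import re
from datetime import datetime

REQUIRED_FIELDS = {"customer_id", "event_date"}


def check_data_product_record(record: dict) -> list:
    """Return a list of standards violations; an empty list means compliant."""
    violations = []
    for field in REQUIRED_FIELDS - record.keys():
        violations.append(f"missing required field: {field}")
    cid = record.get("customer_id", "")
    if cid and not re.fullmatch(r"CUST-\d{8}", cid):
        violations.append("customer_id must match CUST-<8 digits>")
    date = record.get("event_date")
    if date is not None:
        try:
            datetime.strptime(date, "%Y-%m-%d")   # ISO 8601 dates only
        except (TypeError, ValueError):
            violations.append("event_date must be ISO 8601 (YYYY-MM-DD)")
    return violations
```

Wiring a check like this into the data product publishing path makes compliance the default: a product that violates a convention fails to publish, instead of quietly landing in the swamp.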

Core architectural principles

These governed every design decision. They are opinionated by design.

  1. P1
    Bronze is write-once, append-only.

    Raw data is never mutated. Every transformation creates a new artefact in a downstream layer. If something goes wrong in Silver or Gold, you can always reprocess from Bronze. The Bronze layer is also the audit log.

  2. P2
    No schema contract, no ingestion.

    Every event source must register an Avro or Protobuf schema before it can publish to the message bus. The schema registry is a hard dependency, not an optional catalogue. The most common production failure is a producer silently changing a field type and crashing every downstream consumer. The schema registry blocks it.

  3. P3
    Every pipeline must be re-runnable from Bronze.

    Stateless transforms, idempotent writes. If a pipeline cannot be safely replayed from raw data without producing duplicates or incorrect state, it is not production-ready. Non-negotiable for GDPR correction workflows, where you may need to reprocess data after a field is pseudonymised.

  4. P4
    Observability is a deployment gate, not a follow-up task.

    A pipeline without structured telemetry (lag metrics, error rates, record counts at each layer boundary) does not get merged to main. OpenTelemetry instrumentation is part of the pipeline scaffold, not something added later. Silent failures are the most expensive failures.

  5. P5
    GDPR classification happens at ingestion.

    Every field is classified as PII, sensitive, or unrestricted before it lands in the lake. The classification drives automatic access policy and determines whether tokenisation or pseudonymisation is required before Bronze storage. Retrofitting this is orders of magnitude more expensive than building it in.

  6. P6
    Cost is a first-class architectural constraint.

    Streaming at scale is expensive. BigQuery on-demand pricing becomes untenable at 10 TB+ query volumes. Dataflow autoscaling without maximum worker caps generates surprise invoices. Storage lifecycle policies for Bronze/Silver/Gold must be defined in Terraform from day one, not added when the CFO asks questions.
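Principle P3 in particular is easy to state and easy to violate. A minimal sketch of what an idempotent write looks like, with an in-memory dict standing in for a keyed sink (e.g. a MERGE into an analytical table); the event shape is illustrative:

```python
# Minimal sketch of P3: an idempotent write keyed on event ID, so
# replaying the same Bronze data never double-counts. The dict stands
# in for a keyed sink such as a MERGE into an analytical table.


def apply_events(sink: dict, events: list) -> dict:
    """Upsert events into the sink keyed by event_id.

    Replaying the same batch is a no-op, so the pipeline is safe
    to re-run from Bronze.
    """
    for event in events:
        sink[event["event_id"]] = event   # last-write-wins per key
    return sink


events = [{"event_id": "e1", "amount": 10}, {"event_id": "e2", "amount": 5}]
sink = apply_events({}, events)
sink = apply_events(sink, events)         # replay: no duplicates
assert sum(e["amount"] for e in sink.values()) == 15
```

The contrast is an append-only insert without a key: replaying it doubles every record, which is exactly the failure mode described under "Replay without idempotency" below.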

Technology decisions

GCP is the reference implementation. Every decision has a documented alternative for teams not on GCP or with existing investments in other tooling.

  • Message bus
    Primary: GCP Pub/Sub. Managed, global, scales without ops overhead. Native Dataflow integration. Ordering semantics via ordered topics.
    Alternative: Kafka / Confluent. Preferred for multi-cloud or existing Kafka investment. Richer replay semantics via consumer groups.
  • Stream processing
    Primary: Google Dataflow (Apache Beam). Unified batch and streaming model. Autoscaling. Exactly-once processing guarantees. No cluster management.
    Alternative: Apache Flink for sub-100ms latency. Apache Spark (Dataproc) for heavy batch transformations with existing Spark code.
  • Analytical store
    Primary: BigQuery. Serverless column store. Native streaming inserts. Direct Looker integration. Slot reservations for predictable cost.
    Alternative: Snowflake or Databricks Delta Lake. Both viable, with stronger dbt ecosystem support.
  • Raw storage
    Primary: Cloud Storage (GCS). Cheap, durable, tiered lifecycle. Parquet/Avro on GCS is the Bronze layer. Nearline/Coldline for aged data.
    Alternative: AWS S3 or Azure ADLS. Identical pattern, different SDK.
  • CDC
    Primary: Google Datastream. Managed CDC from Oracle, MySQL, and Postgres. Streams directly to BigQuery or GCS. No Debezium cluster to operate.
    Alternative: Debezium + Kafka Connect. More flexible, more ops overhead. Better for non-GCP estates or when Kafka is already present.
  • Batch orchestration
    Primary: Cloud Composer (Airflow). Managed Airflow with native GCP operators. DAG-based dependency tracking. Large ecosystem of existing operators.
    Alternative: Prefect or Dagster for Python-native teams who want stronger type safety and better local testing ergonomics.
  • Observability
    Primary: OpenTelemetry + Datadog. OTel for vendor-neutral instrumentation in pipeline code; Datadog for unified dashboards, APM, and alerting. GCP Cloud Monitoring alone is insufficient for cross-service data journey tracking.
    Alternative: Grafana + Prometheus.
  • Infrastructure
    Primary: Terraform. All GCP resources managed as code. Module library for reusable ingestion pipeline scaffolding.
    Alternative: Pulumi for teams preferring TypeScript/Python over HCL.

Scope

What this covers

  • Streaming, CDC, and batch ingestion patterns
  • Data lake architecture (Bronze / Silver / Gold)
  • Transformation (Dataflow, dbt, Spark trade-offs)
  • Data mesh operating model and data products
  • Observability and data journey monitoring
  • BI and semantic layer (Looker / LookML)
  • Security, IAM, and GDPR governance
  • Cost controls and operational model
  • Phased implementation roadmap

What this does not cover

  • ML model training or feature store architecture
  • Real-time OLTP API design
  • Vendor selection framework or bake-off methodology
  • Full legal / regulatory compliance deep-dive
  • BI report design or dashboard UX
  • Application architecture for event-producing services

Lambda vs Kappa: an honest comparison

Lambda architecture runs separate batch and streaming pipelines in parallel, merging results at query time. Kappa architecture runs everything as a stream, using reprocessing of the event log to replace batch jobs. Both have a role in this blueprint, and neither is always the right answer.

Lambda: when to apply it

Large historical backfills and complex aggregations over months of data. A Dataflow streaming job computing a 90-day rolling average every five minutes is extremely expensive compared to a scheduled BigQuery SQL job running once per hour. For high-latency-tolerant aggregations, batch wins on cost.

Use when: cost matters more than latency.

Kappa: when to apply it

All real-time ingestion and most Silver-layer transformations. A single Dataflow pipeline handles both live events and historical replay from Pub/Sub snapshots or GCS exports. Simpler to operate: one pipeline, one codebase, one failure mode to reason about.

Use when: latency requirements are under five minutes.
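The operational appeal of Kappa is that live processing and historical replay share one codebase. A pure-Python sketch of that idea, with plain iterables standing in for Pub/Sub delivery and a GCS export read (the `gross`/`net` enrichment is an invented example transform):

```python
# Sketch of the Kappa property: one transformation, fed either by the
# live subscription or by a historical replay, so there is a single
# codebase and a single failure mode to reason about. Sources are plain
# iterables standing in for Pub/Sub delivery and a GCS export read.
from typing import Iterable, Iterator


def enrich(events: Iterable) -> Iterator:
    """The one Silver-layer transform, identical for live and replay."""
    for event in events:
        yield {**event, "net": round(event["gross"] * 0.8, 2)}


live_stream = [{"order_id": 1, "gross": 100.0}]   # arriving via Pub/Sub
historical = [{"order_id": 0, "gross": 50.0}]     # replayed from GCS

assert list(enrich(live_stream)) == [{"order_id": 1, "gross": 100.0, "net": 80.0}]
assert list(enrich(historical))[0]["net"] == 40.0
```

In the Lambda alternative, the batch and streaming paths would each implement `enrich` separately, and keeping the two implementations in agreement becomes its own maintenance burden.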

Key takeaways

  • 01 Event-driven ingestion is justified by latency requirements, not architectural preference. Start with CDC for legacy systems; graduate to domain events for new services as teams build schema discipline.
  • 02 Data mesh scales domain ownership but multiplies governance overhead. The platform team's job shifts from building pipelines to building tooling that makes domain teams self-sufficient without making inconsistent choices.
  • 03 Bronze immutability and pipeline idempotency are the safety net for everything else: backfills, GDPR corrections, incident recovery. Compromise them and you lose the ability to recover cleanly from any production failure.
  • 04 Cost management is an architecture problem, not a finance problem. Streaming at scale with uncapped Dataflow workers and on-demand BigQuery billing will produce surprise invoices within the first month. Build the guardrails before you need them.

Failure modes

  • !
    Premature mesh. Distributing ownership before the platform tooling is mature enough to make compliance easy. Teams produce incompatible data products. Standards exist only in a Confluence document nobody reads. The result is a distributed mess that is harder to fix than a centralised one.
  • !
    Schema drift at source. A producer deploys a backend change that renames a field. The schema registry blocks the event at ingestion. The DLQ fills. The downstream Gold product goes stale. No one notices for four hours because there are no freshness monitors. Root cause: schema validation without downstream freshness alerting.
  • !
    Replay without idempotency. An ops team triggers a backfill to recover from a three-hour outage. The pipeline does not deduplicate on event ID. Every order record is double-counted in the Gold layer. Revenue reports for the day are wrong. Fixing it requires a full reprocess of the affected partition.
  • !
    Cost explosion from unbounded streaming. A Dataflow job autoscales to 200 workers during a spike because no maximum was set. The spike is a bug: a downstream consumer is publishing 10,000 events per second in a retry loop. The bill for that evening is larger than the monthly budget. Fix: max worker caps, DLQ monitoring, cost budget alerts at 50% and 80% thresholds.
  • !
    PII landing in the wrong layer. A developer adds a customer_email field to an existing event schema without updating the classification metadata. The field lands unmasked in a Gold table that has broad read access. The field exists in production for six weeks before a GDPR audit catches it.

Checklist: before you start building

  • Identified a pilot domain with a clear data latency problem and an engaged domain team.
  • GCP project structure and IAM hierarchy defined in Terraform before any data pipeline is deployed.
  • Schema registry provisioned and schema review process defined, including who approves breaking changes.
  • GDPR data classification taxonomy agreed with legal before the first schema is registered.
  • BigQuery slot reservation strategy decided. On-demand is acceptable for initial build; switch to reservations when monthly query volume exceeds 10 TB.
  • OpenTelemetry instrumentation scaffold built into the pipeline template, not added to individual pipelines later.
  • Dead letter queue (DLQ) and replay runbook defined for each ingestion pattern before going live.
  • Cloud Billing budget alerts configured at 50%, 80%, and 100% of monthly spend target per GCP project.