This design builds the platform around one structural decision: decouple producers from consumers through a shared event spine, then layer storage and transformation so each stage has a clear contract with the next. Everything else follows from that.
Key Takeaways
- 01 End-to-end flow from diverse sources to a structured data mesh.
- 02 Unified ingestion patterns: events, CDC, and batch.
- 03 Layered storage (Bronze/Silver/Gold) preserving full lineage with quality gates at each boundary.
- 04 Federated governance and a platform-as-a-product operating model.
Checklist
- □ Validate all end-to-end flows against domain requirements.
- □ Establish cross-cutting concerns (IAM, Encryption, CI/CD).
- □ Define the platform 'paved road' for domain teams.
- □ Document key architectural decisions (ADRs).
Platform overview (end-to-end)
What this diagram shows: The end-to-end data lifecycle from raw ingestion to business-ready data products and final consumption.
Key design points: Decoupled layers, immutable raw storage, and a dedicated data product layer for domain autonomy.
Operational notes: Every transition between layers is an observability checkpoint. Ingestion must be idempotent.
Ingestion patterns
What this diagram shows: The three primary ingestion lanes and their respective handling of validation and failures.
Key design points: Mandatory schema validation, central DLQ for all patterns, and explicit replay paths.
Operational notes: Idempotency is critical at the 'Success' point to allow safe replays from the DLQ.
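The DLQ-and-replay lane can be sketched as follows; `validate`, `process`, and `replay` are hypothetical names, and the schema check is a stand-in for real Avro/Protobuf validation:

```python
from collections import deque

def validate(event: dict) -> bool:
    """Stand-in schema check; a real pipeline validates Avro/Protobuf contracts."""
    return "id" in event and "payload" in event

def process(events, sink: list, dlq: deque) -> None:
    """Route each event: valid ones to the sink, failures to the DLQ, no crash."""
    for event in events:
        (sink.append if validate(event) else dlq.append)(event)

def replay(dlq: deque, sink: list) -> int:
    """Re-drive DLQ entries after a fix; safe only because the sink is idempotent."""
    recovered = 0
    for _ in range(len(dlq)):
        event = dlq.popleft()
        if validate(event):
            sink.append(event)
            recovered += 1
        else:
            dlq.append(event)  # still failing; keep for manual inspection
    return recovered
```

Bounding `replay` by the DLQ's current length prevents a still-broken event from spinning the loop forever.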
Transformation flow
What this diagram shows: The movement of data through quality gates from raw to refined formats.
Key design points: Decoupled transformation steps with automated quality checks before promotion to the next layer.
Operational notes: Backfill paths must be automated and able to process point-in-time snapshots from Bronze.
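A quality gate before promotion can be as simple as a required-field and null-rate check; the sketch below uses hypothetical `passes_gate` and `promote` helpers and an illustrative 1% null threshold:

```python
def passes_gate(rows: list[dict], required: set[str],
                max_null_rate: float = 0.01) -> bool:
    """Allow Bronze -> Silver promotion only if required fields are
    present and their null rate stays under the threshold."""
    if not rows:
        return False
    for field in required:
        nulls = sum(1 for r in rows if r.get(field) is None)
        if nulls / len(rows) > max_null_rate:
            return False
    return True

def promote(rows: list[dict], required: set[str], silver: list) -> bool:
    """Copy the batch into the next layer only when the gate passes."""
    if passes_gate(rows, required):
        silver.extend(rows)
        return True
    return False
```

Because the gate is a pure function of the batch, the same check runs identically in backfills over point-in-time Bronze snapshots.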
Data mesh operating model
What this diagram shows: The relationship between domain teams, platform engineering, and governance.
Key design points: Domains own the products, Platform owns the 'road', and Governance sets the 'rules'.
Operational notes: The platform team should be measured on domain enablement, not on the data itself.
Observability / data journey
What this diagram shows: How monitoring signals translate into actionable incidents for the correct teams.
Key design points: Lineage-aware alerting and clear separation between infrastructure and data logic failures.
Operational notes: MTTR is reduced by automatically routing alerts based on the failure type (e.g., connection vs. schema validation).
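The routing rule can be sketched as a small classifier; the failure types and on-call team names below are illustrative assumptions, not a fixed taxonomy:

```python
def route_alert(failure: dict) -> str:
    """Route by failure class: infrastructure issues go to the platform
    team, data-logic issues go to the owning domain."""
    infra = {"connection_timeout", "broker_unavailable", "disk_full"}
    data = {"schema_validation", "null_rate_exceeded", "freshness_breach"}
    kind = failure.get("type")
    if kind in infra:
        return "platform-oncall"
    if kind in data:
        # Lineage metadata tells us which domain owns the failing product.
        return failure.get("owning_domain", "governance") + "-oncall"
    return "triage"  # unknown failure class: human decides
```

The key property is that the alert carries lineage context (the owning domain), so a schema failure pages the team that can actually fix it.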
Reference architecture narrative
Data enters the platform through three primary channels: real-time application events via Pub/Sub, database state changes via Change Data Capture (CDC), and bulk files from legacy or SaaS providers. All data is immediately persisted in its raw form in the Bronze layer of the data lake.
Transformation pipelines then pick up this raw data, cleaning and conforming it into the Silver layer. Here, data is structured into reusable domain entities. Finally, domain-specific logic aggregates this into the Gold layer, where it is exposed as curated Data Products.
Key architectural decisions
- Immutability: Raw data is never modified; all changes result in new versions or downstream updates.
- Schema-first: Ingestion requires a defined contract (Avro/Protobuf) to prevent downstream breakage.
- Domain Ownership: The teams closest to the business logic own the data product lifecycle.
- Unified Observability: Every pipeline must emit standardized metrics for freshness and quality.
- Stateless Transformation: Pipelines are designed to be re-runnable from raw data at any time.
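The schema-first decision can be illustrated with a toy contract check; the dict-based `ORDER_CONTRACT` below is a hypothetical stand-in for a registered Avro/Protobuf schema:

```python
# Hypothetical contract: field name -> expected type, standing in for an
# Avro/Protobuf schema that must exist before a producer may publish.
ORDER_CONTRACT = {"order_id": int, "customer_id": int, "amount_cents": int}

def conforms(event: dict, contract: dict) -> bool:
    """Reject events with missing or wrongly typed fields at the edge,
    so breakage surfaces at ingestion rather than in Silver/Gold jobs."""
    return all(
        field in event and isinstance(event[field], expected)
        for field, expected in contract.items()
    )
```

In practice the contract lives in a schema registry and compatibility rules govern evolution, but the enforcement point is the same: nothing unvalidated enters the spine.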
Trade-offs and alternatives
Central Lake Team vs. Mesh
Central teams are easier to start with but become a bottleneck. Mesh requires higher organizational maturity but scales better.
Streaming vs. Batch-first
Streaming provides lower latency but higher complexity. This blueprint favors a 'streaming-first' approach for core events, falling back to batch for heavy lookups.
CDC vs. Domain Events
CDC is easier to implement for legacy apps but couples the data platform to internal DB schemas. Domain events are preferred for new services.
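The coupling difference can be made concrete. In the sketch below, the CDC record mirrors internal column names (hypothetical `cust_id`, `stat` abbreviations), while the domain event names the business fact; a translation step at the edge acts as an anti-corruption layer:

```python
from dataclasses import dataclass
from typing import Optional

# A CDC change record mirrors internal table columns; a DB rename
# (e.g. `cust_id` -> `customer_id`) would break downstream consumers.
cdc_record = {
    "op": "u",
    "table": "orders",
    "before": {"id": 42, "cust_id": 7, "stat": "P"},
    "after":  {"id": 42, "cust_id": 7, "stat": "S"},
}

@dataclass(frozen=True)
class OrderShipped:
    """A domain event names the business fact and hides storage details."""
    order_id: int
    customer_id: int

def translate(record: dict) -> Optional[OrderShipped]:
    """Map CDC rows into domain events at the edge, so schema coupling
    is contained in one place instead of leaking into every pipeline."""
    after = record["after"]
    if record["table"] == "orders" and after["stat"] == "S":
        return OrderShipped(order_id=after["id"], customer_id=after["cust_id"])
    return None
```

For legacy sources this translation layer is the pragmatic middle ground: CDC feeds it, but consumers only ever see domain events.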
Failure modes
- ! Data Drift: Sources change schemas without updating the central catalog, breaking downstream Silver/Gold pipelines.
- ! The "Mess" Mesh: Lack of federated standards leads to incompatible data products across different domains.
- ! Poison Pill: A single malformed event causes a streaming pipeline to crash repeatedly (lack of DLQ).
- ! Backfill Explosion: Inefficient re-processing logic causes massive cloud costs when rebuilding historical data.