Observability is what makes a data platform trustworthy. Job-level monitoring is not enough. You need to track the data journey end-to-end so that when something breaks, you know where in the pipeline and why, not just that a job failed.
Key Takeaways
- 01 Monitor the journey, not just the jobs.
- 02 Freshness, Completeness, and Quality are the core pillars.
- 03 Route alerts automatically to the correct domain owner.
- 04 Incident management requires clear runbooks and re-run paths.
Checklist
- □ Freshness and completeness metrics defined per product.
- □ Alerts configured with clear ownership (Platform vs Domain).
- □ Replay/backfill runbooks linked in the alert description.
- □ SLO dashboard published for data consumers.
What to monitor
- Pipeline Health
Latency, throughput, and error rates of ingestion and transformation jobs.
- Freshness (SLA)
Is the data currently in the warehouse up-to-date according to business needs?
- Completeness
Did we lose any records between the source system and the final Gold product?
- Data Quality
Are the values within expected ranges? (e.g., No null IDs, valid currency codes).
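The freshness and completeness pillars above can be expressed as simple programmatic checks against pipeline metadata. This is a minimal sketch: the 2-hour SLA, the 0.1% tolerance, and the function names are illustrative assumptions, not fixed recommendations.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative thresholds -- real values come from each product's SLA.
FRESHNESS_SLA = timedelta(hours=2)
COMPLETENESS_TOLERANCE = 0.001  # tolerate 0.1% divergence between layers

def check_freshness(last_loaded_at: datetime, now: Optional[datetime] = None) -> bool:
    """Freshness: is the newest record within the agreed SLA window?"""
    now = now or datetime.now(timezone.utc)
    return (now - last_loaded_at) <= FRESHNESS_SLA

def check_completeness(source_rows: int, gold_rows: int) -> bool:
    """Completeness: did records survive from source to the Gold product?"""
    if source_rows == 0:
        return gold_rows == 0
    return abs(source_rows - gold_rows) / source_rows <= COMPLETENESS_TOLERANCE

now = datetime.now(timezone.utc)
print(check_freshness(now - timedelta(minutes=90), now))  # loaded 90 min ago: within SLA
print(check_completeness(1_000_000, 998_000))             # 0.2% loss: breach
```

Running both checks per product on every load turns the abstract pillars into per-product pass/fail signals that alerts and dashboards can consume.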
Data journey monitoring
Rather than monitoring isolated jobs, track the full journey of a data point from ingestion to visualization. With lineage metadata at each layer boundary, a latency spike in the Gold product points directly to which upstream step fell behind.
The Lineage-Aware Alert
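A lineage-aware alert can be sketched as a walk up the dependency graph: when the Gold product breaches its freshness budget, the alert names the deepest upstream step that is itself over budget, rather than just the failing product. All table names, delays, and budgets below are hypothetical.

```python
# Hypothetical lineage metadata: each product maps to its direct upstream
# steps, with observed delay and allowed budget (minutes) per step.
lineage = {
    "gold.orders_daily": ["silver.orders_clean"],
    "silver.orders_clean": ["bronze.orders_raw"],
    "bronze.orders_raw": [],
}
delay_minutes = {"gold.orders_daily": 95, "silver.orders_clean": 90, "bronze.orders_raw": 5}
budget_minutes = {"gold.orders_daily": 30, "silver.orders_clean": 20, "bronze.orders_raw": 15}

def root_cause(node: str):
    """Walk upstream first; the deepest over-budget step is the likely root cause."""
    for upstream in lineage.get(node, []):
        found = root_cause(upstream)
        if found:
            return found
    return node if delay_minutes[node] > budget_minutes[node] else None

alert = f"Freshness breach on gold.orders_daily; root cause: {root_cause('gold.orders_daily')}"
print(alert)
```

Here the Bronze layer is healthy, so the alert points at the Silver transformation that fell behind, which is exactly the "which upstream step" question a job-level alert cannot answer.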
Datadog dashboards
Platform Level
Cloud resource usage, global pipeline failure rates, system-wide latency.
Domain Level
Freshness of specific data products, data quality test results, domain-specific costs.
Executive Level
High-level KPI health, "Data Trust" score, and cost-to-value indicators.
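The executive-level "Data Trust" score could be computed as a weighted pass rate across the three pillars and published as a single gauge. The weights and inputs below are illustrative assumptions, not a standard formula.

```python
def data_trust_score(freshness_ok: float, completeness_ok: float, quality_ok: float) -> float:
    """Each input is the fraction of data products passing that pillar (0..1).

    Returns a 0-100 score; weights are an assumed starting point to tune.
    """
    weights = {"freshness": 0.4, "completeness": 0.3, "quality": 0.3}
    score = (weights["freshness"] * freshness_ok
             + weights["completeness"] * completeness_ok
             + weights["quality"] * quality_ok)
    return round(100 * score, 1)

# 95% of products fresh, 99% complete, 90% passing quality tests.
print(data_trust_score(0.95, 0.99, 0.90))
```

A single number like this can then be emitted as a custom metric to the monitoring tool (e.g., a Datadog gauge) so each dashboard tier reads from the same underlying signals.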
Failure modes
- ! Alert Fatigue: Too many low-priority alerts cause teams to ignore the critical ones.
- ! Hidden Stale Data: The data pipeline job "succeeded," but zero rows were processed (a missing completeness check).
- ! No Ownership: An alert fires for a cross-domain pipeline, and both teams assume the other is fixing it.
- ! Silent Drop: Data is dropped due to a schema mismatch in an ingestion step, but no alert fires because it is not a "system failure."
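The "Hidden Stale Data" and "Silent Drop" modes share a fix: assert on row counts so a job that exits successfully while moving nothing (or losing records) still fails loudly. A minimal sketch, with the guard name and thresholds as assumptions:

```python
class CompletenessError(RuntimeError):
    """Raised when a 'successful' job did not actually move the expected data."""

def assert_rows_moved(rows_read: int, rows_written: int, min_expected: int = 1) -> None:
    # Hidden Stale Data: exit code 0 but nothing (or too little) was written.
    if rows_written < min_expected:
        raise CompletenessError(
            f"Job 'succeeded' but wrote {rows_written} rows (expected >= {min_expected})")
    # Silent Drop: records lost between read and write (e.g., schema mismatch).
    if rows_written < rows_read:
        raise CompletenessError(
            f"Silent drop: read {rows_read} rows but wrote only {rows_written}")

# A schema mismatch that quietly drops 900 records now surfaces as a failure:
try:
    assert_rows_moved(rows_read=10_000, rows_written=9_100)
except CompletenessError as exc:
    print(exc)
```

Raising here converts a data-level defect into a system-level failure, which is precisely what routes it into the existing alerting path instead of letting it pass silently.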