Observability is what makes a data platform trustworthy. Job-level monitoring is not enough. You need to track the data journey end-to-end so that when something breaks, you know where in the pipeline and why, not just that a job failed.
Key Takeaways
- 01 Monitor the journey, not just the jobs.
- 02 Freshness, Completeness, and Quality are the core pillars.
- 03 Route alerts automatically to the correct domain owner.
- 04 Incident management requires clear runbooks and re-run paths.
Checklist
- □ Freshness and completeness metrics defined per product.
- □ Alerts configured with clear ownership (Platform vs Domain).
- □ Replay/backfill runbooks linked in the alert description.
- □ SLO dashboard published for data consumers.
What to monitor
- Pipeline Health
Latency, throughput, and error rates of ingestion and transformation jobs.
- Freshness (SLA)
Is the data currently in the warehouse up-to-date according to business needs?
- Completeness
Did we lose any records between the source system and the final Gold product?
- Data Quality
Are the values within expected ranges? (e.g., No null IDs, valid currency codes).
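The freshness and completeness pillars above can be expressed as simple programmatic checks against pipeline metadata. This is a minimal sketch: the 2-hour SLA, the 0.1% tolerance, and the function names are illustrative assumptions, not fixed recommendations.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative thresholds -- real values come from each product's SLA.
FRESHNESS_SLA = timedelta(hours=2)
COMPLETENESS_TOLERANCE = 0.001  # tolerate 0.1% divergence between layers

def check_freshness(last_loaded_at: datetime, now: Optional[datetime] = None) -> bool:
    """Freshness: is the newest record within the agreed SLA window?"""
    now = now or datetime.now(timezone.utc)
    return (now - last_loaded_at) <= FRESHNESS_SLA

def check_completeness(source_rows: int, gold_rows: int) -> bool:
    """Completeness: did records survive from source to the Gold product?"""
    if source_rows == 0:
        return gold_rows == 0
    return abs(source_rows - gold_rows) / source_rows <= COMPLETENESS_TOLERANCE

now = datetime.now(timezone.utc)
print(check_freshness(now - timedelta(minutes=90), now))  # loaded 90 min ago: within SLA
print(check_completeness(1_000_000, 998_000))             # 0.2% loss: breach
```

Running both checks per product on every load turns the abstract pillars into per-product pass/fail signals that alerts and dashboards can consume.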
Data journey monitoring
Rather than monitoring isolated jobs, track the full journey of a data point from ingestion to visualization. With lineage metadata at each layer boundary, a latency spike in the Gold product points directly to which upstream step fell behind.
The Lineage-Aware Alert
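A lineage-aware alert can be sketched as a walk up the dependency graph: when the Gold product breaches its freshness budget, the alert names the deepest upstream step that is itself over budget, rather than just the failing product. All table names, delays, and budgets below are hypothetical.

```python
# Hypothetical lineage metadata: each product maps to its direct upstream
# steps, with observed delay and allowed budget (minutes) per step.
lineage = {
    "gold.orders_daily": ["silver.orders_clean"],
    "silver.orders_clean": ["bronze.orders_raw"],
    "bronze.orders_raw": [],
}
delay_minutes = {"gold.orders_daily": 95, "silver.orders_clean": 90, "bronze.orders_raw": 5}
budget_minutes = {"gold.orders_daily": 30, "silver.orders_clean": 20, "bronze.orders_raw": 15}

def root_cause(node: str):
    """Walk upstream first; the deepest over-budget step is the likely root cause."""
    for upstream in lineage.get(node, []):
        found = root_cause(upstream)
        if found:
            return found
    return node if delay_minutes[node] > budget_minutes[node] else None

alert = f"Freshness breach on gold.orders_daily; root cause: {root_cause('gold.orders_daily')}"
print(alert)
```

Here the Bronze layer is healthy, so the alert points at the Silver transformation that fell behind, which is exactly the "which upstream step" question a job-level alert cannot answer.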
Datadog dashboards
Platform Level
Cloud resource usage, global pipeline failure rates, system-wide latency.
Domain Level
Freshness of specific data products, data quality test results, domain-specific costs.
Executive Level
High-level KPI health, "Data Trust" score, and cost-to-value indicators.
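The executive-level "Data Trust" score could be computed as a weighted pass rate across the three pillars and published as a single gauge. The weights and inputs below are illustrative assumptions, not a standard formula.

```python
def data_trust_score(freshness_ok: float, completeness_ok: float, quality_ok: float) -> float:
    """Each input is the fraction of data products passing that pillar (0..1).

    Returns a 0-100 score; weights are an assumed starting point to tune.
    """
    weights = {"freshness": 0.4, "completeness": 0.3, "quality": 0.3}
    score = (weights["freshness"] * freshness_ok
             + weights["completeness"] * completeness_ok
             + weights["quality"] * quality_ok)
    return round(100 * score, 1)

# 95% of products fresh, 99% complete, 90% passing quality tests.
print(data_trust_score(0.95, 0.99, 0.90))
```

A single number like this can then be emitted as a custom metric to the monitoring tool (e.g., a Datadog gauge) so each dashboard tier reads from the same underlying signals.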
Failure modes
- ! Alert Fatigue: Too many low-priority alerts cause teams to ignore the critical ones.
- ! Hidden Stale Data: The data pipeline job "succeeded," but zero rows were processed (a missing completeness check).
- ! No Ownership: An alert fires for a cross-domain pipeline, and both teams assume the other is fixing it.
- ! Silent Drop: Data is dropped due to a schema mismatch in an ingestion step, but no alert fires because it is not a "system failure."
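The "Hidden Stale Data" and "Silent Drop" modes share a fix: assert on row counts so a job that exits successfully while moving nothing (or losing records) still fails loudly. A minimal sketch, with the guard name and thresholds as assumptions:

```python
class CompletenessError(RuntimeError):
    """Raised when a 'successful' job did not actually move the expected data."""

def assert_rows_moved(rows_read: int, rows_written: int, min_expected: int = 1) -> None:
    # Hidden Stale Data: exit code 0 but nothing (or too little) was written.
    if rows_written < min_expected:
        raise CompletenessError(
            f"Job 'succeeded' but wrote {rows_written} rows (expected >= {min_expected})")
    # Silent Drop: records lost between read and write (e.g., schema mismatch).
    if rows_written < rows_read:
        raise CompletenessError(
            f"Silent drop: read {rows_read} rows but wrote only {rows_written}")

# A schema mismatch that quietly drops 900 records now surfaces as a failure:
try:
    assert_rows_moved(rows_read=10_000, rows_written=9_100)
except CompletenessError as exc:
    print(exc)
```

Raising here converts a data-level defect into a system-level failure, which is precisely what routes it into the existing alerting path instead of letting it pass silently.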