SYSTEM_CONSOLE v2.4.0

Observability

LAST_UPDATED: 2025-09

Observability is what makes a data platform trustworthy. Job-level monitoring is not enough: you need to track the data journey end-to-end so that when something breaks, you know where in the pipeline it broke and why, not just that a job failed.

Key Takeaways

  • 01 Monitor the journey, not just the jobs.
  • 02 Freshness, Completeness, and Quality are the core pillars.
  • 03 Route alerts automatically to the correct domain owner.
  • 04 Incident management requires clear runbooks and re-run paths.

Checklist

  • Freshness and completeness metrics defined per product.
  • Alerts configured with clear ownership (Platform vs Domain).
  • Replay/backfill runbooks linked in the alert description.
  • SLO dashboard published for data consumers.
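
The first two checklist items can be sketched as a per-product SLO table. This is a minimal illustration; product names, owners, and thresholds are assumptions, not values from this platform.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductSLO:
    """Per-product freshness/completeness targets (illustrative values)."""
    product: str
    owner: str                  # domain team paged when the SLO is breached
    max_staleness_minutes: int  # freshness: newest data must be this recent
    min_completeness: float     # fraction of source rows that must arrive

SLOS = [
    ProductSLO("orders_gold", "commerce-domain", 60, 0.999),
    ProductSLO("clickstream_silver", "web-domain", 15, 0.99),
]

def breached(slo: ProductSLO, staleness_minutes: float, completeness: float) -> list[str]:
    """Return the violated SLO dimensions, for routing the alert to slo.owner."""
    issues = []
    if staleness_minutes > slo.max_staleness_minutes:
        issues.append("freshness")
    if completeness < slo.min_completeness:
        issues.append("completeness")
    return issues
```

Keeping ownership in the SLO record itself is what makes automatic routing possible: the alert payload can carry `slo.owner` directly.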

What to monitor

  • Pipeline Health

    Latency, throughput, and error rates of ingestion and transformation jobs.

  • Freshness (SLA)

    Is the data currently in the warehouse up-to-date according to business needs?

  • Completeness

    Did we lose any records between the source system and the final Gold product?

  • Data Quality

    Are the values within expected ranges? (e.g., No null IDs, valid currency codes).
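
The data-quality bullet above can be made concrete with row-level checks. A minimal sketch, assuming rows arrive as dicts with `id` and `currency` fields; the currency list is an illustrative subset.

```python
VALID_CURRENCIES = {"USD", "EUR", "GBP", "JPY"}  # illustrative subset

def quality_issues(rows: list[dict]) -> list[str]:
    """Checks mirroring the bullets above: no null IDs, valid currency codes."""
    issues = []
    for i, row in enumerate(rows):
        if row.get("id") is None:
            issues.append(f"row {i}: null id")
        if row.get("currency") not in VALID_CURRENCIES:
            issues.append(f"row {i}: invalid currency {row.get('currency')!r}")
    return issues
```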

Data journey monitoring

Rather than monitoring isolated jobs, track the full journey of a data point from ingestion to visualization. With lineage metadata at each layer boundary, a latency spike in the Gold product points directly to which upstream step fell behind.

The Lineage-Aware Alert

[Figure: data journey lineage alert]
Datadog dashboards

Platform Level

Cloud resource usage, global pipeline failure rates, system-wide latency.

Domain Level

Freshness of specific data products, data quality test results, domain-specific costs.

Executive Level

High-level KPI health, "Data Trust" score, and cost-to-value indicators.

GCP mapping
Cloud Logging and Cloud Monitoring for infrastructure, BigQuery INFORMATION_SCHEMA for data metrics, and Datadog for unified visualization and alerting.

Failure modes

  • ! Alert Fatigue: Too many low-priority alerts cause teams to ignore the critical ones.
  • ! Hidden Stale Data: The data pipeline job "succeeded," but zero rows were processed (lack of completeness check).
  • ! No Ownership: An alert fires for a cross-domain pipeline, and both teams assume the other is fixing it.
  • ! Silent Drop: Data is dropped due to a schema mismatch in an ingestion step, but no alert is triggered because it's not a "system failure."
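
The Hidden Stale Data and Silent Drop modes share a fix: compare row counts across the layer boundary instead of trusting the job status. A minimal sketch; the threshold is an illustrative default.

```python
def completeness_alerts(source_count: int, landed_count: int,
                        min_ratio: float = 0.999) -> list[str]:
    """Alert on lost records even when the pipeline job reported success."""
    alerts = []
    if landed_count == 0 and source_count > 0:
        alerts.append("zero rows landed despite successful job run")
    elif source_count and landed_count / source_count < min_ratio:
        alerts.append(
            f"completeness {landed_count / source_count:.4f} below {min_ratio}"
        )
    return alerts
```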