Operations & Cost

LAST_UPDATED: 2025-11

A data platform without cost controls and clear operational ownership will fail in production, regardless of how well the architecture is designed. The operating model here defines which team owns which failure, and the cost guardrails define what happens when a pipeline misbehaves at scale.

Key Takeaways

  • Clear ownership boundaries between Platform and Domain teams.
  • Automated cost allocation via domain-specific tagging.
  • Retention policies to manage storage lifecycle and costs.
  • Operational readiness reviews (Game Days) for reliability.

Checklist

  • Cost allocation tags in place for all cloud resources.
  • Replay and backfill limits defined to prevent cost spikes.
  • Ownership model published and agreed upon by all stakeholders.
  • Storage retention policies configured for Bronze/Silver/Gold.

Operating model

Platform Team

  • Infrastructure uptime (BigQuery, Pub/Sub, GKE).
  • IAM & Security guardrails.
  • CI/CD pipeline templates.
  • Centralized observability tooling.

Domain Team

  • Data product quality and freshness.
  • Transformation logic and bug fixes.
  • Domain-specific cost management.
  • Incident response for data logic failures.
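The split above can be written down as a routing table so that every failure category resolves to exactly one owner. The following is a minimal Python sketch; the category names, the owner_for helper, and the choice of Platform as the fallback owner are illustrative assumptions, not part of the published model.

```python
# Hypothetical routing table derived from the two responsibility lists above.
OWNERSHIP = {
    "infrastructure_uptime": "platform",
    "iam_security": "platform",
    "cicd_templates": "platform",
    "observability_tooling": "platform",
    "data_quality": "domain",
    "data_freshness": "domain",
    "transformation_logic": "domain",
    "domain_cost": "domain",
    "data_logic_incident": "domain",
}

# Assumption: unclassified failures go to one named team for triage,
# so an incident never waits for someone to claim it.
FALLBACK_OWNER = "platform"


def owner_for(category: str) -> str:
    """Return the team that owns a failure category."""
    return OWNERSHIP.get(category, FALLBACK_OWNER)
```

Having an explicit fallback is one way to make sure a failure that fits neither list still lands with a named team.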

Cost controls

Cloud data platforms scale almost without limit, and so do their costs. Implement controls at several layers:

Budgets

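Set soft and hard limits at the GCP project or domain level, and alert at 50%, 80%, and 100% of the monthly budget. Cloud Billing budgets only notify and do not cap spend on their own, so a hard limit needs separate enforcement (for example, quotas).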
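The alerting rule can be sketched in a few lines of Python. This is an illustration of the threshold logic only, not a Cloud Billing integration; the function names are hypothetical and the spend figures would come from whatever billing data is available.

```python
# Alert thresholds from the budget policy: 50%, 80%, and 100% of the monthly budget.
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)


def crossed_thresholds(spend: float, monthly_budget: float) -> list[float]:
    """Return every threshold that the current spend has reached."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    ratio = spend / monthly_budget
    return [t for t in ALERT_THRESHOLDS if ratio >= t]


def new_alerts(previous_spend: float, current_spend: float,
               monthly_budget: float) -> list[float]:
    """Return thresholds crossed since the last check, so each alert fires once."""
    already_sent = set(crossed_thresholds(previous_spend, monthly_budget))
    return [t for t in crossed_thresholds(current_spend, monthly_budget)
            if t not in already_sent]
```

For example, new_alerts(450, 850, 1000) reports the 50% and 80% thresholds together, while a later check at 900 reports nothing new.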

Tagging

Every resource must have a 'domain' and 'environment' tag for automated billing export and chargeback.
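A tagging rule like this is easiest to enforce with an automated check. The sketch below assumes resources are available as a plain mapping of resource name to tag dictionary (for example, from an inventory export); the helper names are hypothetical.

```python
# Tags that every resource must carry, per the tagging policy.
REQUIRED_TAGS = ("domain", "environment")


def missing_tags(resource_tags: dict[str, str]) -> list[str]:
    """Return required tag keys that are absent or blank on one resource."""
    return [key for key in REQUIRED_TAGS
            if not resource_tags.get(key, "").strip()]


def untagged_resources(resources: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each non-compliant resource name to the tags it is missing."""
    report = {}
    for name, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            report[name] = missing
    return report
```

Run against an inventory, the report lists only the resources that would fall out of chargeback, which keeps the remediation list short.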

Retention

Bronze (raw) data is kept for 7 years for compliance. Transient Silver/Gold data is kept for 90 days unless specified otherwise.

Risk
Ungoverned full-table scans on very large BigQuery tables can consume a monthly budget in hours. Partition and cluster all large tables so that queries read only the data they need.
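A rough calculation shows how quickly repeated scans add up. The sketch below assumes scan-based billing; price_per_tib is a hypothetical parameter and the numbers in the usage note are made up for illustration, not actual BigQuery rates. The estimated bytes for a real query would typically come from a dry run.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def scan_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Cost of a single scan under scan-based pricing."""
    return bytes_scanned / TIB * price_per_tib


def hours_to_exhaust(monthly_budget: float, bytes_per_scan: int,
                     scans_per_hour: float, price_per_tib: float) -> float:
    """Hours until repeated scans of the same size consume the whole budget."""
    hourly_cost = scan_cost(bytes_per_scan, price_per_tib) * scans_per_hour
    return monthly_budget / hourly_cost


def within_cap(estimated_bytes: int, max_bytes_per_query: int) -> bool:
    """Guard: allow a query only if its estimated scan stays under the cap."""
    return estimated_bytes <= max_bytes_per_query
```

With an illustrative price of 5.0 per TiB, a 50 TiB full scan run four times an hour exhausts a 10,000 budget in 10 hours, which is why a per-query byte cap is worth enforcing before execution.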

Reliability practices

Game Days

Periodic simulation of failures (e.g., source DB goes offline, schema breaks) to test team response and runbooks.

Operational Readiness

Before a data product goes to 'Production', it must pass a review of its monitoring, runbooks, and cost estimates.
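The review can be modeled as a simple gate over the three items named above. This is a minimal sketch; the ReadinessReview class and its field names are hypothetical, and a real review would record evidence rather than booleans.

```python
from dataclasses import dataclass


@dataclass
class ReadinessReview:
    """Outcome of a production readiness review for one data product."""
    has_monitoring: bool
    has_runbook: bool
    has_cost_estimate: bool

    def blockers(self) -> list[str]:
        """Return the review items that are still missing."""
        checks = {
            "monitoring": self.has_monitoring,
            "runbook": self.has_runbook,
            "cost_estimate": self.has_cost_estimate,
        }
        return [name for name, ok in checks.items() if not ok]

    def approved(self) -> bool:
        """A product may go to production only with no open blockers."""
        return not self.blockers()
```

Returning the list of blockers, rather than a bare pass or fail, tells the Domain team exactly what is left to do.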

GCP mapping
Cloud Billing Reports, Billing Export to BigQuery (for custom dashboards), Cloud Quotas, BigQuery Reservations (editions-based slot capacity, which replaced flat-rate and flex commitments).

Failure modes

  • Cost Explosion: A recursive logic error in a transformation pipeline causes massive compute consumption overnight.
  • Orphaned Resources: Temporary tables or staging files are never deleted, so storage costs creep up slowly.
  • The "Not My Problem" Gap: A failure occurs at the boundary between infrastructure and data logic, and each team waits for the other to act.
  • Retention Failure: Sensitive data is kept longer than legally allowed because of a misconfigured lifecycle policy.
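Of these, orphaned resources are the easiest to catch with a scheduled sweep. The sketch below assumes temporary objects follow a naming convention and that a creation timestamp is available for each; the prefixes, the 7-day age limit, and the find_orphans helper are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

# Assumed naming convention for temporary tables and staging files.
TEMP_PREFIXES = ("tmp_", "staging_")


def find_orphans(resources, now: datetime,
                 max_age: timedelta = timedelta(days=7)) -> list[str]:
    """Return names of temporary resources older than max_age.

    resources is an iterable of (name, created_at) pairs.
    """
    return [
        name
        for name, created_at in resources
        if name.startswith(TEMP_PREFIXES) and now - created_at > max_age
    ]
```

The sweep only reports candidates; deleting them automatically is a separate decision that each Domain team would need to agree to.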