A data platform without cost controls and clear operational ownership will fail in production, regardless of how well the architecture is designed. The operating model here defines which team owns which failure, and the cost guardrails define what happens when a pipeline misbehaves at scale.
Key Takeaways
- 01 Clear ownership boundaries between Platform and Domain teams.
- 02 Automated cost allocation via domain-specific tagging.
- 03 Retention policies to manage storage lifecycle and costs.
- 04 Operational readiness reviews (Game Days) for reliability.
Checklist
- □ Cost allocation tags in place for all cloud resources.
- □ Replay and backfill limits defined to prevent cost spikes.
- □ Ownership model published and agreed upon by all stakeholders.
- □ Storage retention policies configured for Bronze/Silver/Gold.
Operating model
Platform Team
- Infrastructure uptime (BigQuery, Pub/Sub, GKE).
- IAM & Security guardrails.
- CI/CD pipeline templates.
- Centralized observability tooling.
Domain Team
- Data product quality and freshness.
- Transformation logic and bug fixes.
- Domain-specific cost management.
- Incident response for data logic failures.
Cost controls
Cloud data platforms can scale infinitely, and so can their costs. Implement multi-layered controls:
Set hard and soft limits at the GCP Project / Domain level. Alerting at 50%, 80%, and 100% of monthly budget.
Every resource must have a 'domain' and 'environment' tag for automated billing export and chargeback.
Bronze data (raw) kept for 7 years (compliance). Silver/Gold transient data kept for 90 days unless specified otherwise.
Reliability practices
Game Days
Periodic simulation of failures (e.g., source DB goes offline, schema breaks) to test team response and runbooks.
Operational Readiness
Before a data product goes to 'Production', it must pass a review of its monitoring, runbooks, and cost estimates.
Failure modes
- ! Cost Explosion: A recursive logic error in a transformation pipeline causes massive compute consumption overnight.
- ! Orphaned Resources: Temporary tables or staging files are never deleted, leading to slowly creeping storage costs.
- ! The "Not My Problem" Gap: A failure occurs in the intersection between infra and logic, and both teams wait for the other to act.
- ! Retention Failure: Sensitive data is kept longer than legally allowed due to a misconfigured lifecycle policy.