Transformation is where raw events become business-usable data. The design uses a layered ELT model as the default: load first into Bronze unchanged, clean and conform in Silver, apply business logic in Gold. Streaming pipelines are the exception: they apply ETL before landing when PII masking or schema validation cannot wait. Each layer transition is a quality gate backed by automated tests, not a copy operation.
Key Takeaways
- 01 ELT by default for analytical workloads. ETL only where transformation must happen before Bronze storage.
- 02 Use dbt for Bronze-to-Silver and Silver-to-Gold SQL transforms. Configure incremental models with a watermark lookback from day one.
- 03 Validate at three points: schema check at ingestion, business rule assertions mid-transform, aggregate reconciliation after each layer promotion.
- 04 Never silently drop bad records. Route to a quarantine table with full error context.
- 05 Every pipeline must be replayable from Bronze. If it cannot replay cleanly, it is not production-ready.
Checklist
- □ dbt project structure defined: sources, staging, intermediate, marts.
- □ All Silver/Gold models use incremental strategy with unique_key and deduplication.
- □ dbt tests defined: not_null, unique, accepted_values, and at least one custom test per domain.
- □ Dataflow pipelines route failures to quarantine, not /dev/null.
- □ Backfill procedure documented and tested against a historical Bronze partition.
- □ Late data SLA defined per domain: how late is too late, what happens to it.
- □ Every Silver/Gold table has a catalog entry with field descriptions and owner.
- □ Breaking changes follow the 3-month deprecation policy with downstream notification.
Transformation layer overview
How streaming and batch paths converge at BigQuery, with quality gates routing failures to quarantine.
ELT vs ETL
The choice between ETL and ELT is not just a sequencing preference. It determines where compute lives, who owns the transformation logic, and what breaks when the architecture scales.
Recommendation: ELT as the default for analytical workloads. ETL for the streaming pre-landing path only.
ELT: load raw, transform in BigQuery
Compute lives in BigQuery. Transformation logic is SQL owned by domain teams via dbt. Raw data in Bronze is preserved permanently, so you can re-run any transform from the source of truth at any time. Schema changes in a dbt model do not require reingesting data. Iterating on business logic is a dbt model change and a CI run, not a pipeline redeployment.
- High-volume analytical workloads on BigQuery or Snowflake.
- When raw data must be preserved for future use cases or GDPR replay.
- When domain teams own transformation logic because SQL lowers the barrier.
- Rapid iteration on business rules without reingestion.
ETL: transform before landing
Compute lives in Dataflow. Transformation logic is Java or Python owned by the platform team. Required when raw data cannot touch storage, primarily for PII masking, tokenisation before Bronze, and schema validation that must reject records before they land. At scale, ETL logic in streaming pipelines is harder to iterate on and harder to test than SQL.
- PII fields that must be masked or tokenised before Bronze storage.
- Schema validation that must reject records before they reach the lake.
- Complex stream-to-stream joins with sub-minute latency requirements.
- Pre-processing for low-latency operational APIs where landing first adds unacceptable lag.
The hybrid reality
Most enterprise platforms use both. Streaming pipelines (Dataflow) apply ETL inline: parse Avro, mask PII, route failures to quarantine, write validated records to BigQuery Bronze. From Bronze onward, everything is ELT: dbt handles Silver and Gold with SQL, BigQuery handles the compute. The boundary is Bronze. ETL sits before it; ELT sits after it.
Layer model
Raw / Immutable
Direct copy of source data with no business logic applied. Write-once, append-only. Retains full historical context for reprocessing. Bronze is the audit log and the recovery source for every downstream layer. Never modify it in place.
Cleaned / Conformed
Standardised formats (dates, currencies, enums), deduplicated, null-handled, and joined into domain-centric entities. Schema is stable and documented. Silver tables are reusable across multiple Gold products. Owned by domain teams via dbt models.
Business-ready
Highly aggregated datasets shaped for specific data products, BI dashboards, or ML feature sets. Business logic applied here, not in the BI tool. Gold tables have SLOs, owners, and change management. They are data products, not ad-hoc query results.
Tool selection
Choose the right tool for each layer. The two most common over-engineering mistakes are using Dataflow for batch SQL transforms and using dbt for streaming.
| Tool | Workload type | Latency | Operational complexity | GCP-native | Use when |
|---|---|---|---|---|---|
| dbt | Batch SQL transforms on BigQuery | Minutes | Low. SQL + Git. No infrastructure to manage. | Yes | Bronze → Silver → Gold analytical transforms. CDC materialisation. Slowly changing dimensions. |
| Dataflow / Beam | Streaming ingestion and real-time enrichment | Sub-minute | Medium. Managed, but pipeline code requires Beam expertise. | Yes | Pub/Sub → Bronze streaming path. Pre-landing PII masking. Exactly-once writes. Stream-to-stream joins with windowing. |
| Spark | Large-scale batch, ML feature engineering | Minutes to hours | High on Dataproc. Lower on GKE Spark Operator with containerised jobs. | Partial | Existing Spark codebase. ML feature pipelines requiring PySpark/MLlib. Heavy joins across datasets too large for BigQuery slot budgets. |
| BigQuery SQL | Batch analytical transforms, ad-hoc aggregations | Seconds to minutes | Very low. Serverless. No infrastructure. | Yes | Scheduled aggregations with predictable cost. MERGE-based incremental loads. Materialized views for pre-aggregated serving. When dbt is overkill. |
dbt
Use dbt as the primary transformation layer on top of BigQuery. dbt turns SQL SELECT statements into version-controlled, testable, documented data models. It owns the Bronze-to-Silver and Silver-to-Gold paths. Dataflow handles the streaming pre-landing path; dbt handles everything after Bronze.
Project structure
Structure dbt projects in four layers. Keep each layer strictly separated: staging models only reference sources, intermediate models only reference staging, marts reference intermediate. Cross-layer references create hidden dependencies that break when the referenced model changes.
sources
Declarations of raw Bronze tables in BigQuery. No logic, just schema, freshness tests, and description. Define once, reference everywhere.
staging
One model per source table. Light cleaning only: rename columns, cast types, apply basic filters. No joins. Materialized as views, which cost nothing and stay fresh.
intermediate
Business logic joins and enrichment. Reusable across multiple marts. Materialized as tables or incremental models depending on volume.
marts
Gold-layer aggregations shaped for a specific data product. Each mart is owned by a domain team. Materialized as incremental tables. SLOs live here.
Incremental models
Configure every Silver and Gold model as incremental from day one. Full refresh is safe at small scale and catastrophically expensive at large scale. Do not wait until a model starts timing out to add incremental logic. Retrofitting it onto a model with downstream dependencies is significantly harder than building it in. Use unique_key with incremental_strategy = 'merge' to handle late-arriving updates and CDC corrections. Always include a lookback window wider than your expected late data window.
-- models/silver/orders.sql
{{
config(
materialized = 'incremental',
unique_key = 'order_id',
incremental_strategy = 'merge',
partition_by = {
'field': 'placed_date',
'data_type': 'date',
'granularity': 'day'
},
cluster_by = ['customer_id']
)
}}
with source as (
select
order_id,
customer_id,
amount_gbp,
placed_at,
date(placed_at) as placed_date,
_ingested_at
from {{ source('bronze', 'orders') }}
{% if is_incremental() %}
-- 3-hour lookback covers late-arriving events without a full-table scan.
-- Adjust the window based on your domain's SLA for late data.
where _ingested_at >= (
select max(_ingested_at) - interval 3 hour
from {{ this }}
)
{% endif %}
),
-- Deduplicate within the batch: keep the most recently ingested version per order.
-- This handles re-delivered events and out-of-order CDC rows.
deduplicated as (
select *,
row_number() over (
partition by order_id
order by _ingested_at desc
) as rn
from source
)
select
order_id,
customer_id,
amount_gbp,
placed_at,
placed_date,
_ingested_at
from deduplicated
where rn = 1
dbt tests as the quality gate
dbt tests are the primary data quality gate for batch transforms. Configure them at the schema level so they run automatically after every model build. Failing tests should block promotion to Gold and page the domain owner, not just log a warning.
Built-in tests
Apply to every Silver model: not_null on all required fields, unique on the primary key, accepted_values on enum fields, relationships for foreign key integrity.
Custom tests
Add at least one domain-specific test per mart: record count within expected bounds, revenue totals reconciling within 0.1% of source, no negative quantities. Generic schema tests are necessary but not sufficient.
Slowly changing dimensions (SCD Type 2)
Use dbt snapshots for dimensions that change over time and where history matters: customer address, product price tier, organisational hierarchy. dbt snapshots apply the SCD Type 2 pattern automatically: add dbt_valid_from and dbt_valid_to columns, keep the full history, and expose a current-record view. Configure the snapshot strategy as check on the columns that define a change, not timestamp, since the timestamp strategy breaks when source systems backfill historical records.
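The check-strategy mechanics can be sketched in plain Python. This is illustrative only: dbt generates equivalent MERGE SQL for you, and the function name, field names, and the `CHECK_COLS` list here are assumptions for the sketch, not dbt internals.

```python
from datetime import datetime

# Columns that define a change for this dimension (assumption for the sketch).
CHECK_COLS = ["address", "price_tier"]

def snapshot_check(current, incoming, now):
    """Apply SCD Type 2 with a 'check' strategy.

    current: list of history rows (dicts) with dbt_valid_from / dbt_valid_to
             (dbt_valid_to is None for the current version of each key).
    incoming: latest source rows, one per business key 'id'.
    Returns the updated history list.
    """
    history = list(current)
    live = {r["id"]: r for r in history if r["dbt_valid_to"] is None}
    for row in incoming:
        old = live.get(row["id"])
        if old is None:
            # New key: open a fresh current record.
            history.append({**row, "dbt_valid_from": now, "dbt_valid_to": None})
        elif any(old[c] != row[c] for c in CHECK_COLS):
            # A tracked column changed: close the old row, open a new version.
            old["dbt_valid_to"] = now
            history.append({**row, "dbt_valid_from": now, "dbt_valid_to": None})
        # Rows unchanged in the tracked columns produce no new version — which is
        # why 'check' survives source-system backfills where 'timestamp' breaks.
    return history
```

The key property is in the final branch: a backfilled record whose tracked columns are unchanged creates no spurious history row, regardless of what its timestamps say.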
dbt Cloud vs dbt Core
Use dbt Core with Cloud Composer (Airflow) if the organisation already operates Airflow and wants orchestration in one place. The Airflow dbt operator runs dbt commands as tasks with full dependency tracking. Use dbt Cloud if the data team wants a managed scheduler, a built-in IDE, and job history without operating Composer. dbt Cloud costs more but eliminates Composer overhead for teams that are only using Airflow for dbt scheduling. Do not use scheduled BigQuery queries or Dataform for complex multi-model dependency graphs because dbt's DAG-based dependency resolution is significantly more maintainable.
Apache Beam / Dataflow
Use Dataflow (Apache Beam on GCP) for the streaming ingestion path: Pub/Sub subscription to Bronze BigQuery table. Dataflow is the right choice here because it is fully managed with autoscaling, guarantees exactly-once processing, and has first-class GCP integrations via native I/O connectors. The Beam programming model is portable: the same pipeline runs on Dataflow for production and on the DirectRunner locally for testing.
Beam programming model
A Beam pipeline is a DAG of transformations applied to PCollections. A PCollection is an immutable, potentially infinite dataset. The same abstraction works for a bounded batch file and an unbounded streaming topic. A PTransform is a function from PCollection to PCollection. Compose them with the pipe operator (|) to build the graph. The runner (Dataflow, DirectRunner, Flink) is pluggable at launch time.
ParDo + DoFn
Element-wise transform. Use for parsing, filtering, enrichment, and routing. The DoFn.process() method emits zero or more outputs per input.
Tagged outputs
Route valid and failed records to separate PCollections inside the same DoFn. Use this to implement quarantine routing without a second pipeline pass.
Side inputs
Inject a slow-changing reference dataset (e.g., product catalog, customer tier) into a streaming ParDo without a stream-to-stream join. Reload the side input on a schedule or when it changes.
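The side-input pattern reduces to a periodically refreshed in-memory lookup. A minimal stdlib sketch of that idea (the loader callable, TTL, and field names are assumptions; in a real Beam pipeline you would use an actual side input or a DoFn-level cache rather than this class):

```python
import time

class RefreshingLookup:
    """Cache a slow-changing reference dataset, reloading after ttl_seconds."""

    def __init__(self, loader, ttl_seconds=300, clock=time.monotonic):
        self._loader = loader      # callable returning {key: reference_row}
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = None
        self._loaded_at = None

    def get(self, key, default=None):
        now = self._clock()
        if self._data is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()   # e.g. re-read the product catalog
            self._loaded_at = now
        return self._data.get(key, default)

def enrich(event, catalog):
    """Join a streaming event against the cached catalog — no stream-to-stream join."""
    ref = catalog.get(event["product_id"], {})
    return {**event, "product_name": ref.get("name", "UNKNOWN")}
```

The injectable `clock` makes the refresh behaviour unit-testable, which matters because a stale side input is a silent data-quality bug, not a crash.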
Windowing, triggers, and watermarks
Streaming pipelines that aggregate over time require explicit window definitions. Watermarks tell Beam when to close a window and emit results: the watermark advances as event timestamps progress, and records arriving after the watermark advances are considered late. Configure allowed lateness for every windowed aggregation. Without it, late records are silently dropped.
Fixed windows
Non-overlapping equal-length buckets. Use for hourly or daily aggregations that feed BigQuery partition inserts. beam.window.FixedWindows(60) = 60-second buckets.
Sliding windows
Overlapping windows for rolling metrics. A 10-minute window sliding every 1 minute produces a new result every minute covering the last 10 minutes. Each element belongs to multiple windows.
Session windows
Dynamic windows defined by gaps in activity. The window closes when no event arrives for the configured gap duration. Use for user session analytics where session length is unpredictable.
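Session windowing is gap-based grouping over event timestamps. A stdlib sketch of the semantics for a single key (not the Beam implementation — real `beam.window.Sessions` windows extend to the last event plus the gap, and merge across bundles):

```python
def sessionize(timestamps, gap_seconds):
    """Group event timestamps into sessions separated by >= gap_seconds of silence.

    Returns a list of (session_start, last_event) tuples, approximating the
    dynamic windows that session windowing produces for one key.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap_seconds:
            sessions[-1] = (sessions[-1][0], ts)   # within the gap: extend session
        else:
            sessions.append((ts, ts))              # gap exceeded: open a new session
    return sessions
```

Two bursts of activity separated by more than the gap become two windows, however long each burst lasts, which is exactly why sessions suit unpredictable user behaviour.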
Flex Templates
Package every Dataflow pipeline as a Flex Template: a containerised job definition stored in Artifact Registry with a metadata JSON file describing its parameters. Flex Templates decouple pipeline deployment from pipeline invocation. CI/CD builds the container and uploads the template; operators or Airflow launch jobs from the template without needing access to the pipeline source code. This is the correct way to deliver reusable ingestion pipelines to domain teams. They supply topic name, target table, and schema version as parameters; the platform team owns the execution logic.
Code: streaming pipeline with quarantine routing
Read from Pub/Sub, parse Avro, route valid records to Bronze and failures to quarantine. The tagged output pattern routes inside a single DoFn pass, with no second pipeline stage needed.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition
import fastavro, io, datetime
# Load ORDER_AVRO_SCHEMA from your schema registry or a local .avsc file.
# fastavro.parse_schema(json.loads(registry_client.get_latest_schema("orders.placed.v1")))
ORDER_AVRO_SCHEMA = fastavro.parse_schema({ ... }) # abbreviated
BRONZE_SCHEMA = {
"fields": [
{"name": "order_id", "type": "STRING", "mode": "REQUIRED"},
{"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
{"name": "amount_gbp", "type": "FLOAT64", "mode": "REQUIRED"},
{"name": "placed_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
{"name": "_ingested_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
]
}
class ParseOrderEvent(beam.DoFn):
"""Parse Avro bytes. Route valid records and failures to separate output tags."""
def process(self, element):
try:
record = fastavro.schemaless_reader(io.BytesIO(element), ORDER_AVRO_SCHEMA)
yield beam.pvalue.TaggedOutput("valid", {
"order_id": record["order_id"],
"customer_id": record["customer_id"],
"amount_gbp": record["amount_cents"] / 100.0,
"placed_at": record["placed_at_ms"] / 1000.0,
"_ingested_at": datetime.datetime.utcnow().isoformat(),
})
except Exception as exc:
# Never silently drop. Route to quarantine with full error context.
yield beam.pvalue.TaggedOutput("failed", {
"raw_hex": element.hex(),
"error": str(exc),
"failed_at": datetime.datetime.utcnow().isoformat(),
})
def run():
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True
with beam.Pipeline(options=options) as p:
results = (
p
| "Read Pub/Sub" >> ReadFromPubSub(
subscription="projects/my-project/subscriptions/orders.placed.v1-bronze"
)
| "Parse Avro" >> beam.ParDo(ParseOrderEvent()).with_outputs("valid", "failed")
)
# Valid records → Bronze table. STREAMING_INSERTS deduplicates on insertId.
(
results.valid
| "Write to Bronze" >> WriteToBigQuery(
"my-project:bronze.orders",
schema=BRONZE_SCHEMA,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
method="STREAMING_INSERTS",
)
)
# Failed records → quarantine table. Full error context retained for investigation.
(
results.failed
| "Write to Quarantine" >> WriteToBigQuery(
"my-project:bronze.orders_quarantine",
schema={"fields": [
{"name": "raw_hex", "type": "STRING"},
{"name": "error", "type": "STRING"},
{"name": "failed_at", "type": "TIMESTAMP"},
]},
write_disposition=BigQueryDisposition.WRITE_APPEND,
)
)
if __name__ == "__main__":
    run()
Set max_num_workers on every Dataflow job. Without it, a downstream consumer stuck in
a retry loop can drive autoscaling to hundreds of workers before the alert fires. Cap workers at a
level that still meets your throughput SLO, then alert when the job sustains its cap for more than
10 minutes. That is the signal that something is wrong upstream.
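The "sustained at cap" condition reduces to a simple check over the worker-count time series. A sketch under stated assumptions (in practice this is a Cloud Monitoring alert policy on the job's worker-count metric, not hand-rolled code; the sample format here is hypothetical):

```python
def sustained_at_cap(samples, max_workers, window_seconds=600):
    """Return True when the job has run at its worker cap for >= window_seconds.

    samples: list of (unix_ts, worker_count) tuples, ordered by time.
    600 seconds matches the 10-minute threshold discussed above.
    """
    streak_started = None
    for ts, workers in samples:
        if workers >= max_workers:
            streak_started = ts if streak_started is None else streak_started
            if ts - streak_started >= window_seconds:
                return True
        else:
            streak_started = None   # dipped below the cap: reset the streak
    return False
```

The reset branch is the important detail: brief spikes to the cap are normal under bursty load; only an unbroken run at the cap indicates an upstream problem.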
Streaming transformations
Flink vs Dataflow: when each wins
Choose Dataflow when
- You are on GCP and want managed infrastructure with zero ops overhead.
- Latency requirements are above 30ms. Dataflow's internal batching is a trade-off worth making for managed autoscaling.
- Exactly-once delivery into BigQuery is a hard requirement.
- The team wants the same Beam code to run as streaming or batch by changing the runner.
Choose Flink when
- Sub-10ms end-to-end latency is required. Flink processes records individually with no internal batching.
- Complex stateful processing: large keyed state, event-time timers, pattern matching across event sequences.
- You are not GCP-primary and need multi-cloud portability.
- The team has existing Flink expertise and wants full control over checkpointing intervals and state backends.
Stateful stream processing
Stateful transforms maintain per-key state across messages: computing a running total per customer, detecting a sequence of events, or deduplicating within a sliding window. In Beam, declare state using BagState, ValueState, or MapState annotations on a DoFn. State is partitioned by key and lives in the runner's state backend (Dataflow uses Bigtable-backed persistent state for large state). Use event-time timers to flush state when the watermark advances past a deadline, not on wall-clock time.
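The state-plus-timer pattern can be reduced to stdlib Python. This is a sketch of the semantics, not Beam's API: in Beam the equivalent is a ValueState plus an event-time timer on a keyed DoFn, and the class and method names here are assumptions.

```python
class RunningTotals:
    """Per-key running totals with event-time expiry, mimicking state + timers."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._state = {}    # key -> (total, last_event_time)

    def add(self, key, amount, event_time):
        total, _ = self._state.get(key, (0.0, event_time))
        self._state[key] = (total + amount, event_time)
        return self._state[key][0]

    def on_watermark(self, watermark):
        """Fired as the watermark advances: flush keys idle past the TTL.

        Without this timer the state grows without bound — the failure mode
        described under 'Stateful pipeline accumulates unbounded state'.
        """
        expired = [k for k, (_, last) in self._state.items()
                   if watermark - last >= self._ttl]
        for k in expired:
            del self._state[k]
        return expired
```

Keying expiry to the watermark rather than wall-clock time means replays and backfills age state out consistently with event time, not with when the job happened to run.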
Stream-to-stream joins
Joining two streams requires both streams to have a common watermark strategy. If one stream is consistently faster than the other, the slower stream's watermark lags, holding the join window open and accumulating state. Set an explicit allowed lateness on both sides and accept that records arriving after the cutoff will be processed as late data. For high-cardinality joins (e.g., enriching order events with product catalog data), prefer a side input pattern over a stream join: load the reference data into memory on a refresh schedule and look up during the main stream's ParDo.
Apache Spark
Use Spark for large-scale batch workloads that exceed BigQuery slot budgets, ML feature engineering pipelines that require PySpark or MLlib, and teams migrating from an existing Spark-based architecture. Do not use Spark for incremental analytical transforms that dbt handles better with less operational overhead.
Dataproc vs GKE Spark Operator
Use Dataproc for ephemeral clusters: spin up, run the job, spin down. Dataproc Serverless (Spark on managed containers) eliminates cluster management entirely for jobs with predictable resource requirements. Use the GKE Spark Operator for long-running Spark Structured Streaming jobs where Kubernetes provides better resource isolation and the team already manages a GKE cluster. Avoid persistent Dataproc clusters because they accumulate idle cost and configuration drift.
Spark Structured Streaming
Spark Structured Streaming runs as a micro-batch engine by default: it processes accumulated records in intervals (trigger once every 30 seconds, for example) rather than record by record. This produces higher latency than Dataflow or Flink but simplifies exactly-once semantics through checkpoint-based offset tracking. Use it for near-real-time aggregations where 30–60s latency is acceptable and the team has Spark expertise. Configure checkpointing to Cloud Storage. Set watermarks explicitly. Without them, aggregations on event time accumulate state indefinitely.
BigQuery SQL
For batch analytical transforms where dbt is managing the model graph, use BigQuery SQL inside dbt models. For standalone scheduled jobs and MERGE-based incremental loads outside of dbt, use Dataform or Cloud Composer with scheduled queries. Avoid using raw scheduled BigQuery queries for anything more complex than a simple aggregation because they have no dependency management, no test framework, and no lineage tracking.
Partitioning and clustering as performance levers
Partition every Silver and Gold table by the date column most used in WHERE clauses, typically event date or processing date. Cluster by the columns most used in GROUP BY or JOIN ON clauses. Partitioning eliminates full-table scans; clustering reduces the data scanned within a partition. Both are defined at table creation time in dbt config or Terraform and cannot be changed without recreation. Get them right before the table accumulates data.
MERGE for incremental loads
When running incremental loads outside of dbt, use BigQuery MERGE. The pattern is identical to the dbt incremental model: source is a filtered slice of Bronze with a lookback window and deduplication applied; the MERGE statement upserts into Silver on the primary key.
-- Incremental load via MERGE (raw SQL, for Dataform or Cloud Composer scheduling)
-- Use this pattern when dbt is not managing the pipeline.
MERGE silver.orders AS target
USING (
SELECT
order_id,
customer_id,
amount_gbp,
placed_at,
DATE(placed_at) AS placed_date,
_ingested_at
FROM bronze.orders
-- 3-hour lookback covers late arrivals without scanning the full Bronze table.
WHERE _ingested_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR)
-- Deduplicate: one row per order_id, latest ingestion wins.
QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id ORDER BY _ingested_at DESC
) = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
amount_gbp = source.amount_gbp,
placed_at = source.placed_at,
_ingested_at = source._ingested_at
WHEN NOT MATCHED THEN INSERT (
order_id, customer_id, amount_gbp, placed_at, placed_date, _ingested_at
) VALUES (
source.order_id, source.customer_id, source.amount_gbp,
source.placed_at, source.placed_date, source._ingested_at
);
Materialized views for the serving layer
Create BigQuery materialized views on top of Gold tables for high-traffic BI dashboards. Materialized views are pre-computed and cached; BigQuery refreshes them automatically when the base table changes. They eliminate repeated full-table scans on large aggregation queries. Use them for the top-level metrics that Looker dashboards query on every page load, not for exploration queries that change shape frequently.
Data quality and validation
Validate at three distinct points in the pipeline. A single quality gate at the end of the pipeline finds problems too late. A single gate at ingestion catches schema violations but misses business rule failures that only appear after transformation.
Point 1: pre-transform schema check
Validate Avro/Protobuf schema at ingestion time in Dataflow before writing to Bronze. Reject messages with unknown fields or wrong types. Route to quarantine with the raw bytes, the schema version attempted, and the validation error. Never write a structurally invalid record to Bronze because it will corrupt every downstream transform that processes that partition.
Point 2: mid-transform business rule assertions
Apply dbt tests and inline Dataflow assertions against business rules: order amounts must be positive, customer IDs must exist in the customer dimension, timestamps must fall within a reasonable range (not year 1970, not year 2100). In Dataflow, use a second tagged output to route records that fail business validation to a quarantine table rather than failing the entire pipeline.
Point 3: post-transform aggregate reconciliation
After each layer promotion, verify that aggregate counts and sums reconcile with the layer above. The record count in Silver for a given date partition should be within a defined tolerance of the Bronze record count for the same window. Revenue totals should reconcile within 0.01% across layers. Configure these as custom dbt tests or as Cloud Monitoring metric assertions. Block Gold publication when reconciliation fails.
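The reconciliation gate itself is just a tolerance comparison between adjacent layers. A hedged sketch of the check (the metric names and the 0.01% revenue tolerance follow the text; in practice this logic lives in a custom dbt test or a Cloud Monitoring assertion, not a standalone script):

```python
def reconcile(bronze_value, silver_value, tolerance_pct):
    """Return (ok, drift_pct). Promotion is blocked when ok is False."""
    if bronze_value == 0:
        return silver_value == 0, (0.0 if silver_value == 0 else float("inf"))
    drift_pct = abs(silver_value - bronze_value) / abs(bronze_value) * 100.0
    return drift_pct <= tolerance_pct, drift_pct

def gate(checks):
    """checks: iterable of (name, bronze_value, silver_value, tolerance_pct).

    Raises to block Gold publication — a failing reconciliation must stop the
    pipeline, not log a warning and continue.
    """
    failures = [name for name, b, s, tol in checks if not reconcile(b, s, tol)[0]]
    if failures:
        raise ValueError(f"reconciliation failed for {failures}; blocking Gold publication")
```

Raising instead of warning is deliberate: it makes the reconciliation result part of the pipeline's control flow, so orchestration tooling sees a failed task rather than a green run with a log line.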
Incremental vs full refresh
Full refresh rebuilds the entire table from source on every run. It is safe, simple, and completely non-viable at scale. Use full refresh only for small dimension tables under 10 GB where the simplicity of not managing a watermark outweighs the cost of rebuilding. For everything else, incremental is mandatory.
Incremental requirements
- A reliable watermark column: an ingestion timestamp (_ingested_at) is more reliable than an event timestamp, which sources can backfill.
- Idempotent merge logic: running the same incremental window twice must produce the same output.
- A lookback window that covers your late data SLA. Start at 3 hours and widen if needed.
- A unique_key for deduplication when the same business event can appear multiple times.
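The idempotence requirement is directly testable: applying the same window twice must leave the target unchanged. A stdlib sketch of the merge-with-dedup contract that the dbt incremental model and the BigQuery MERGE express in SQL (field names follow the orders example; this is a model of the semantics, not production code):

```python
def merge_window(target, window_rows):
    """Upsert a deduplicated incremental window into target ({order_id: row}).

    Latest _ingested_at wins, so re-delivered events and replayed windows are
    harmless: merge_window(merge_window(t, w), w) == merge_window(t, w).
    """
    result = dict(target)
    for row in window_rows:
        current = result.get(row["order_id"])
        if current is None or row["_ingested_at"] >= current["_ingested_at"]:
            result[row["order_id"]] = row
    return result
```

This is also the property a pipeline needs to be replayable from Bronze: reprocessing a historical window is just another application of the same merge.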
When incremental breaks
- Schema changes: adding a non-nullable column to an incremental model requires a full refresh because the existing rows do not have the new column.
- Retroactive corrections: a source system corrects records older than the lookback window. Those corrections are missed unless you widen the window or trigger a targeted backfill.
- Deleted records: incremental merge only handles INSERT and UPDATE. Soft deletes in the source must be propagated explicitly via a deleted_at flag.
Late-arriving data
"Late" means a record whose event timestamp falls before the current watermark. In this architecture, most late data arrives within minutes but occasionally arrives days later, such as a mobile app that was offline, a batch file retransmitted after a network failure, or a CDC backfill from a source system maintenance window. Define the late data SLA per domain before building the pipeline, not after encountering late records in production.
Strategies
Every incremental model uses a lookback window (e.g., 3 hours) that is wider than the expected late data arrival. Records arriving within the window are automatically included in the next run. Simple, reliable, and sufficient for most use cases.
For streaming aggregations, configure withAllowedLateness(Duration.standardMinutes(30)) on the window. Beam will hold the window open and fire a speculative pane when late records arrive. Each pane updates the BigQuery row via MERGE semantics. Set the accumulation mode to ACCUMULATING if the downstream consumer cannot handle delta corrections.
Records arriving more than 24 hours late require a manual or scheduled targeted backfill. Define the procedure in the pipeline runbook: which Bronze partition to reprocess, which Silver/Gold models to rebuild, how to verify the output. Test this procedure in a Game Day before you need it in an incident.
Operational decision: defining the SLA
Define three thresholds for each domain. Records within the on-time window (e.g., under 15 minutes) are processed normally by the incremental pipeline. Records within the late window (e.g., 15 minutes to 24 hours) are caught by the lookback and processed on the next run, with no action required. Records outside the correction window (e.g., beyond 24 hours) require a targeted backfill, trigger an alert, and are logged in the incident register. Document these thresholds in the data product specification. Consumers need to know the maximum correction lag when building business reports.
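The three thresholds reduce to a lag classification per record. A sketch using the example values from the text (the real thresholds are per-domain configuration, not constants):

```python
ON_TIME_S = 15 * 60        # on-time window: under 15 minutes (example value)
CORRECTION_S = 24 * 3600   # correction window: beyond 24 hours needs a backfill

def classify_lateness(event_ts, ingested_ts,
                      on_time_s=ON_TIME_S, correction_s=CORRECTION_S):
    """Return the handling path for a record given its arrival lag in seconds."""
    lag = ingested_ts - event_ts
    if lag <= on_time_s:
        return "on_time"            # normal incremental processing
    if lag <= correction_s:
        return "late"               # caught by the lookback window, no action
    return "backfill_required"      # targeted backfill + alert + incident register
```

Emitting this classification as a label on an ingestion metric gives you the lateness distribution per domain, which is the evidence you need when renegotiating the SLA with consumers.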
Orchestration
Treat pipelines as code. Every transformation job is version-controlled, has a defined trigger, and declares its upstream dependencies explicitly. Ad-hoc scripts run manually by an engineer are not pipelines. They are technical debt.
Incremental loads
Schedule dbt runs against a lookback window, not against a full table. Use dbt's --select flag to run only affected models when a source schema changes, not the full project.
Dry-run mode
Every transformation pipeline must support a dry-run flag: run the full logic, validate the output, but do not write to production tables. Use dry-run in CI to catch logic errors before deployment.
Dependency tracking
Use Cloud Composer (Airflow) for cross-system orchestration: trigger dbt after a Dataflow job completes, trigger a Spark job after a GCS file arrives. Use dbt's own DAG for intra-dbt dependencies.
Backfill automation
Design backfills as parameterised pipeline runs with explicit date range inputs, not as one-off scripts. A backfill that cannot be triggered from the same CI/CD toolchain as a normal run is a manual process pretending to be automated.
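A parameterised backfill entry point can be sketched with argparse. The flag names and default selector here are assumptions; the point is that the date range and dry-run mode are explicit parameters, so the same CI/CD toolchain that launches normal runs can launch a backfill:

```python
import argparse
from datetime import date

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Backfill Silver/Gold models from historical Bronze partitions.")
    parser.add_argument("--start-date", required=True, type=date.fromisoformat,
                        help="First Bronze partition to reprocess (YYYY-MM-DD).")
    parser.add_argument("--end-date", required=True, type=date.fromisoformat,
                        help="Last Bronze partition to reprocess, inclusive.")
    parser.add_argument("--models", default="silver.orders",
                        help="Comma-separated dbt selectors to rebuild (hypothetical default).")
    parser.add_argument("--dry-run", action="store_true",
                        help="Run the full logic and validations without writing "
                             "to production tables.")
    args = parser.parse_args(argv)
    if args.end_date < args.start_date:
        parser.error("--end-date must be on or after --start-date")
    return args
```

Taking `argv` as a parameter keeps the entry point unit-testable, and the built-in `--dry-run` flag satisfies the dry-run requirement above with no separate code path.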
Publishing rules
- Quality gates: data must pass all dbt tests and aggregate reconciliation checks before the Gold model is marked visible to consumers. Failing tests block the model from publishing. They do not produce a warning and continue.
- Documentation requirement: every new table in Silver or Gold requires a catalog entry with: field descriptions, the business definition of the primary key, the SLO (freshness, completeness), and the owner on-call contact. Missing documentation blocks the pull request.
- Deprecation policy: breaking changes require a 3-month sunset period. Notify all downstream owners at deprecation announcement, at 30 days, and at 7 days before removal. The schema registry records the deprecation date.
Key takeaways
- 01 Use ELT by default. Dataflow applies ETL inline before Bronze only for PII masking and schema validation. Everything after Bronze is dbt and BigQuery SQL.
- 02 Configure dbt incremental models with a watermark lookback and explicit deduplication from day one. Adding incremental logic to a large model after the fact requires a full rebuild and downstream coordination.
- 03 Route every failed record to a quarantine table with full context. Silent drops are the most expensive failure mode: no errors, no alerts, and the data consumers find the discrepancy during a business review.
- 04 Validate at three points: schema at ingestion (Dataflow), business rules mid-transform (dbt tests + inline Beam assertions), aggregate reconciliation after layer promotion. A single gate at the end is too late.
- 05 Define the late data SLA per domain before building the pipeline. On-time window, late window, correction window. Consumers need to understand the maximum correction lag when building business reports.
- 06 Cap Dataflow worker counts. Autoscaling without a maximum is a budget risk, not a feature. Set the cap at the level that meets your throughput SLO, then alert when the job sustains the cap.
Failure modes
- ! Silent record drops at ingestion. A Dataflow pipeline catches a parsing exception and swallows it, with no dead letter routing configured. Bronze accumulates records but has gaps. Gold metrics look normal. A business analyst notices anomalous figures six weeks later. No audit trail, no raw bytes, no error context.
- ! Full refresh on a 500 GB Silver table. A model that was configured as full refresh during development is promoted to production. At 200 GB it runs in 8 minutes. At 500 GB it times out. Retrofitting incremental logic requires rebuilding the model, coordinating a schema migration, and reprocessing historical data, all a multi-day effort.
- ! Subtle data drift. Transformation logic is unchanged but source data distribution shifts: a currency conversion field starts returning null for a new region, driving all revenue metrics for that region to zero. No schema violation, no dbt test fails. The drift goes undetected until a regional review catches it.
- ! Breaking a downstream column without notice. A developer renames a column in a Silver model. The PR passes dbt tests because no Silver test references the old name. Three Gold models and two Looker views break silently. The failure surfaces the next morning when dashboards show null values.
- ! Logic fragmentation into BI tools. A "quick fix" revenue calculation is added directly in a Looker LookML measure instead of in dbt. Six months later, two teams are running the same metric with different results. One uses the dbt definition; the other uses the Looker override. Neither team knows the other definition exists.
- ! Stateful pipeline accumulates unbounded state. A Dataflow streaming job maintains per-customer running totals without a state expiry timer. Over six months the state backend grows to hundreds of GB. Checkpoint time exceeds the maximum allowed, causing the job to restart. Each restart triggers a full state recovery, compounding the problem.