SYSTEM_CONSOLE v2.4.0

Data Ingestion

LAST_UPDATED: 2025-07

Getting data into the platform correctly is the hardest part to fix later. The design here is event-first: business changes publish to the message bus with strict schema enforcement, and everything that can go wrong has an explicit handling path before it touches Bronze storage.

Key Takeaways

  • 01 Prefer domain events for new services, CDC for legacy.
  • 02 Ensure idempotency to handle retries and replays safely.
  • 03 Enforce strict schema versioning to prevent breaking downstream consumers.
  • 04 Automate DLQ management for operational resilience.

Checklist

  • Contract defined, versioned, and stored in a registry.
  • Idempotency strategy documented for each source.
  • DLQ + alerting configured with clear ownership.
  • Replay runbook exists and has been tested (Game Day).

Ingestion modes

Domain Events

"Something happened in the business."

Applications publish structured events (e.g., OrderPlaced) directly to a message bus. These carry explicit business intent rather than raw database state.
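A domain event might look like the following sketch; the field names and helper are illustrative, not a fixed contract:

```python
import json
import uuid
from datetime import datetime, timezone

# Illustrative OrderPlaced event builder; field names are assumptions.
def make_order_placed(order_id: str, customer_id: str, total_cents: int) -> dict:
    return {
        "event_type": "OrderPlaced",           # explicit business intent
        "event_id": str(uuid.uuid4()),         # unique per publication
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": {
            "order_id": order_id,
            "customer_id": customer_id,
            "total_cents": total_cents,
        },
    }

event = make_order_placed("ord-1001", "cust-42", 2599)
print(json.dumps(event, indent=2))
```

Note that the event says what happened ("an order was placed"), not which rows changed.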

Change Data Capture

"Something changed in the database."

Streaming changes directly from database logs. Useful for legacy systems where application code cannot be easily modified.
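By contrast, a CDC record carries raw database state. The shape below loosely follows log-based CDC tools but is illustrative, not any tool's exact format:

```python
from typing import Optional

# A CDC change record: column-level before/after images, no business intent.
change = {
    "op": "u",                                   # c=create, u=update, d=delete
    "source": {"table": "orders", "lsn": 88123}, # position in the database log
    "before": {"order_id": "ord-1001", "status": "PENDING"},
    "after":  {"order_id": "ord-1001", "status": "SHIPPED"},
}

# Downstream code must infer intent ("order shipped") from raw column diffs:
def infer_status_change(rec: dict) -> Optional[str]:
    if rec["op"] == "u" and rec["before"]["status"] != rec["after"]["status"]:
        return f"status: {rec['before']['status']} -> {rec['after']['status']}"
    return None

print(infer_status_change(change))  # status: PENDING -> SHIPPED
```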

Batch Ingestion

"Here is everything from yesterday."

Scheduled loads of large files or API exports. Used for SaaS providers or systems that do not support streaming.

Patterns

Outbox pattern

Publishing to a message bus and committing a database transaction are two separate writes, and either can fail after the other succeeds, leaving state and events out of sync. The Outbox pattern avoids this: the event is written to a local 'outbox' table in the same transaction as the business change, then a separate relay process reads from the outbox and publishes. If the relay fails, the event stays in the outbox until it succeeds.

Best practice
Always use a local outbox to maintain atomicity between state changes and event notifications.
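A minimal sketch of the pattern, using SQLite to stand in for the application database; table, column, and function names are illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, total_cents INTEGER)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT,"
             " payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id: str, total_cents: int) -> None:
    # State change and event write share ONE transaction: both commit or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("sales.orders.placed.v1", json.dumps({"order_id": order_id})),
        )

def relay_once(publish) -> int:
    # A separate relay publishes pending rows; if publish raises, the row stays put
    # and is retried on the next pass.
    rows = conn.execute("SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)

place_order("ord-1001", 2599)
sent = []
relay_once(lambda topic, payload: sent.append((topic, payload)))
print(sent)
```

Because the relay is at-least-once (a crash between publish and the flag update re-sends the row), it pairs naturally with the idempotent consumers described next.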

Idempotent consumers and deduplication

Network failures can cause events to be delivered multiple times. Ingestion pipelines must be idempotent, meaning processing the same event twice results in the same state.

  • Use a unique Business ID (e.g., Order ID + Version) as the primary key in the target storage.
  • Maintain a 'seen_events' log for a sliding window of time.
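Both techniques can be sketched together: an upsert keyed on the business ID plus a bounded seen-set for fast dedup. Names (apply_event, seen_events) and the window size are illustrative:

```python
from collections import OrderedDict

orders: dict = {}                    # target storage keyed by business ID
seen_events: OrderedDict = OrderedDict()
SEEN_WINDOW = 10_000                 # sliding window of recent dedup keys

def apply_event(event: dict) -> bool:
    # Dedup key: business ID + version, so retries and replays are no-ops.
    key = f"{event['order_id']}:{event['version']}"
    if key in seen_events:
        return False                 # already processed; safe to skip
    orders[event["order_id"]] = event  # upsert: same event twice => same state
    seen_events[key] = None
    if len(seen_events) > SEEN_WINDOW:
        seen_events.popitem(last=False)  # evict the oldest key
    return True

e = {"order_id": "ord-1001", "version": 1, "status": "PLACED"}
print(apply_event(e), apply_event(e))  # True False: the retry is ignored
```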

Schema versioning and compatibility

Schemas will evolve. Use a schema registry to enforce compatibility rules.

Backward Compatible

New code can read old data. (e.g., adding an optional field)

Forward Compatible

Old code can read new data. (e.g., deleting an optional field)
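The two rules can be illustrated with a toy check over a simplified schema model ({field: required?}). Real registries apply much richer rules (type promotion, defaults, aliases); this only shows the intuition:

```python
def backward_compatible(old: dict, new: dict) -> bool:
    # New readers must handle old data: any field the new schema adds must be optional.
    added = set(new) - set(old)
    return all(not new[f] for f in added)

def forward_compatible(old: dict, new: dict) -> bool:
    # Old readers must handle new data: any field the new schema removes
    # must have been optional in the old one.
    removed = set(old) - set(new)
    return all(not old[f] for f in removed)

v1 = {"order_id": True, "total_cents": True}                        # True = required
v2 = {"order_id": True, "total_cents": True, "coupon_code": False}  # adds optional field

print(backward_compatible(v1, v2))  # True: adding an optional field
print(forward_compatible(v2, v1))   # True: deleting an optional field
```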

Dead-letter queues and poison messages

Messages that fail validation or processing should not block the pipeline. Move them to a Dead-letter Queue (DLQ).

DLQ pattern
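A minimal sketch, assuming a plain list stands in for the real dead-letter topic; helper names are illustrative, and the backoff schedule mirrors the retry policy in the operational guidelines:

```python
import time

BACKOFF_SECONDS = [1, 10, 60, 300]          # 1s, 10s, 60s, 5m before giving up

def process_with_dlq(message: dict, handler, dlq: list, sleep=time.sleep) -> bool:
    last_error = ""
    for attempt, delay in enumerate([0] + BACKOFF_SECONDS, start=1):
        sleep(delay)                        # back off between attempts
        try:
            handler(message)
            return True                     # processed; pipeline keeps moving
        except Exception as exc:
            last_error = str(exc)
    # Poison message: park it with enough context to triage and replay later.
    dlq.append({"message": message, "error": last_error, "attempts": attempt})
    return False

def always_fail(message: dict) -> None:
    raise ValueError("order_id missing")

dlq = []
ok = process_with_dlq({"order_id": None}, always_fail, dlq, sleep=lambda s: None)
print(ok, dlq[0]["attempts"])  # False 5
```

The healthy path returns immediately; only the poison path pays the backoff cost, and the parked record keeps the original message plus the failure reason for the on-call owner.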

Operational guidelines

  • Naming conventions: domain.subdomain.event_type.version (e.g., sales.orders.placed.v1)
  • Required metadata tags: every event must include correlation_id, timestamp_utc, source_system, domain_owner.
  • Retry policy: exponential backoff (e.g., 1s, 10s, 60s, 5m) before moving to DLQ.
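The naming convention and metadata requirements are easy to enforce at publish time. A guard-rail sketch, where the regex and helper name are illustrative:

```python
import re

TOPIC_PATTERN = re.compile(r"^[a-z0-9_]+\.[a-z0-9_]+\.[a-z0-9_]+\.v\d+$")
REQUIRED_TAGS = {"correlation_id", "timestamp_utc", "source_system", "domain_owner"}

def validate_publication(topic: str, event: dict) -> list:
    """Return a list of violations; empty means the event may be published."""
    errors = []
    if not TOPIC_PATTERN.match(topic):
        errors.append(f"topic '{topic}' violates domain.subdomain.event_type.version")
    missing = REQUIRED_TAGS - event.keys()
    if missing:
        errors.append(f"missing metadata tags: {sorted(missing)}")
    return errors

good = {"correlation_id": "c-1", "timestamp_utc": "2025-07-01T00:00:00Z",
        "source_system": "orders-svc", "domain_owner": "sales"}
print(validate_publication("sales.orders.placed.v1", good))  # []
print(validate_publication("OrdersPlaced", {}))              # two violations
```
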

GCP mapping

Pub/Sub (Messaging), Datastream (CDC), Dataflow (Processing), Cloud Storage (Batch Landing), BigQuery (Raw Tables).

Failure modes

  • ! Duplicate Events: Lack of idempotency causes double-counting in business reports.
  • ! Late Events: Out-of-order arrivals corrupt state when pipelines key off arrival time instead of the event timestamp.
  • ! Schema Breaking: Producers change field types without a major version bump, crashing consumers.
  • ! Uncontrolled Replays: Re-ingesting massive amounts of data without pausing downstream triggers, causing cost explosions.