SYSTEM_CONSOLE v2.4.0

Schema Design and Governance

LAST_UPDATED: 2025-08

Schema is not a technical detail. It is the contract between every producer and consumer in an event-driven system. In a monolith, a broken data structure fails at compile time or in a test. In an event-driven platform, a broken schema propagates silently through the pipeline, corrupting dashboards, starving consumers, and surfacing in an audit weeks after the damage was done. Schema enforcement is non-negotiable in a production event-driven platform.

Key Takeaways

  • 01 Schema is the API contract between domains. Treat it with the same rigour.
  • 02 Validate at the producer before the message enters the topic, not after.
  • 03 Use FULL compatibility as the default for all production topics.
  • 04 Breaking changes require a new topic version, not a hotfix.
  • 05 Store schemas in version control alongside the service that owns them.
  • 06 Never silently drop a message that fails schema validation.

Checklist

  • Schema registry deployed and accessible by all producers and consumers.
  • FULL compatibility mode configured for all production topics.
  • Schema compatibility check integrated as a required CI/CD step.
  • All schemas stored in version control with owner, version, and deprecation policy.
  • DLQ configured to capture schema validation failures with raw bytes and schema ID.
  • Breaking change migration runbook documented and reviewed.

Why schema matters in event-driven systems

In a monolith, a broken data structure fails fast. The compiler catches a type mismatch. A unit test catches a missing field. The developer sees the error within seconds of making the change. The blast radius is contained.

In an event-driven platform, domains communicate through messages on a topic. The producer and consumer are deployed independently, often by different teams. There is no shared compilation step. When a producer changes a field from string to integer and deploys without a version bump, the consumer fails to deserialize every subsequent message, or worse, silently coerces the value and produces corrupt results downstream. The source team has moved on. The problem surfaces in a business report three days later.

Events are the API between domains. Schema is the contract that makes that API stable. Without a schema registry enforcing compatibility, every deployment is a potential silent breaking change.

Failure modes without schema enforcement

  • ! Producer adds a field, consumer breaks silently. Depending on the framework, consumers pinned to a fixed schema either treat the new field as unknown and ignore it, or throw an exception. No alert fires because the pipeline still "succeeds."
  • ! Producer removes a required field, downstream aggregations return nulls. No error is thrown. The null propagates into BigQuery tables and surfaces when a business analyst questions why revenue figures dropped to zero for a region.
  • ! Type change causes deserialization failure hours after deployment. A producer changes amount from string to float. The consumer deployment was staged. By the time the consumer picks up the type mismatch, thousands of messages are in the DLQ.
  • ! No audit trail of what changed, when, and who approved it. Six months after a schema drift incident, no one can reconstruct which version of the schema was active when the corrupted data was produced.

Schema format comparison

Avro

"The Kafka-native choice."

Schema is defined in JSON and registered separately. Binary encoding with schema embedded in the file or resolved via registry ID. Excellent Kafka and Hadoop ecosystem support. Schema evolution is first-class. Requires a schema registry to decouple schema from payload in production. Most tooling in the GCP/Kafka ecosystem targets Avro first.

Protobuf

"The cross-language choice."

Language-agnostic IDL compiled to typed classes in every major language. Smaller binary payload than Avro. Strong typing with explicit field numbers, where field number reuse is a breaking change. Well-suited for multi-language systems and gRPC-adjacent architectures. Steeper onboarding curve than Avro for data engineering teams.

JSON Schema

"The REST API boundary choice."

Human-readable, no binary encoding, and validation only. It does not define serialization. High payload overhead at volume. Schema evolution support is minimal compared to Avro or Protobuf. Appropriate at the API boundary (REST webhooks, third-party integrations) but not inside a streaming pipeline.

Format       Encoding     Payload size  Schema evolution  Tooling maturity     GCP native support       Recommended for
Avro         Binary       Small         Strong            High (Kafka-native)  Yes (Pub/Sub, Dataflow)  Kafka / Pub/Sub streaming pipelines
Protobuf     Binary       Smallest      Strong            High (gRPC-native)   Yes (Pub/Sub, Dataflow)  Multi-language or gRPC-adjacent systems
JSON Schema  Text (JSON)  Large         Limited           High (REST-native)   Partial (Pub/Sub only)   API boundaries, third-party webhooks

Best practice
Use Avro for Kafka-based or Pub/Sub streaming pipelines. Use Protobuf for multi-language services or systems with gRPC dependencies. Use JSON Schema only at the API boundary, never inside the streaming pipeline.

Schema registry architecture

A schema registry stores schemas, assigns unique numeric IDs, enforces compatibility rules, and decouples the schema definition from the message payload. The producer registers a schema before publishing. Every outbound message carries a compact schema ID rather than the full schema (in the Confluent wire format, a five-byte prefix on the payload). The consumer fetches the schema by that ID on first use and caches it locally.
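As an illustration, the Confluent wire format frames each message value with a magic byte and a four-byte big-endian schema ID ahead of the serialized payload. A minimal sketch of that framing; the `frame`/`unframe` helpers are illustrative names, not a client API:

```python
import struct

MAGIC_BYTE = 0  # Confluent wire-format marker


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix the serialized payload with the magic byte and a
    4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes) -> tuple[int, bytes]:
    """Split an inbound message into (schema_id, payload). The consumer
    resolves schema_id against the registry and caches the result."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a registry-framed message")
    return schema_id, message[5:]
```

Because the ID travels with every message, a consumer replaying data from a year ago can still resolve exactly which schema version encoded it.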

Schema registry in the platform flow

Confluent Schema Registry

The de facto standard for Kafka ecosystems. REST API, supports Avro, Protobuf, and JSON Schema. Compatibility modes configurable per-subject. Available as managed service or self-hosted. Widely supported by Kafka clients and GCP Dataflow connectors.

Apicurio Registry

Open source, Confluent API-compatible (can be used as a drop-in replacement). Supports Kafka as a storage backend, removing the need for a separate database. Better fit for non-Confluent deployments or organisations with open-source procurement constraints.

Google Pub/Sub Schema

Native Avro and Protobuf validation at the topic level. Simpler than a dedicated registry, with the schema attached to the topic directly. Limitation: no schema ID is embedded in the message, so consumers cannot resolve the schema version independently. Not suitable as the sole schema registry for a multi-topic platform.

Risk
Google Pub/Sub's native schema support validates message format but does not embed a schema ID in the message payload. Consumers have no way to determine which schema version was used to encode a given message. For replay scenarios or cross-system consumers, a dedicated registry (Confluent or Apicurio) is required.

Schema evolution and compatibility

Schema evolution is not optional. Schemas will change. The question is whether those changes are controlled. A schema registry enforces compatibility rules before a new schema version is accepted. Understand the four modes:

BACKWARD

New schema can read data written with the old schema. Consumers upgrade first, then producers. Safe if you only add optional fields with defaults or remove fields consumers do not use.

FORWARD

Old schema can read data written with the new schema. Producers upgrade first, then consumers. Safe if you add optional fields that older consumers can ignore.

FULL, the recommended default

Both backward and forward compatible. Producers and consumers can upgrade in any order. The most conservative and operationally safest mode. Use this for all production topics.

NONE, never use in production

No compatibility check. Every schema version is accepted regardless of breaking changes. Equivalent to having no schema registry at all. Only appropriate for local development or throwaway topics.
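The four modes imply a safe rollout order, which can be captured as a small helper. This is a sketch of the rules described above; `upgrade_is_safe` is a hypothetical name, not a registry API:

```python
def upgrade_is_safe(mode: str, upgraded_first: str) -> bool:
    """Whether upgrading `upgraded_first` ('producers' or 'consumers')
    before the other side is safe under the given compatibility mode."""
    if mode == "FULL":
        return True                          # any order is safe
    if mode == "BACKWARD":
        return upgraded_first == "consumers"  # new schema reads old data
    if mode == "FORWARD":
        return upgraded_first == "producers"  # old schema reads new data
    return False                              # NONE: no guarantee either way
```

A deployment pipeline can call a check like this before releasing either side of a topic, turning the mode table into an enforced rule rather than tribal knowledge.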

Non-breaking changes

These changes are safe to deploy without a version bump or consumer coordination:

  • Adding an optional field with a default value.
  • Adding a new enum value at the end of the enum definition (Avro; safe only when readers declare an enum default).
  • Adding a new optional message field (Protobuf) without reusing an old field number.
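The first pattern above, adding an optional field with a default, looks like this in Avro terms. The `OrderPlaced` schemas are hypothetical, and `added_fields_have_defaults` is a simplified sketch; a real registry performs full schema resolution, not just this one check:

```python
# Hypothetical OrderPlaced schema, versions 1 and 2. v2 adds one optional
# field with a default, the non-breaking pattern described above.
V1 = {"type": "record", "name": "OrderPlaced", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount_minor", "type": "long"},
]}

V2 = {"type": "record", "name": "OrderPlaced", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount_minor", "type": "long"},
    {"name": "coupon_code", "type": ["null", "string"], "default": None},
]}


def added_fields_have_defaults(old: dict, new: dict) -> bool:
    """True when every field present in `new` but not in `old` carries a
    default, so data written with the old schema can still be read."""
    old_names = {f["name"] for f in old["fields"]}
    return all("default" in f
               for f in new["fields"] if f["name"] not in old_names)
```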

Breaking changes

These changes require a new topic version and a migration plan:

  • Removing a field that consumers depend on.
  • Renaming a field, which is semantically equivalent to a remove and add.
  • Changing a field type (string to integer, integer to float).
  • Reordering Protobuf field numbers.
  • Adding a required field without a default to an existing schema version.

Handling a breaking change

When a breaking change is genuinely necessary, do not push it as a patch to the existing topic. The migration process is:

  1. Create the new schema version in the registry and verify it fails the compatibility check for the existing version.
  2. Create a new topic version (e.g., sales.orders.placed.v2).
  3. Update producers to dual-write to both the old topic and the new topic.
  4. Migrate consumers to the new topic one by one.
  5. Set a deprecation date for the old topic and communicate it to all consumers.
  6. Remove the dual-write from producers after all consumers have migrated.
  7. Archive the old topic after the sunset date.

Enforcing schemas at the producer side

Schema validation must happen before the message enters the topic. Validating on the consumer side, or relying on the broker to reject invalid messages, pushes the error downstream: by then the message has already been produced, so it counts as traffic, consumes an offset, and may be partially processed before failing.

Validate first

Before calling the Kafka or Pub/Sub SDK to publish, serialize the event object against the schema from the registry. If validation fails, reject the message locally and surface the error to the calling service. Never swallow the error and publish anyway.
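A minimal validate-first sketch, assuming injected `serialize` and `publish` callables that stand in for your Avro serializer and messaging client:

```python
class SchemaValidationError(Exception):
    """Raised when an event fails validation before publishing."""


def safe_publish(event: dict, schema: dict, serialize, publish) -> None:
    """Serialize against the registered schema first; only publish the
    result if serialization succeeds."""
    try:
        payload = serialize(event, schema)  # raises on contract violation
    except Exception as exc:
        # Reject locally and surface the error to the calling service.
        # Never swallow the error and publish anyway.
        raise SchemaValidationError(
            f"event rejected before publish: {exc}") from exc
    publish(payload)
```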

Schema ownership

The team that owns the domain owns the schema. A schema change is a code change. It requires a pull request, a reviewer, and a registry update, not a Slack message and a hotfix. Treat a schema change the same way you treat an API contract change.

CI/CD gate

Add a schema compatibility check as a required step in every service's deployment pipeline. The step fetches the latest schema version from the registry and runs a compatibility check against the proposed new schema. If the check fails, the deployment is blocked. Breaking changes require an explicit topic version bump to proceed.
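A sketch of such a gate against Confluent Schema Registry's compatibility endpoint (`POST /compatibility/subjects/{subject}/versions/latest`). The HTTP transport is injected so the gate can be unit-tested without a live registry; wiring `post` to `urllib` and failing the build when the gate returns `False` is left to the pipeline:

```python
import json


def ci_compatibility_gate(subject: str, candidate_schema: dict, post) -> bool:
    """Ask the registry whether `candidate_schema` is compatible with the
    latest registered version for `subject`. `post(path, body)` is an
    injected HTTP helper returning the decoded JSON response."""
    path = f"/compatibility/subjects/{subject}/versions/latest"
    # The registry expects the schema as an escaped JSON string.
    body = {"schema": json.dumps(candidate_schema)}
    response = post(path, body)
    return bool(response.get("is_compatible"))
```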

Best practice
Never publish an unvalidated event. A producer that bypasses schema validation is a producer that will eventually corrupt downstream data silently.

Schema validation at the consumer side

Even with producer-side enforcement, consumers must validate on deserialization. The registry may return a newer schema version than the consumer has cached. Unknown fields should be handled explicitly, either ignored deliberately with a log entry or metric, or rejected; never dropped without a record.

When a consumer receives a message it cannot deserialize, the correct action is:

  1. Capture the raw bytes, schema ID, topic, partition, and offset.
  2. Capture the deserialization error message and stack trace.
  3. Route the message to the DLQ with all of the above as metadata.
  4. Do not retry a deserialization failure automatically. It will fail again. Alert on it and investigate.

Risk
Never silently drop a message that fails schema validation. Dropping is data loss. Route to the DLQ and alert. The failure reveals a producer-side schema change that bypassed the CI/CD gate, which is itself an incident.
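The capture steps above can be sketched as a single helper. The in-memory `dlq` list is a stand-in for a real dead-letter topic:

```python
import traceback


def route_to_dlq(dlq: list, raw: bytes, schema_id: int, topic: str,
                 partition: int, offset: int, error: Exception) -> None:
    """Park an undeserializable message with everything steps 1-3 above
    require: raw bytes, schema ID, position, error, and stack trace."""
    dlq.append({
        "raw_bytes": raw,        # preserved verbatim for later replay
        "schema_id": schema_id,
        "topic": topic,
        "partition": partition,
        "offset": offset,
        "error": repr(error),
        "stack": "".join(traceback.format_exception(
            type(error), error, error.__traceback__)),
    })
    # No automatic retry here: a deserialization failure will fail again.
    # Alerting on DLQ depth is the pipeline's job.
```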

Breaking vs non-breaking: decision tree

Additive change with a default value? Non-breaking. Deploy freely. No consumer coordination required.
Additive change without a default? Breaking under BACKWARD mode. Consumers must be updated to handle the new required field before producers deploy. Consider adding a default to make the change non-breaking instead.
Rename or removal of an existing field? Always breaking. Requires topic versioning and a full dual-write migration.
Type change on an existing field? Always breaking, regardless of how similar the types appear. A string-to-integer change that seems safe will cause silent truncation or deserialization failures in at least one consumer. Requires topic versioning.
Protobuf field number change or reuse? Always breaking. Never reuse a Protobuf field number, even if the old field was removed. Mark removed fields as reserved.
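The decision tree reduces to a small classifier. The shape of the `change` dict is hypothetical, invented here for illustration:

```python
BREAKING_KINDS = {"rename", "remove", "type_change", "field_number_reuse"}


def classify(change: dict) -> str:
    """Map a proposed schema edit onto the decision tree above."""
    kind = change["kind"]
    if kind in BREAKING_KINDS:
        return "breaking: new topic version and dual-write migration"
    if kind == "add":
        if change.get("has_default"):
            return "non-breaking: deploy freely"
        return "breaking under BACKWARD: add a default or coordinate consumers"
    raise ValueError(f"unknown change kind: {kind}")
```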

Schema governance and ownership

Schema as code

Store all schemas in version control inside the repository of the service that owns them. The schema file lives alongside the event definition code. When someone changes the schema, the diff is visible in the pull request. The PR review enforces the change process. The registry is the deployment target, not the source of truth.

Ownership model

The producing domain owns the schema contract. Consumers are notified of changes but do not control them. This mirrors how REST API contracts work: the service that exposes the API owns its definition; clients adapt to versioned changes.

Non-breaking change approval

Requires team lead approval and a PR to the owning service's repository. Registry is updated as part of the deployment. No consumer notification required but recommended.

Breaking change approval

Requires an architectural review, a new topic version, and written notification to all known consumers with a migration timeline. The old topic must have a defined sunset date before the breaking change proceeds.

Schema documentation requirements

Every field in every schema must have:

  • A description explaining the business meaning of the field, not its technical type.
  • The unit of measure where applicable (e.g., "amount in minor currency units, e.g., pence").
  • Any valid range or enum values with their business meaning.

Every schema must have:

  • An owner (team or individual with a contact path).
  • A current version and a changelog since version 1.
  • A deprecation policy stating when and how old versions will be retired.
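A documentation rule is only real if something enforces it. A lint sketch using Avro's per-field `doc` attribute; the function name and schema shape are illustrative:

```python
def undocumented_fields(schema: dict) -> list[str]:
    """Return the names of fields missing a non-empty `doc` describing
    their business meaning, per the documentation rules above."""
    return [f["name"] for f in schema["fields"]
            if not f.get("doc", "").strip()]
```

Run as a CI step alongside the compatibility gate, a non-empty return value fails the build.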

Naming conventions

Event names

Past tense, PascalCase. Something that happened: OrderPlaced, PaymentFailed, InventoryReserved. Never present tense (PlaceOrder is a command, not an event).

Field names

snake_case, no abbreviations. Write customer_id not cust_id. Write order_placed_at not ts. Abbreviations become tribal knowledge that breaks when people leave.

Topic names

Pattern: domain.subdomain.event_type.version. Example: sales.orders.placed.v1. Version in the topic name is the major version. Minor and patch versions are handled within the schema registry.
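The convention can be enforced with a single regular expression; this sketch assumes lower snake_case segments and a `v`-prefixed integer major version:

```python
import re

# domain.subdomain.event_type.version, e.g. sales.orders.placed.v1
TOPIC_RE = re.compile(
    r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.v[0-9]+$")


def is_valid_topic(name: str) -> bool:
    """True when `name` follows the topic naming pattern above."""
    return TOPIC_RE.fullmatch(name) is not None
```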

What breaks without schema enforcement

These are not hypothetical. Each of the following failure modes has caused production incidents at organisations that skipped schema governance, discovered the cost later, and then retrofitted it at 10x the effort.

Scenario: Silent null propagation
  What happens: Producer removes a field that downstream aggregations group on. Consumers receive null and continue silently.
  How it surfaces: Revenue dashboard shows a region with zero revenue. Analyst raises a query ticket.
  Time to detect: Days to weeks.

Scenario: Deserialization cascade
  What happens: Type change causes the consumer to throw on every message. Consumer group lag grows. The downstream pipeline starves.
  How it surfaces: DLQ spike alert fires. On-call investigates consumer lag. Root cause requires binary inspection of old messages.
  Time to detect: Hours (visible), days (resolved).

Scenario: Schema drift
  What happens: After 18 months without enforcement, the same field name means different things in three different schemas. No two consumers agree on what it means.
  How it surfaces: Data reconciliation project. Two teams argue about whose interpretation is correct. Neither can prove it.
  Time to detect: Months.

Scenario: Undetected data loss
  What happens: Producer silently truncates a string field (e.g., postal code) to fit a legacy system constraint. Data lands in BigQuery corrupted but structurally valid.
  How it surfaces: Detected during a compliance audit. Data cannot be recovered because the raw bytes were not preserved in Bronze.
  Time to detect: Audit cycle (months to years).

GCP mapping
Confluent Schema Registry (self-hosted on GKE or Confluent Cloud), Apicurio Registry (open source on GKE), Google Pub/Sub Schemas (Avro/Protobuf topic-level validation), Cloud Artifact Registry (schema file storage alongside service containers), Cloud Build (CI/CD schema compatibility gate).

Key takeaways

  • Schema is an API contract. Treat every schema change with the same process you apply to a public API change: review, versioning, consumer notification, and a deprecation path for breaking changes.
  • Validate at the producer, not the consumer. Producer-side validation catches the problem before it enters the pipeline. Consumer-side validation is a safety net, not the primary control.
  • FULL compatibility is the only safe default for production. BACKWARD and FORWARD modes both allow deployment sequences that can briefly put producers and consumers on incompatible versions. FULL eliminates that window entirely.
  • Pub/Sub native schemas are not a registry substitute. Topic-level schema validation in Pub/Sub is useful but does not embed a schema ID in the message. Without schema ID tracking, replay and cross-system consumption become guesswork.
  • Schema as code enforces the governance model. Schemas stored in version control, reviewed via pull requests, and deployed as part of the CI/CD pipeline are the only way to maintain an audit trail and enforce change ownership at scale.
  • Breaking changes require a migration plan, not a hotfix. Dual-write, topic versioning, and a sunset date are the minimum viable process for a breaking schema change. Skipping any step creates stranded consumers, unrecoverable data loss, or both.

Failure modes

  • ! NONE mode in production: A developer sets compatibility mode to NONE to unblock a deployment. It stays that way. Six months later, the topic has 12 incompatible schema versions and no consumer can replay historical data.
  • ! Schema registry becomes a single point of failure: Producer-side validation calls the registry synchronously. Registry goes down during a deployment window. All producers fail to publish. A caching layer with a fallback to the last-known schema is required.
  • ! Schema drift via documentation-only governance: The registry enforces nothing, so schemas are registered as documentation after the fact. Actual messages diverge from registered schemas within weeks. The registry becomes a lie.
  • ! Field description rot: Schemas are defined once and never updated. A field's meaning changes over time but the description still says "legacy value, do not use." No one knows what the field actually means. Queries return wrong results.
  • ! Dual-write abandoned mid-migration: A breaking change migration starts. The old topic is not sunset on schedule. Six months later, two producers dual-write, three consumers are on the new topic, two are still on the old. The old topic is never fully retired.
  • ! Consumer silently drops unknown fields: The consumer deserializer is configured to ignore unknown fields. A critical new field added by the producer is silently ignored for weeks. The producer team assumes it is being consumed. The consumer team does not know it exists.
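
The registry single-point-of-failure mode above calls for a caching layer with fallback to the last-known schema. A minimal sketch, with the registry call injected as `fetch`; `CachingSchemaClient` is an illustrative name, not a library class:

```python
class CachingSchemaClient:
    """Serve schemas from the registry, falling back to the last
    successfully fetched copy when the registry is unreachable."""

    def __init__(self, fetch):
        self._fetch = fetch            # injected call to the real registry
        self._cache: dict[int, dict] = {}

    def schema_for(self, schema_id: int) -> dict:
        try:
            schema = self._fetch(schema_id)
            self._cache[schema_id] = schema   # refresh last-known copy
            return schema
        except Exception:
            if schema_id in self._cache:
                return self._cache[schema_id]  # registry down: fall back
            raise  # never seen this ID: cannot safely proceed
```

Note the asymmetry: a registry outage degrades gracefully for known schemas but still fails loudly for IDs the client has never resolved, which is the correct behaviour.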