Architecture Documentation | Event-driven Enterprise Data Platform

Architecture documentation is the memory of the platform. This page defines a documentation strategy that serves both consultants delivering a formal handover and internal teams maintaining a long-lived platform. We treat documentation as code: version-controlled, peer-reviewed, and stored alongside the implementation it describes.

What to document and what to skip

Most architecture documentation fails because teams document everything or nothing. Effective documentation focuses on decisions rather than implementations, because the code itself documents the implementation. Document the why, not just the what: future engineers need the reasoning to avoid relitigating the same trade-offs.

Must always be documented

Architectural decisions with trade-offs considered (ADRs)
Component boundaries and domain ownership
Data contracts between domains
Non-functional requirements (NFRs) and how the architecture meets them
Known limitations and accepted technical debt
Runbooks for operational incidents

Skip (Do not document)

Internal implementation details that change frequently
Anything expressed clearly in code or configuration
Diagrams that duplicate what is already in the codebase

Rule of thumb: if a new senior engineer joining the team would ask the question, document the answer.

LeanIX-style documentation

A LeanIX-style approach uses a structured fact sheet per architecture component. This is a documentation pattern, not a product dependency; it works with or without the LeanIX tool. Each fact sheet provides a snapshot of a component's business value, technical health, and risk profile.

Fact Sheet Structure

Component Name: Unique identifier for the system or service.
Business Capability: The specific business function served (e.g., "Order Fulfillment").
Technical Owner: The team accountable for maintenance (not an individual).
Lifecycle Status: Planned / Active / Deprecated / Decommissioned.
Deployment Model: GKE / Cloud Run / Managed Service / etc.
Dependencies: Upstream (consumed) and Downstream (consumers).
Data Classification: Public / Internal / Confidential / Restricted.
SLA: Availability target, RTO, and RPO.

Example: Pub/Sub Ingestion Layer

Component Name:	platform-ingestion-pubsub
Business Capability:	Real-time Data Ingestion
Technical Owner:	Data Platform Team
Lifecycle Status:	Active
Deployment Model:	GCP Managed Service (Global)
Upstream:	Source Domain Connectors (Orders, Inventory)
Downstream:	Dataflow Processing Jobs, BigQuery Subscriptions
Data Classification:	Internal (contains PII in encrypted fields)
SLA:	99.95% Availability; RTO: 15m; RPO: 0 (replicated)
Known Risks:	Regional outage could delay ingestion; cost spikes on high volume.
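
A fact sheet stored next to the component can also be kept as structured data so that a review can flag missing or invalid values. The following is a minimal sketch in Python, not a LeanIX schema: the class name FactSheet, the helper empty_fields, and the field names are assumptions chosen to mirror the structure above, and the values come from the Pub/Sub example.

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace
from enum import Enum


class LifecycleStatus(Enum):
    PLANNED = "Planned"
    ACTIVE = "Active"
    DEPRECATED = "Deprecated"
    DECOMMISSIONED = "Decommissioned"


class DataClassification(Enum):
    PUBLIC = "Public"
    INTERNAL = "Internal"
    CONFIDENTIAL = "Confidential"
    RESTRICTED = "Restricted"


@dataclass(frozen=True)
class FactSheet:
    """Hypothetical in-repo representation of one fact sheet."""
    component_name: str
    business_capability: str
    technical_owner: str  # a team, not an individual
    lifecycle_status: LifecycleStatus
    deployment_model: str
    upstream: tuple[str, ...]
    downstream: tuple[str, ...]
    data_classification: DataClassification
    sla: str
    known_risks: tuple[str, ...] = ()  # optional in this sketch


def empty_fields(sheet: FactSheet) -> list[str]:
    """Return the names of required fields that were left empty."""
    optional = {"known_risks"}
    return [
        f.name
        for f in fields(sheet)
        if f.name not in optional and not getattr(sheet, f.name)
    ]


PUBSUB_INGESTION = FactSheet(
    component_name="platform-ingestion-pubsub",
    business_capability="Real-time Data Ingestion",
    technical_owner="Data Platform Team",
    lifecycle_status=LifecycleStatus("Active"),
    deployment_model="GCP Managed Service (Global)",
    upstream=("Source Domain Connectors (Orders, Inventory)",),
    downstream=("Dataflow Processing Jobs", "BigQuery Subscriptions"),
    data_classification=DataClassification("Internal"),
    sla="99.95% Availability; RTO: 15m; RPO: 0 (replicated)",
    known_risks=("Regional outage could delay ingestion", "Cost spikes on high volume"),
)

if __name__ == "__main__":
    print(empty_fields(PUBSUB_INGESTION))
    print(empty_fields(replace(PUBSUB_INGESTION, technical_owner="")))
```

Using enumerations for lifecycle status and data classification means a typo such as "Actve" fails at load time instead of surfacing during a quarterly review.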

For Consultants

Deliver one fact sheet per major component as part of the architecture handover package to ensure the client has a clear asset inventory.

For Internal Teams

Store fact sheets in the same repository as the component. Review quarterly to ensure metadata reflects the current state.

Architecture Decision Records (ADRs)

ADRs prevent the "re-litigation" of technical choices. Decisions made without documentation are challenged every six months by people who were not in the room. ADRs capture the constraints and context of a specific moment in time.

ADR Template

ADR_TEMPLATE.md

# ADR-XXX: [Title in present tense]

## Status

Proposed / Accepted / Deprecated / Superseded by [ADR-YYY]

## Context

What situation forced this decision? What were the technical, financial, or time constraints?

## Decision

What was decided? State clearly without hedging.

## Alternatives Considered

- Alternative A: Why it was rejected (be honest).

- Alternative B: Why it was rejected.

## Consequences

- What becomes easier?
- What becomes harder?
- What is now a constraint for future decisions?

## Review Date

YYYY-MM-DD (When to revisit this decision)
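
Because the template is plain Markdown, a pull request check can confirm that a new ADR still contains every section. The sketch below assumes the section names from the template above and level-two (##) headings; the function name missing_adr_sections is hypothetical and not part of any existing tool.

```python
import re

# Section headings taken from ADR_TEMPLATE.md above.
REQUIRED_SECTIONS = (
    "Status",
    "Context",
    "Decision",
    "Alternatives Considered",
    "Consequences",
    "Review Date",
)


def missing_adr_sections(markdown_text: str) -> list[str]:
    """Return required sections that have no matching '## ' heading."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+?)\s*$", markdown_text, flags=re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]


if __name__ == "__main__":
    draft = "# ADR-XXX: Example\n\n## Status\n\nProposed\n\n## Context\n\n...\n"
    print(missing_adr_sections(draft))
```

The check only looks at headings; whether each section says anything useful remains a job for peer review.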

Example: Avro schema enforcement

ADR-004: Enforce Avro serialization for Kafka events (Status: Accepted)

Context

We are moving to a multi-domain data mesh. Without schema enforcement, downstream consumers break frequently due to upstream field renames and type changes. Protobuf was considered, but our existing dbt and BigQuery tooling has better native support for Avro.

Decision

All events published to Kafka must use Avro serialization. No message can be produced without a registered schema in the Confluent Schema Registry.

Alternatives Considered

JSON Schema: Rejected due to larger payload size and lack of strict type safety in some target systems.
Protobuf: Rejected because BigQuery external table support for Protobuf on GCS was less mature than Avro at the time of decision.

Consequences

Producers now have a build-time dependency on the schema registry. Schema evolution must follow backward-compatibility rules. Payload sizes are reduced by 40% compared to JSON.

Review Date

2026-12-01

When to write an ADR

Decisions affecting more than one team
Decisions that are hard to reverse
Decisions where reasonable engineers would disagree
Decisions made under significant constraint

When NOT to write an ADR

Routine choices within a single team's boundary
Decisions that can be changed without downstream impact

Management

Store ADRs in version control alongside the architecture (e.g., a /docs/adr folder), never in Confluence. Never delete an ADR; deprecate or supersede it. Link superseding ADRs to the records they replace. Review all ADRs annually as part of architecture governance.

Connecting documentation to platform components

Documentation is only useful if it is findable and linked to the resources it describes. Every major platform component in this blueprint maps to a specific set of documentation artifacts.

Component	Required Artifacts	Key ADR Focus
Pub/Sub / Kafka	Fact Sheet, Runbook, Data Contract	Streaming backbone selection
Dataflow	Fact Sheet, Runbook	Stream processing engine selection
BigQuery	Fact Sheet, Data Contract	Analytical storage selection
Schema Registry	Fact Sheet, ADR	Schema enforcement approach
dbt	Fact Sheet, ADR	Transformation layer selection
Looker	Fact Sheet, ADR	Visualisation layer selection
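
The mapping above can be checked mechanically, for example in CI, so that a component without its required artifacts is visible. This is a minimal sketch: the dictionary reproduces the Required Artifacts column, while the function name missing_artifacts and the way existing artifacts are discovered (passed in as a plain dictionary here) are assumptions.

```python
# Required artifacts per component, copied from the table above.
REQUIRED_ARTIFACTS: dict[str, set[str]] = {
    "Pub/Sub / Kafka": {"Fact Sheet", "Runbook", "Data Contract"},
    "Dataflow": {"Fact Sheet", "Runbook"},
    "BigQuery": {"Fact Sheet", "Data Contract"},
    "Schema Registry": {"Fact Sheet", "ADR"},
    "dbt": {"Fact Sheet", "ADR"},
    "Looker": {"Fact Sheet", "ADR"},
}


def missing_artifacts(existing: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return, per component, the required artifacts that do not exist yet."""
    gaps: dict[str, list[str]] = {}
    for component, required in REQUIRED_ARTIFACTS.items():
        missing = sorted(required - existing.get(component, set()))
        if missing:
            gaps[component] = missing
    return gaps


if __name__ == "__main__":
    # Hypothetical repository state: only Dataflow has any documentation so far.
    found = {"Dataflow": {"Fact Sheet"}}
    for component, missing in missing_artifacts(found).items():
        print(f"{component}: missing {', '.join(missing)}")
```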

How to document data contracts

A data contract is a formal agreement between a producer domain and its consumers. It moves beyond "hope-based" integration to explicit guarantees on schema, reliability, and ownership.

Data Contract Structure

Contract Name/Version: Unique ID and semantic version.
Producer: Team and service producing the data.
Consumers: List of known consumer teams and use cases.
Schema: Link to the schema registry or definition file.
SLA: Latency, availability, and completeness targets.
Breaking Change Policy: Notice period (e.g., 30 days) and sunset timeline.
Owner: Team with write access to this specific contract.

Example: Logistics Domain Contract

VehicleTelemetry-v1.0.0.contract

Producer

Fleet-Management-System (Domain: Logistics)

Topic

gcp.prd.logistics.v1.vehicle-telemetry

Schema

https://registry.internal/logistics/vehicle-telemetry-v1.avsc

SLA

99.9% of messages delivered within 1 second of ingestion.

Consumers

Analytics-Domain (Fleet-Performance-Report), Maintenance-Domain (Predictive-Service)

Policy

Major version changes require 60 days' notice via #logistics-dev.

Owner

Team-Atlas (Logistics Platform)

Treat data contracts as code. Store them in version control; a contract change should be a Pull Request requiring approval from consumer teams if it affects their downstream processing.
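
One way to decide whether a contract pull request needs consumer approval is to compare the old and new schema fields. The sketch below is deliberately simplified: it treats a removed or retyped field as breaking and an added field as non-breaking, which is stricter and cruder than the compatibility rules a schema registry applies. The function name classify_change and the telemetry field names are hypothetical.

```python
def classify_change(old_fields: dict[str, str], new_fields: dict[str, str]) -> str:
    """Classify a schema change as 'major', 'minor', or 'none'.

    Each argument maps a field name to its type name.
    """
    removed = old_fields.keys() - new_fields.keys()
    retyped = {
        name
        for name in old_fields.keys() & new_fields.keys()
        if old_fields[name] != new_fields[name]
    }
    if removed or retyped:
        return "major"  # breaking: consumers must approve, notice period applies
    if new_fields.keys() - old_fields.keys():
        return "minor"  # additive: existing consumers keep working in this model
    return "none"


if __name__ == "__main__":
    # Hypothetical fields for the vehicle telemetry contract.
    v1 = {"vehicle_id": "string", "speed_kmh": "double"}
    v2 = {"vehicle_id": "string", "speed_kmh": "float", "heading": "int"}
    print(classify_change(v1, v2))
```

A result of "major" is the case where the pull request should require approval from the listed consumer teams.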

RACI matrix for documentation ownership

Documentation without ownership decays. Within six months, it becomes stale and untrusted. Every document type must have a clear owner role responsible for its accuracy.

Artifact	Platform Arch	Domain Tech Lead	Eng Team	Enterprise Arch	Product Owner
ADRs	A	R	C	I	I
Fact Sheets	C	A	R	I	I
Data Contracts	I	A	R	I	C
Runbooks	I	I	A/R	-	I
Diagrams	A	C	R	I	-

R = Responsible | A = Accountable | C = Consulted | I = Informed

Every document has exactly one Accountable owner (a role, not a person).
Consulted parties must respond within 3 business days.
Review Cadence: ADRs (Annually), Fact Sheets (Quarterly), Data Contracts (on change), Runbooks (after incident).
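
The rule that every artifact has exactly one Accountable role can be verified against the matrix itself. The following sketch encodes the table above as a dictionary; it assumes that a combined entry such as "A/R" counts as one Accountable role and that "-" means the role is not involved. The function names are hypothetical.

```python
# RACI matrix copied from the table above; "-" marks a role that is not involved.
RACI: dict[str, dict[str, str]] = {
    "ADRs": {"Platform Arch": "A", "Domain Tech Lead": "R", "Eng Team": "C",
             "Enterprise Arch": "I", "Product Owner": "I"},
    "Fact Sheets": {"Platform Arch": "C", "Domain Tech Lead": "A", "Eng Team": "R",
                    "Enterprise Arch": "I", "Product Owner": "I"},
    "Data Contracts": {"Platform Arch": "I", "Domain Tech Lead": "A", "Eng Team": "R",
                       "Enterprise Arch": "I", "Product Owner": "C"},
    "Runbooks": {"Platform Arch": "I", "Domain Tech Lead": "I", "Eng Team": "A/R",
                 "Enterprise Arch": "-", "Product Owner": "I"},
    "Diagrams": {"Platform Arch": "A", "Domain Tech Lead": "C", "Eng Team": "R",
                 "Enterprise Arch": "I", "Product Owner": "-"},
}


def accountable_roles(assignments: dict[str, str]) -> list[str]:
    """Return the roles whose entry includes 'A' (so 'A/R' counts)."""
    return [role for role, code in assignments.items() if "A" in code.split("/")]


def raci_violations(matrix: dict[str, dict[str, str]]) -> list[str]:
    """Return artifacts that do not have exactly one Accountable role."""
    return [
        artifact
        for artifact, assignments in matrix.items()
        if len(accountable_roles(assignments)) != 1
    ]


if __name__ == "__main__":
    print(raci_violations(RACI))
```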

Versioning and change management

Architecture documentation lives in version control, not in Confluence or SharePoint. Changes should be reviewable, reversible, and linked to the code they describe.

Branching Strategy

Documentation changes accompanying code go in the same Pull Request. Documentation-only changes require a separate PR with at least one architecture peer review.

Versioning Conventions

ADRs: Sequential (ADR-001, ADR-002). Never renumber.
Fact Sheets: SemVer-style major.minor (1.0, 1.1). Major version for structural changes.
Data Contracts: SemVer. Major for breaking changes.
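
These conventions are simple enough to automate. The sketch below shows one possible pair of helpers, assuming ADR identifiers of the form ADR-NNN and three-part contract versions such as 1.0.0; the function names next_adr_id and bump_semver are hypothetical.

```python
import re


def next_adr_id(existing_ids: list[str]) -> str:
    """Return the next sequential ADR id.

    Uses the highest existing number rather than the count, so a gap in the
    sequence never causes an id to be reused or renumbered.
    """
    numbers = [
        int(match.group(1))
        for adr_id in existing_ids
        if (match := re.fullmatch(r"ADR-(\d+)", adr_id))
    ]
    return f"ADR-{max(numbers, default=0) + 1:03d}"


def bump_semver(version: str, change: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version; change is 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


if __name__ == "__main__":
    print(next_adr_id(["ADR-001", "ADR-002", "ADR-004"]))
    print(bump_semver("1.0.0", "major"))
```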

Runbook template

A runbook is a documented response for a known failure mode. Every failure mode listed in this blueprint should have a corresponding runbook to reduce mean time to recovery (MTTR).

Example: Pub/Sub Consumer Lag

Trigger: Alert: pubsub_subscription_backlog_seconds > 300
Severity: P2 (High Impact, non-blocking)
Impact: Real-time dashboards show stale data; SLAs for downstream delivery breached.
Owner: Data Platform On-call

Diagnosis Steps

Check Dataflow worker CPU utilization in Cloud Monitoring.
Check whether hot keys are reported in the Dataflow logs.
Confirm that the source system is not emitting an unusually large burst of events.

Resolution Steps

Increase max_workers for the Dataflow job via CLI: gcloud dataflow jobs update...
If lag persists, check for schema-related errors in worker logs and roll back if a recent deploy occurred.

Maintenance Rules

Update runbooks after every incident requiring a deviation from documented steps. Review annually even if never triggered.
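
The two maintenance rules can be expressed as a single check that a scheduled job could run over all runbooks. This is a minimal sketch: it approximates "annually" as 365 days, and the function name and the idea of recording a last-reviewed date per runbook are assumptions.

```python
from datetime import date, timedelta

ANNUAL_REVIEW = timedelta(days=365)  # approximation of "annually"


def runbook_needs_update(
    last_reviewed: date, today: date, deviated_since_review: bool
) -> bool:
    """Return True if either maintenance rule requires attention.

    Rule 1: an incident since the last review deviated from the documented steps.
    Rule 2: the runbook has not been reviewed for a year, triggered or not.
    """
    return deviated_since_review or (today - last_reviewed) >= ANNUAL_REVIEW


if __name__ == "__main__":
    print(runbook_needs_update(date(2024, 1, 1), date(2024, 6, 1), False))
    print(runbook_needs_update(date(2024, 1, 1), date(2024, 6, 1), True))
```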

Key Takeaways

01 Document decisions (ADRs) and contracts, not internal implementation.
02 Use LeanIX-style fact sheets for structured asset management.
03 Store all documentation in version control alongside code.
04 Treat data contracts as formal agreements requiring consumer sign-off.

Checklist

□ ADR directory established in the main repository.
□ Fact sheet created for every major platform component.
□ Runbooks drafted for all listed failure modes.
□ RACI matrix reviewed and an owning role assigned to every artifact.

Failure Modes

The Archaeology Problem

No record of why a tool was chosen, making it impossible to evolve the system without fear.

Stale Documentation

Docs written once at project start and never updated, leading to confusion during incidents.

Documentation Overload

Excessive requirements causing teams to skip documentation entirely to stay fast.