Architecture documentation is the memory of the platform. This page defines a documentation strategy that serves both consultants delivering a formal handover and internal teams maintaining a long-lived platform. We treat documentation as code: version-controlled, peer-reviewed, and stored alongside the implementation it describes.
What to document and what to skip
Most architecture documentation fails because teams document everything or nothing. Effective documentation focuses on decisions, not implementations—the code itself documents the implementation. Document the why, not just the what. Future engineers need to understand the reasoning to avoid relitigating the same trade-offs.
Must always be documented
- Architectural decisions with trade-offs considered (ADRs)
- Component boundaries and domain ownership
- Data contracts between domains
- Non-functional requirements (NFRs) and how the architecture meets them
- Known limitations and accepted technical debt
- Runbooks for operational incidents
Skip (Do not document)
- Internal implementation details that change frequently
- Anything expressed clearly in code or configuration
- Diagrams that duplicate what is already in the codebase
Rule of thumb: if a new senior engineer joining the team would ask the question, document the answer.
LeanIX-style documentation
A LeanIX-style approach uses a structured fact sheet per architecture component. This is a documentation pattern, not a product dependency; it works with or without the LeanIX tool. Each fact sheet provides a snapshot of a component's business value, technical health, and risk profile.
Fact Sheet Structure
- Component Name: Unique identifier for the system or service.
- Business Capability: The specific business function served (e.g., "Order Fulfillment").
- Technical Owner: The team accountable for maintenance (not an individual).
- Lifecycle Status: Planned / Active / Deprecated / Decommissioned.
- Deployment Model: GKE / Cloud Run / Managed Service / etc.
- Dependencies: Upstream (consumed) and Downstream (consumers).
- Data Classification: Public / Internal / Confidential / Restricted.
- SLA: Availability target, RTO, and RPO.
Example: Pub/Sub Ingestion Layer
| Field | Value |
|---|---|
| Component Name | platform-ingestion-pubsub |
| Business Capability | Real-time Data Ingestion |
| Technical Owner | Data Platform Team |
| Lifecycle Status | Active |
| Deployment Model | GCP Managed Service (Global) |
| Upstream | Source Domain Connectors (Orders, Inventory) |
| Downstream | Dataflow Processing Jobs, BigQuery Subscriptions |
| Data Classification | Internal (contains PII in encrypted fields) |
| SLA | 99.95% Availability; RTO: 15m; RPO: 0 (replicated) |
| Known Risks | Regional outage could delay ingestion; cost spikes on high volume. |
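A fact sheet can also be kept as structured data so that required fields and allowed values are checked during review. The sketch below is one possible shape, not a LeanIX API: the `FactSheet` class, its field names, and the `validate` method are illustrative assumptions, while the allowed values come from the fact sheet structure above.

```python
from dataclasses import dataclass, field

# Allowed values taken from the fact sheet structure above.
LIFECYCLE_STATUSES = {"Planned", "Active", "Deprecated", "Decommissioned"}
DATA_CLASSIFICATIONS = {"Public", "Internal", "Confidential", "Restricted"}


@dataclass
class FactSheet:
    """Hypothetical in-repo representation of one fact sheet."""
    component_name: str
    business_capability: str
    technical_owner: str  # a team, not an individual
    lifecycle_status: str
    deployment_model: str
    data_classification: str
    sla: str
    upstream: list[str] = field(default_factory=list)
    downstream: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the sheet is well-formed."""
        problems = []
        if not self.component_name.strip():
            problems.append("component_name is empty")
        if self.lifecycle_status not in LIFECYCLE_STATUSES:
            problems.append(f"unknown lifecycle status: {self.lifecycle_status}")
        if self.data_classification not in DATA_CLASSIFICATIONS:
            problems.append(f"unknown data classification: {self.data_classification}")
        return problems


sheet = FactSheet(
    component_name="platform-ingestion-pubsub",
    business_capability="Real-time Data Ingestion",
    technical_owner="Data Platform Team",
    lifecycle_status="Active",
    deployment_model="GCP Managed Service (Global)",
    data_classification="Internal",
    sla="99.95% availability; RTO 15m; RPO 0",
    upstream=["Source Domain Connectors"],
    downstream=["Dataflow Processing Jobs", "BigQuery Subscriptions"],
)
print(sheet.validate())
```

A check like this could run in CI on every pull request that touches a fact sheet, so that a typo in a status value is caught before the quarterly review.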
For Consultants
Deliver one fact sheet per major component as part of the architecture handover package to ensure the client has a clear asset inventory.
For Internal Teams
Store fact sheets in the same repository as the component. Review quarterly to ensure metadata reflects the current state.
Architecture Decision Records (ADRs)
ADRs prevent the relitigation of technical choices. Decisions made without documentation tend to be challenged every six months by people who were not in the room. An ADR captures the constraints and context that applied at the moment the decision was made.
ADR Template
- What becomes easier?
- What becomes harder?
- What is now a constraint for future decisions?
Example: Kafka vs. Pub/Sub
- JSON Schema: Rejected due to larger payload size and lack of strict type safety in some target systems.
- Protobuf: Rejected because BigQuery external table support for Protobuf on GCS was less mature than Avro at the time of decision.
When to write an ADR
- Decisions affecting more than one team
- Decisions that are hard to reverse
- Decisions where reasonable engineers would disagree
- Decisions made under significant constraint
When NOT to write an ADR
- Routine choices within a single team's boundary
- Decisions that can be changed without downstream impact
Management
Store ADRs in version control alongside the architecture (e.g., a /docs/adr folder), never in Confluence. Never delete an ADR; deprecate or supersede it. Link superseding ADRs to the records they replace. Review all ADRs annually as part of architecture governance.
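The rule that superseded ADRs stay in the repository and link to their replacement can be checked mechanically. The following is a minimal sketch, assuming ADR metadata has already been parsed into records; the `AdrRecord` type, the status names, and the `superseded_by` field are hypothetical and not part of any standard ADR tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdrRecord:
    """Hypothetical metadata parsed from one ADR file."""
    adr_id: str                          # e.g. "ADR-001"
    status: str                          # assumed values: "Accepted", "Deprecated", "Superseded"
    superseded_by: Optional[str] = None  # ID of the replacing ADR, if any


def check_supersede_links(records: list[AdrRecord]) -> list[str]:
    """Report superseded ADRs with no link, and links to ADRs that do not exist."""
    known_ids = {record.adr_id for record in records}
    problems = []
    for record in records:
        if record.status == "Superseded" and record.superseded_by is None:
            problems.append(f"{record.adr_id}: superseded but no replacement linked")
        elif record.superseded_by is not None and record.superseded_by not in known_ids:
            problems.append(f"{record.adr_id}: links to missing {record.superseded_by}")
    return problems


adrs = [
    AdrRecord("ADR-001", "Superseded", superseded_by="ADR-003"),
    AdrRecord("ADR-002", "Superseded"),
    AdrRecord("ADR-003", "Accepted"),
]
for problem in check_supersede_links(adrs):
    print(problem)
```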
Connecting documentation to platform components
Documentation is only useful if it is findable and linked to the resources it describes. Every major platform component in this blueprint maps to a specific set of documentation artifacts.
| Component | Required Artifacts | Key ADR Focus |
|---|---|---|
| Pub/Sub / Kafka | Fact Sheet, Runbook, Data Contract | Streaming backbone selection |
| Dataflow | Fact Sheet, Runbook | Stream processing engine selection |
| BigQuery | Fact Sheet, Data Contract | Analytical storage selection |
| Schema Registry | Fact Sheet, ADR | Schema enforcement approach |
| dbt | Fact Sheet, ADR | Transformation layer selection |
| Looker | Fact Sheet, ADR | Visualisation layer selection |
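The mapping in the table above can double as a completeness check. As a rough sketch, assuming some other step has already discovered which artifacts exist for each component (the `existing` dictionary and the `missing_artifacts` function are hypothetical), the gaps fall out of a set difference:

```python
# Required artifacts per component, copied from the table above.
REQUIRED_ARTIFACTS = {
    "Pub/Sub / Kafka": {"Fact Sheet", "Runbook", "Data Contract"},
    "Dataflow": {"Fact Sheet", "Runbook"},
    "BigQuery": {"Fact Sheet", "Data Contract"},
    "Schema Registry": {"Fact Sheet", "ADR"},
    "dbt": {"Fact Sheet", "ADR"},
    "Looker": {"Fact Sheet", "ADR"},
}


def missing_artifacts(existing: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return, per component, the required artifacts that were not found."""
    gaps = {}
    for component, required in REQUIRED_ARTIFACTS.items():
        missing = required - existing.get(component, set())
        if missing:
            gaps[component] = sorted(missing)
    return gaps


# Hypothetical inventory: only two components have any documentation so far.
existing = {
    "Dataflow": {"Fact Sheet"},
    "BigQuery": {"Fact Sheet", "Data Contract"},
}
gaps = missing_artifacts(existing)
print(gaps["Dataflow"])
```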
How to document data contracts
A data contract is a formal agreement between a producer domain and its consumers. It moves beyond "hope-based" integration to explicit guarantees on schema, reliability, and ownership.
Data Contract Structure
- Contract Name/Version: Unique ID and semantic version.
- Producer: Team and service producing the data.
- Consumers: List of known consumer teams and use cases.
- Schema: Link to the schema registry or definition file.
- SLA: Latency, availability, and completeness targets.
- Breaking Change Policy: Notice period (e.g., 30 days) and sunset timeline.
- Owner: Team with write access to this specific contract.
Example: Logistics Domain Contract
Treat data contracts as code. Store them in version control; a contract change should be a Pull Request requiring approval from consumer teams if it affects their downstream processing.
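One way to make the consumer-approval rule enforceable is to represent the contract as data and flag major-version bumps. The sketch below is illustrative only: the `DataContract` type mirrors the structure listed earlier, every value is a placeholder, and treating a major-version increase as the signal for a breaking change is an assumption based on the SemVer convention.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract record; fields follow the structure listed above."""
    name: str
    version: str                      # semantic version "MAJOR.MINOR.PATCH"
    producer: str
    consumers: tuple[str, ...]
    schema_ref: str                   # link to the schema registry or definition file
    sla: str
    breaking_change_notice_days: int
    owner: str


def major_version(version: str) -> int:
    """Extract MAJOR from a "MAJOR.MINOR.PATCH" string."""
    major, _minor, _patch = (int(part) for part in version.split("."))
    return major


def needs_consumer_approval(old: DataContract, new: DataContract) -> bool:
    """Treat a major-version bump as a breaking change that consumers must approve."""
    return major_version(new.version) > major_version(old.version)


# Placeholder values for illustration only.
v1 = DataContract(
    name="orders-events",
    version="1.2.0",
    producer="Orders Team",
    consumers=("Data Platform Team",),
    schema_ref="schemas/orders-events.avsc",
    sla="placeholder latency and availability targets",
    breaking_change_notice_days=30,
    owner="Orders Team",
)
v2 = replace(v1, version="2.0.0")
print(needs_consumer_approval(v1, v2))
```

A pull-request check could call `needs_consumer_approval` on the contract before and after the change and require reviewers from each team in `consumers` when it returns true.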
RACI matrix for documentation ownership
Documentation without ownership decays. Within six months, it becomes stale and untrusted. Every document type must have a clear owner role responsible for its accuracy.
| Artifact | Platform Arch | Domain Tech Lead | Eng Team | Enterprise Arch | Product Owner |
|---|---|---|---|---|---|
| ADRs | A | R | C | I | I |
| Fact Sheets | C | A | R | I | I |
| Data Contracts | I | A | R | I | C |
| Runbooks | I | I | A/R | - | I |
| Diagrams | A | C | R | I | - |
R = Responsible | A = Accountable | C = Consulted | I = Informed
- Every document has exactly one Accountable owner (a role, not a person).
- Consulted parties must respond within 3 business days.
- Review Cadence: ADRs (Annually), Fact Sheets (Quarterly), Data Contracts (on change), Runbooks (after incident).
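The "exactly one Accountable owner" rule is easy to verify if the matrix is kept as data. This is a minimal sketch: the `RACI` dictionary transcribes the table above, and the helper names are hypothetical.

```python
ROLES = ["Platform Arch", "Domain Tech Lead", "Eng Team", "Enterprise Arch", "Product Owner"]

# One row per artifact, in the same column order as ROLES.
RACI = {
    "ADRs": ["A", "R", "C", "I", "I"],
    "Fact Sheets": ["C", "A", "R", "I", "I"],
    "Data Contracts": ["I", "A", "R", "I", "C"],
    "Runbooks": ["I", "I", "A/R", "-", "I"],
    "Diagrams": ["A", "C", "R", "I", "-"],
}


def accountable_roles(row: list[str]) -> list[str]:
    """Return the roles whose cell includes 'A' (a combined cell such as 'A/R' counts)."""
    return [role for role, cell in zip(ROLES, row) if "A" in cell.split("/")]


def check_single_accountable(matrix: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return artifacts that do not have exactly one Accountable role."""
    violations = {}
    for artifact, row in matrix.items():
        owners = accountable_roles(row)
        if len(owners) != 1:
            violations[artifact] = owners
    return violations


print(check_single_accountable(RACI))
```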
Versioning and change management
Architecture documentation lives in version control—not Confluence or SharePoint. Changes should be reviewable, reversible, and linked to the code they describe.
Branching Strategy
Documentation changes accompanying code go in the same Pull Request. Documentation-only changes require a separate PR with at least one architecture peer review.
Versioning Conventions
- ADRs: Sequential (ADR-001, ADR-002). Never renumber.
- Fact Sheets: SemVer-style major.minor (1.0, 1.1). Major version for structural changes.
- Data Contracts: SemVer. Major for breaking changes.
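These conventions can be sketched as two small helpers. Both function names are hypothetical, and the reading that a non-breaking contract change bumps the minor version is an assumption drawn from general SemVer practice rather than from this page.

```python
import re


def next_adr_id(existing_ids: list[str]) -> str:
    """Return the next sequential ADR ID; existing IDs are never renumbered."""
    numbers = []
    for adr_id in existing_ids:
        match = re.fullmatch(r"ADR-(\d+)", adr_id)
        if match is None:
            raise ValueError(f"unexpected ADR ID format: {adr_id}")
        numbers.append(int(match.group(1)))
    next_number = max(numbers, default=0) + 1
    return f"ADR-{next_number:03d}"


def bump_contract_version(version: str, breaking: bool) -> str:
    """Bump MAJOR for a breaking change, otherwise MINOR (assumed convention)."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if breaking:
        return f"{major + 1}.0.0"
    return f"{major}.{minor + 1}.0"


print(next_adr_id(["ADR-001", "ADR-002"]))
print(bump_contract_version("1.4.2", breaking=True))
```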
Runbook template
A runbook is a documented response for a known failure mode. Every failure mode listed in this blueprint should have a corresponding runbook to reduce mean time to recovery (MTTR).
Example: Pub/Sub Consumer Lag
- Trigger: Alert: pubsub_subscription_backlog_seconds > 300
- Severity: P2 (High Impact, non-blocking)
- Impact: Real-time dashboards show stale data; SLAs for downstream delivery breached.
- Owner: Data Platform On-call
Diagnosis Steps
- Check Dataflow worker CPU utilization in Cloud Monitoring.
- Verify if "Hot Keys" are reported in the Dataflow logs.
- Confirm source system is not emitting a massive burst of events.
Resolution Steps
- Increase max_workers for the Dataflow job via CLI: gcloud dataflow jobs update...
- If lag persists, check for schema-related errors in worker logs and roll back if a recent deploy occurred.
Maintenance Rules
Update runbooks after every incident requiring a deviation from documented steps. Review annually even if never triggered.
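The two maintenance rules can be expressed as a single check. The sketch below assumes that "annually" means 365 days and that each runbook carries a last-reviewed date and a flag for whether the most recent incident deviated from the documented steps; the function and parameter names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "review annually" is interpreted as every 365 days.
REVIEW_INTERVAL = timedelta(days=365)


def runbook_needs_update(
    last_reviewed: date,
    today: date,
    last_incident_deviated: bool = False,
) -> bool:
    """Return True if the runbook is due for an update or review."""
    if last_incident_deviated:
        # An incident required steps that the runbook did not document.
        return True
    # Otherwise the annual review applies even if the runbook never fired.
    return today - last_reviewed >= REVIEW_INTERVAL


print(runbook_needs_update(date(2023, 1, 1), today=date(2024, 6, 1)))
```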
Key Takeaways
- 01 Document decisions (ADRs) and contracts, not internal implementation.
- 02 Use LeanIX-style fact sheets for structured asset management.
- 03 Store all documentation in version control alongside code.
- 04 Treat data contracts as formal agreements requiring consumer sign-off.
Checklist
- □ ADR directory established in the main repository.
- □ Fact sheet created for every major platform component.
- □ Runbooks drafted for all listed failure modes.
- □ RACI matrix reviewed and an owner role assigned for every artifact.
Failure Modes
The Archaeology Problem
No record of why a tool was chosen, making it impossible to evolve the system without fear.
Stale Documentation
Docs written once at project start and never updated, leading to confusion during incidents.
Documentation Overload
Excessive requirements causing teams to skip documentation entirely to stay fast.