Architecture Documentation

LAST_UPDATED: 2026-03

Architecture documentation is the memory of the platform. This page defines a documentation strategy that serves both consultants delivering a formal handover and internal teams maintaining a long-lived platform. We treat documentation as code: version-controlled, peer-reviewed, and stored alongside the implementation it describes.

What to document and what to skip

Most architecture documentation fails because teams document everything or nothing. Effective documentation focuses on decisions, not implementations—the code itself documents the implementation. Document the why, not just the what. Future engineers need to understand the reasoning to avoid relitigating the same trade-offs.

Must always be documented

  • Architectural decisions with trade-offs considered (ADRs)
  • Component boundaries and domain ownership
  • Data contracts between domains
  • Non-functional requirements (NFRs) and architectural alignment
  • Known limitations and accepted technical debt
  • Runbooks for operational incidents

Skip (Do not document)

  • Internal implementation details that change frequently
  • Anything expressed clearly in code or configuration
  • Diagrams that duplicate what is already in the codebase

Rule of thumb: if a new senior engineer joining the team would ask the question, document the answer.

LeanIX-style documentation

A LeanIX-style approach uses a structured fact sheet per architecture component. This is a documentation pattern, not a product dependency; it works with or without the LeanIX tool. Each fact sheet provides a snapshot of a component's business value, technical health, and risk profile.

Fact Sheet Structure

Component Name: Unique identifier for the system or service.
Business Capability: The specific business function served (e.g., "Order Fulfillment").
Technical Owner: The team accountable for maintenance (not an individual).
Lifecycle Status: Planned / Active / Deprecated / Decommissioned.
Deployment Model: GKE / Cloud Run / Managed Service / etc.
Dependencies: Upstream (consumed) and Downstream (consumers).
Data Classification: Public / Internal / Confidential / Restricted.
SLA: Availability target, RTO, and RPO.
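
The fact sheet fields listed above could be captured as a typed record so that a CI job can reject incomplete sheets. The following is a minimal sketch, not a prescribed schema: the class names, the `problems` method, and the rule that an active component should list at least one dependency are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Lifecycle(Enum):
    PLANNED = "Planned"
    ACTIVE = "Active"
    DEPRECATED = "Deprecated"
    DECOMMISSIONED = "Decommissioned"


class DataClassification(Enum):
    PUBLIC = "Public"
    INTERNAL = "Internal"
    CONFIDENTIAL = "Confidential"
    RESTRICTED = "Restricted"


@dataclass(frozen=True)
class FactSheet:
    component_name: str
    business_capability: str
    technical_owner: str  # a team, not an individual
    lifecycle_status: Lifecycle
    deployment_model: str
    data_classification: DataClassification
    sla: str
    upstream: tuple[str, ...] = ()    # systems this component consumes
    downstream: tuple[str, ...] = ()  # systems that consume this component

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the sheet passes."""
        issues = []
        required_text = ("component_name", "business_capability",
                         "technical_owner", "deployment_model", "sla")
        for name in required_text:
            if not getattr(self, name).strip():
                issues.append(f"{name} is empty")
        # Assumed rule (not from the source): an active component with no
        # dependencies at all is probably an incomplete sheet.
        if self.lifecycle_status is Lifecycle.ACTIVE and not (self.upstream or self.downstream):
            issues.append("active component lists no dependencies")
        return issues
```

Using enums for Lifecycle Status and Data Classification means a typo such as "Actve" fails at parse time (`Lifecycle("Actve")` raises `ValueError`) instead of silently entering the inventory.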

Example: Pub/Sub Ingestion Layer

Component Name: platform-ingestion-pubsub
Business Capability: Real-time Data Ingestion
Technical Owner: Data Platform Team
Lifecycle Status: Active
Deployment Model: GCP Managed Service (Global)
Upstream: Source Domain Connectors (Orders, Inventory)
Downstream: Dataflow Processing Jobs, BigQuery Subscriptions
Data Classification: Internal (contains PII in encrypted fields)
SLA: 99.95% Availability; RTO: 15m; RPO: 0 (replicated)
Known Risks: Regional outage could delay ingestion; cost spikes on high volume.

For Consultants

Deliver one fact sheet per major component as part of the architecture handover package to ensure the client has a clear asset inventory.

For Internal Teams

Store fact sheets in the same repository as the component. Review quarterly to ensure metadata reflects the current state.

Architecture Decision Records (ADRs)

ADRs prevent the "re-litigation" of technical choices. Decisions made without documentation are challenged every six months by people who were not in the room. ADRs capture the constraints and context of a specific moment in time.

ADR Template

ADR_TEMPLATE.md
# ADR-XXX: [Title in present tense]
## Status
Proposed / Accepted / Deprecated / Superseded by [ADR-YYY]
## Context
What situation forced this decision? What were the technical, financial, or time constraints?
## Decision
What was decided? State clearly without hedging.
## Alternatives Considered
- Alternative A: Why it was rejected (be honest).
- Alternative B: Why it was rejected.
## Consequences
- What becomes easier?
- What becomes harder?
- What is now a constraint for future decisions?
## Review Date
YYYY-MM-DD (When to revisit this decision)

Example: Avro Schema Enforcement on Kafka

ADR-004: Enforce Avro schemas on all Kafka events [ACCEPTED]
Context
We are moving to a multi-domain data mesh. Without schema enforcement, downstream consumers break frequently due to upstream field renames and type changes. Protobuf was considered, but our existing dbt and BigQuery tooling has better native support for Avro.
Decision
All events published to Kafka must use Avro serialization. No message can be produced without a registered schema in the Confluent Schema Registry.
Alternatives
  • JSON Schema: Rejected due to larger payload size and lack of strict type safety in some target systems.
  • Protobuf: Rejected because BigQuery external table support for Protobuf on GCS was less mature than Avro at the time of decision.
Consequences
Producers now have a build-time dependency on the schema registry. Schema evolution must follow backward-compatibility rules. Payload sizes are reduced by 40% compared to JSON.
Review Date
2026-12-01

When to write an ADR

  • Decisions affecting more than one team
  • Decisions that are hard to reverse
  • Decisions where reasonable engineers would disagree
  • Decisions made under significant constraint

When NOT to write an ADR

  • Routine choices within a single team's boundary
  • Decisions that can be changed without downstream impact

Management

Store ADRs in version control alongside the architecture (e.g., a /docs/adr folder), never in Confluence. Never delete an ADR; deprecate or supersede it. Link superseding ADRs to the records they replace. Review all ADRs annually as part of architecture governance.
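
These management rules lend themselves to an automated check in the repository. The sketch below parses ADR files that follow the template on this page and reports gaps in numbering, unknown statuses, and superseded records whose replacement is missing. The function names and the assumption that numbering starts at ADR-001 with no gaps are illustrative, not part of the blueprint.

```python
import re

VALID_STATUSES = ("Proposed", "Accepted", "Deprecated", "Superseded")
TITLE_RE = re.compile(r"^# ADR-(\d{3}): (.+)$", re.MULTILINE)
STATUS_RE = re.compile(r"^## Status\s*\n(.+)$", re.MULTILINE)
SUPERSEDED_RE = re.compile(r"Superseded by \[?ADR-(\d{3})\]?")


def parse_adr(text: str) -> dict:
    """Extract number, title, status, and supersession link from one ADR."""
    title = TITLE_RE.search(text)
    status = STATUS_RE.search(text)
    if title is None or status is None:
        raise ValueError("ADR is missing a title or a Status section")
    status_line = status.group(1).strip()
    superseded = SUPERSEDED_RE.search(status_line)
    return {
        "number": int(title.group(1)),
        "title": title.group(2).strip(),
        "status": status_line.split()[0],
        "superseded_by": int(superseded.group(1)) if superseded else None,
    }


def check_adrs(adrs: list[dict]) -> list[str]:
    """Return a list of rule violations; an empty list means all checks pass."""
    problems = []
    numbers = sorted(a["number"] for a in adrs)
    # ADRs are never deleted or renumbered, so numbers should run 1..N
    # (assumption: the series starts at ADR-001).
    if numbers != list(range(1, len(numbers) + 1)):
        problems.append(f"ADR numbers are not sequential: {numbers}")
    known = set(numbers)
    for a in adrs:
        if a["status"] not in VALID_STATUSES:
            problems.append(f"ADR-{a['number']:03d}: unknown status {a['status']!r}")
        if a["status"] == "Superseded" and a["superseded_by"] not in known:
            problems.append(f"ADR-{a['number']:03d}: superseding ADR not found")
    return problems
```

A check like this could run on every Pull Request that touches the ADR folder, reading each file's text and passing the parsed records to `check_adrs`.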

Connecting documentation to platform components

Documentation is only useful if it is findable and linked to the resources it describes. Every major platform component in this blueprint maps to a specific set of documentation artifacts.

Component        | Required Artifacts                 | Key ADR Focus
Pub/Sub / Kafka  | Fact Sheet, Runbook, Data Contract | Streaming backbone selection
Dataflow         | Fact Sheet, Runbook                | Stream processing engine selection
BigQuery         | Fact Sheet, Data Contract          | Analytical storage selection
Schema Registry  | Fact Sheet, ADR                    | Schema enforcement approach
dbt              | Fact Sheet, ADR                    | Transformation layer selection
Looker           | Fact Sheet, ADR                    | Visualisation layer selection
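
The component-to-artifact mapping above can double as a completeness check. This is a minimal sketch: the mapping is copied from the table, while `missing_artifacts` and the idea of passing in the set of artifacts found for a component are illustrative assumptions about how a team might wire it up.

```python
# Required artifacts per component, as listed in the table above.
REQUIRED_ARTIFACTS = {
    "Pub/Sub / Kafka": {"Fact Sheet", "Runbook", "Data Contract"},
    "Dataflow": {"Fact Sheet", "Runbook"},
    "BigQuery": {"Fact Sheet", "Data Contract"},
    "Schema Registry": {"Fact Sheet", "ADR"},
    "dbt": {"Fact Sheet", "ADR"},
    "Looker": {"Fact Sheet", "ADR"},
}


def missing_artifacts(component: str, present: set[str]) -> list[str]:
    """Return the required artifacts that were not found, sorted by name."""
    required = REQUIRED_ARTIFACTS[component]
    return sorted(required - present)


def coverage_report(found: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each component with gaps to the artifacts it still lacks."""
    report = {}
    for component in REQUIRED_ARTIFACTS:
        gaps = missing_artifacts(component, found.get(component, set()))
        if gaps:
            report[component] = gaps
    return report
```

For example, `missing_artifacts("Dataflow", {"Fact Sheet"})` returns `["Runbook"]`.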

How to document data contracts

A data contract is a formal agreement between a producer domain and its consumers. It moves beyond "hope-based" integration to explicit guarantees on schema, reliability, and ownership.

Data Contract Structure

  • Contract Name/Version: Unique ID and semantic version.
  • Producer: Team and service producing the data.
  • Consumers: List of known consumer teams and use cases.
  • Schema: Link to the schema registry or definition file.
  • SLA: Latency, availability, and completeness targets.
  • Breaking Change Policy: Notice period (e.g., 30 days) and sunset timeline.
  • Owner: Team with write access to this specific contract.

Example: Logistics Domain Contract

VehicleTelemetry-v1.0.0.contract
Producer: Fleet-Management-System (Domain: Logistics)
Topic: gcp.prd.logistics.v1.vehicle-telemetry
Schema: https://registry.internal/logistics/vehicle-telemetry-v1.avsc
SLA: 99.9% of messages delivered within 1 second of ingestion.
Consumers: Analytics-Domain (Fleet-Performance-Report), Maintenance-Domain (Predictive-Service)
Policy: Major version changes require 60 days' notice via #logistics-dev.
Owner: Team-Atlas (Logistics Platform)

Treat data contracts as code. Store them in version control; a contract change should be a Pull Request requiring approval from consumer teams if it affects their downstream processing.
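
The consumer sign-off rule could be enforced by a merge check along these lines. This is a sketch under stated assumptions: the `DataContract` record, the `missing_consumer_approvals` helper, and the idea that approvals arrive as a set of team names are hypothetical; how approvals are actually collected depends on the code-hosting platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    name: str
    version: str
    producer: str
    owner: str
    consumers: frozenset[str]


def missing_consumer_approvals(
    contract: DataContract,
    affected_consumers: set[str],
    approvers: set[str],
) -> set[str]:
    """Return affected consumer teams that have not yet approved the change."""
    unknown = affected_consumers - contract.consumers
    if unknown:
        # An affected team that is not a registered consumer suggests the
        # contract's consumer list is out of date.
        raise ValueError(f"not registered as consumers: {sorted(unknown)}")
    return affected_consumers - approvers


# Values taken from the Logistics example above.
telemetry = DataContract(
    name="VehicleTelemetry",
    version="1.0.0",
    producer="Fleet-Management-System",
    owner="Team-Atlas",
    consumers=frozenset({"Analytics-Domain", "Maintenance-Domain"}),
)
```

A Pull Request would be mergeable only when `missing_consumer_approvals` returns an empty set for the consumers the change affects.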

RACI matrix for documentation ownership

Documentation without ownership decays. Within six months, it becomes stale and untrusted. Every document type must have a clear owner role responsible for its accuracy.

Artifact       | Platform Arch | Domain Tech Lead | Eng Team | Enterprise Arch | Product Owner
ADRs           | A             | R                | C        | I               | I
Fact Sheets    | C             | A                | R        | I               | I
Data Contracts | I             | A                | R        | I               | C
Runbooks       | I             | I                | A/R      | -               | I
Diagrams       | A             | C                | R        | I               | -

R = Responsible | A = Accountable | C = Consulted | I = Informed

  • Every document has exactly one Accountable owner (a role, not a person).
  • Consulted parties must respond within 3 business days.
  • Review Cadence: ADRs (Annually), Fact Sheets (Quarterly), Data Contracts (on change), Runbooks (after incident).
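
The "exactly one Accountable owner" rule is easy to verify mechanically. The sketch below encodes the matrix as read from this page and flags any artifact with zero or several Accountable roles; the data layout and function names are illustrative assumptions.

```python
ROLES = ("Platform Arch", "Domain Tech Lead", "Eng Team",
         "Enterprise Arch", "Product Owner")

# One row per artifact, one cell per role, in the order of ROLES.
RACI = {
    "ADRs":           ("A", "R", "C", "I", "I"),
    "Fact Sheets":    ("C", "A", "R", "I", "I"),
    "Data Contracts": ("I", "A", "R", "I", "C"),
    "Runbooks":       ("I", "I", "A/R", "-", "I"),
    "Diagrams":       ("A", "C", "R", "I", "-"),
}


def accountable_roles(cells: tuple[str, ...]) -> list[str]:
    """Return the roles marked Accountable; a combined cell like 'A/R' counts."""
    return [role for role, cell in zip(ROLES, cells) if "A" in cell.split("/")]


def raci_problems(matrix: dict[str, tuple[str, ...]]) -> list[str]:
    """Return violations of the one-Accountable-per-artifact rule."""
    problems = []
    for artifact, cells in matrix.items():
        if len(cells) != len(ROLES):
            problems.append(f"{artifact}: expected {len(ROLES)} cells, got {len(cells)}")
            continue
        owners = accountable_roles(cells)
        if len(owners) != 1:
            problems.append(f"{artifact}: {len(owners)} Accountable roles")
    return problems
```

Running `raci_problems(RACI)` on the matrix above returns an empty list, and the check will catch a later edit that accidentally assigns a second Accountable role.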

Versioning and change management

Architecture documentation lives in version control—not Confluence or SharePoint. Changes should be reviewable, reversible, and linked to the code they describe.

Branching Strategy

Documentation changes accompanying code go in the same Pull Request. Documentation-only changes require a separate PR with at least one architecture peer review.

Versioning Conventions
  • ADRs: Sequential (ADR-001, ADR-002). Never renumber.
  • Fact Sheets: Major.minor versions (1.0, 1.1). Increment the major version for structural changes.
  • Data Contracts: SemVer (e.g., 1.0.0). Increment the major version for breaking changes.
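
These conventions can be sketched in a few helper functions. The definition of "breaking" used here (a field was removed or its type changed) is a simplifying assumption rather than the full compatibility rules of any schema registry, and the function and field names are hypothetical.

```python
def required_bump(old_fields: dict[str, str], new_fields: dict[str, str]) -> str:
    """Classify a schema change given {field name: type name} mappings."""
    removed = old_fields.keys() - new_fields.keys()
    retyped = {name for name in old_fields.keys() & new_fields.keys()
               if old_fields[name] != new_fields[name]}
    if removed or retyped:
        return "major"  # assumed breaking: consumers may fail to read the data
    if new_fields.keys() - old_fields.keys():
        return "minor"  # additive change
    return "patch"      # no schema change, e.g. a documentation fix


def bump(version: str, level: str) -> str:
    """Apply a SemVer bump to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level}")


def next_adr_id(existing: list[str]) -> str:
    """Return the next sequential ADR identifier; existing IDs are never reused."""
    highest = max((int(adr_id.split("-")[1]) for adr_id in existing), default=0)
    return f"ADR-{highest + 1:03d}"
```

For instance, adding a field to a contract at version 1.0.0 gives `bump("1.0.0", "minor")`, which is `"1.1.0"`, while removing a field calls for `"2.0.0"`.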

Runbook template

A runbook is a documented response for a known failure mode. Every failure mode listed in this blueprint should have a corresponding runbook to reduce mean time to recovery (MTTR).

Example: Pub/Sub Consumer Lag

Trigger
Alert: pubsub_subscription_backlog_seconds > 300
Severity
P2 (High Impact, non-blocking)
Impact
Real-time dashboards show stale data; SLAs for downstream delivery breached.
Owner
Data Platform On-call
Diagnosis Steps
  1. Check Dataflow worker CPU utilization in Cloud Monitoring.
  2. Verify if "Hot Keys" are reported in the Dataflow logs.
  3. Confirm source system is not emitting a massive burst of events.
Resolution Steps
  1. Increase max_workers for the Dataflow job via CLI: gcloud dataflow jobs update...
  2. If lag persists, check for schema-related errors in worker logs and roll back if a recent deploy occurred.

Maintenance Rules

Update runbooks after every incident requiring a deviation from documented steps. Review annually even if never triggered.
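
Time-based review rules such as the annual runbook review are easy to forget, so a scheduled job could flag overdue documents. The sketch below is illustrative only: the interval values approximate "annually" as 365 days and "quarterly" as 90 days, and the document type keys and function names are assumptions. Data contracts are omitted because they are reviewed on change rather than on a schedule.

```python
from datetime import date, timedelta

# Approximate review intervals in days (assumed values, see lead-in).
REVIEW_INTERVAL_DAYS = {
    "adr": 365,
    "fact_sheet": 90,
    "runbook": 365,
}


def review_due(doc_type: str, last_reviewed: date, today: date) -> bool:
    """Return True when more than the allowed interval has passed."""
    interval = timedelta(days=REVIEW_INTERVAL_DAYS[doc_type])
    return today - last_reviewed > interval


def overdue(docs: list[tuple[str, str, date]], today: date) -> list[str]:
    """Return names of overdue documents from (name, doc_type, last_reviewed) rows."""
    return [name for name, doc_type, last_reviewed in docs
            if review_due(doc_type, last_reviewed, today)]
```

The last-reviewed date could come from a field in the document itself or from the date of the most recent commit that touched the file; either way, the check only supplements the incident-driven updates described above.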


Key Takeaways

  1. Document decisions (ADRs) and contracts, not internal implementation.
  2. Use LeanIX-style fact sheets for structured asset management.
  3. Store all documentation in version control alongside code.
  4. Treat data contracts as formal agreements requiring consumer sign-off.

Checklist

  • ADR directory established in the main repository.
  • Fact sheet created for every major platform component.
  • Runbooks drafted for all listed failure modes.
  • RACI matrix reviewed and owners assigned roles.

Failure Modes

The Archaeology Problem

No record of why a tool was chosen, making it impossible to evolve the system without fear.

Stale Documentation

Docs written once at project start and never updated, leading to confusion during incidents.

Documentation Overload

Excessive requirements causing teams to skip documentation entirely to stay fast.