Security and governance are not optional add-ons. They must be built into the platform from the start: data classified at ingestion, access enforced by IAM policy, and every sensitive field masked before it lands in the lake. Retrofitting any of this is far harder, because data that landed unclassified or unmasked has to be tracked down in every table and copy derived from it.
Key Takeaways
- 01 Least-privilege access via group-based identities.
- 02 Mandatory data classification tagging.
- 03 Encryption at rest and in transit as a global standard.
- 04 Federated governance: Standards set centrally, enforced locally.
Checklist
- □ Access model defined and documented per data product.
- □ Classification tags applied to all datasets and columns.
- □ Audit logs enabled and retained according to policy.
- □ Service accounts used for all pipeline operations.
Identity and access
Access is managed at the domain level, following the principle of least privilege. Group-based identities simplify management because permissions are tied to roles, not individuals. When someone changes teams, removing them from the old group revokes all of its permissions in one step.
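The group-based model can be sketched in a few lines. This is an illustration only: the group names, the permission strings, and the helpers `effective_permissions`, `can`, and `move_team` are hypothetical and do not correspond to any particular IAM product.

```python
from dataclasses import dataclass, field

# Hypothetical mapping of groups to permissions (names are illustrative).
GROUP_PERMISSIONS = {
    "sales-analysts": {"sales.orders:read"},
    "finance-analysts": {"finance.margins:read"},
}


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)


def effective_permissions(user):
    """Union of the permissions of the user's groups; nothing is granted per user."""
    perms = set()
    for group in user.groups:
        perms |= GROUP_PERMISSIONS.get(group, set())
    return perms


def can(user, permission):
    """Default deny: allowed only if one of the user's groups carries the permission."""
    return permission in effective_permissions(user)


def move_team(user, old_group, new_group):
    """A team change is one membership swap; old access disappears with the old group."""
    user.groups.discard(old_group)
    user.groups.add(new_group)
```

Because no permission is attached to the user directly, there is nothing left over to clean up after `move_team` runs.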
Service Identities
Pipelines use dedicated service accounts, never personal credentials.
Row-Level Security
Restrict data access based on user attributes (e.g., country code).
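As a rough sketch of the idea, the filter below keeps only rows whose country code appears in the user's attributes. In practice this is normally enforced by the query engine or warehouse rather than in application code; the `allowed_countries` attribute, the `country_code` column, and the `visible_rows` helper are assumptions made for illustration.

```python
def visible_rows(rows, user_attrs, column="country_code"):
    """Return only the rows whose `column` value is in the user's allowed set.

    A user with no `allowed_countries` attribute sees nothing (default deny).
    """
    allowed = set(user_attrs.get("allowed_countries", ()))
    return [row for row in rows if row.get(column) in allowed]


# Illustrative data, not taken from a real dataset.
orders = [
    {"order_id": 1, "country_code": "DE"},
    {"order_id": 2, "country_code": "FR"},
    {"order_id": 3, "country_code": "DE"},
]
german_analyst = {"allowed_countries": ["DE"]}
```

Calling `visible_rows(orders, german_analyst)` returns orders 1 and 3 only, while a user without the attribute gets an empty list.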
Just-In-Time Access
Elevated permissions are granted only when needed, for example for debugging, and are not held permanently.
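One way to picture just-in-time access is a grant that records why it was issued and stops being valid after a fixed window. The expiry window, the `Grant` fields, and the function names here are assumptions for the sake of the sketch, not a description of a specific access-management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    user: str
    role: str
    reason: str          # recorded so the elevation can be reviewed later
    expires_at: datetime


def grant_jit(user, role, reason, minutes=60, now=None):
    """Issue a time-boxed elevation (the 60-minute default is an assumption)."""
    now = now or datetime.now(timezone.utc)
    return Grant(user, role, reason, now + timedelta(minutes=minutes))


def is_active(grant, now=None):
    """A grant is usable only strictly before its expiry time."""
    now = now or datetime.now(timezone.utc)
    return now < grant.expires_at
```

Nothing has to remember to revoke the grant: once `expires_at` passes, `is_active` returns `False` on its own.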
Data classification
| Level | Definition | Example |
|---|---|---|
| PII / Sensitive | Personally identifiable information. Requires strict masking. | Email, Home Address |
| Restricted | Business-sensitive data. Access on a need-to-know basis. | Profit Margins, Vendor Contracts |
| Public | Safe for all employees to view. | Product Catalog, Store Locations |
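The three levels in the table can be modelled as an enumeration, with a check that every column in a dataset carries one of them. This is a minimal sketch: the `Classification` enum mirrors the table, while the `untagged_columns` helper and the example column names are hypothetical.

```python
from enum import Enum


class Classification(Enum):
    PII = "PII / Sensitive"
    RESTRICTED = "Restricted"
    PUBLIC = "Public"


def untagged_columns(schema_columns, tags):
    """Return the columns that have no tag, or a tag that is not a known level."""
    return [c for c in schema_columns
            if not isinstance(tags.get(c), Classification)]


# Illustrative schema and tags; `signup_ts` is deliberately left untagged.
customer_columns = ["email", "home_address", "store_id", "signup_ts"]
customer_tags = {
    "email": Classification.PII,
    "home_address": Classification.PII,
    "store_id": Classification.PUBLIC,
}
```

Running `untagged_columns(customer_columns, customer_tags)` flags `signup_ts`, which is the kind of gap the checklist item on classification tags is meant to catch.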
Sensitive data processing
When processing highly sensitive data, extra isolation is required. Three mechanisms apply here:
- Tokenization: Replacing sensitive values with non-sensitive tokens before they land in the lake.
- Confidential Computing: Processing data in encrypted memory enclaves (where available).
- Audit Logging: Every access to sensitive data is logged and periodically reviewed.
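Tokenization from the list above could look roughly like this. The sketch uses a keyed hash (HMAC-SHA256 from the Python standard library), which produces deterministic, non-reversible tokens. That is one possible approach, not necessarily the one the platform uses; a vault-based scheme that can map tokens back to values is a common alternative. The `tok_` prefix, the field names, and the key handling are assumptions.

```python
import hashlib
import hmac


def tokenize(value, key):
    """Replace a value with a deterministic keyed token (not reversible)."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:32]  # truncated to 128 bits for readability


def tokenize_record(record, sensitive_fields, key):
    """Tokenize only the named fields; all other fields pass through unchanged."""
    return {
        name: tokenize(value, key)
        if name in sensitive_fields and value is not None
        else value
        for name, value in record.items()
    }


# Usage: in a real pipeline the key would come from a secrets manager,
# never from source code. The literal below is for illustration only.
example_key = b"demo-key-not-for-production"
raw = {"email": "jane@example.com", "store_id": 7}
safe = tokenize_record(raw, {"email"}, example_key)
```

Because the same input and key always yield the same token, tokenized columns can still be joined and counted without exposing the underlying value.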
Failure modes
- ! "Everyone gets access": To avoid friction, teams grant wide permissions, leading to data leaks.
- ! No Audit Trail: A data breach occurs, and there is no record of who accessed the data or when.
- ! PII Leak: Raw personal data accidentally lands in a 'Public' Gold table because it wasn't classified at source.
- ! Key Loss: Encryption keys are managed poorly (for example, not backed up or replicated), leading to permanent data loss during a region failover.
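The PII Leak failure mode can be guarded against with a gate that refuses to publish a table as Public unless every column is explicitly tagged Public. The sketch below is one possible shape for such a check; the `PublishBlocked` exception, the `check_public_publish` function, and the table names are hypothetical.

```python
class PublishBlocked(Exception):
    """Raised when a table is not safe to publish at the Public level."""


def check_public_publish(table_name, columns, tags):
    """Block the publish if any column is unclassified or not tagged Public."""
    problems = {
        column: tags.get(column, "UNCLASSIFIED")
        for column in columns
        if tags.get(column) != "Public"
    }
    if problems:
        raise PublishBlocked(f"{table_name}: {problems}")
```

A missing tag is treated the same as a sensitive one, so a column that was never classified at source stops the publish instead of slipping through.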