Privacy and PII | Context Governance for Enterprise AI

Personally identifiable information (PII) in AI systems is both a compliance risk and an operational risk. Sensitive information can leak through retrieval into prompts, appear in logs, or be stored indefinitely in vector indexes where deletion is complex. Governance must address detection, anonymisation, retrieval filtering, logging restrictions, and the right to deletion.

Key Takeaways

• PII must be detected and handled at ingestion, then checked again at retrieval before prompt assembly.
• Anonymisation and pseudonymisation serve different purposes; choose deliberately.
• Deletion from vector indexes requires an explicit, tested procedure.

PII lifecycle in the context pipeline

Detection and handling at ingestion limit how much PII propagates downstream; a second check at retrieval catches what ingestion misses.

PII categories in enterprise AI context

Direct identifiers
• Examples: names, email addresses, phone numbers, national IDs
• Risk: High (directly identifies an individual)

Indirect identifiers
• Examples: job title + location + age range, device ID, IP address
• Risk: Medium (identifiable in combination)

Sensitive categories
• Examples: health data, financial data, union membership, biometrics
• Risk: Critical (special category under GDPR and similar frameworks)
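The categories above can be carried as metadata so that downstream controls key off a single risk level per chunk. The following is a minimal sketch; the category keys, the `Risk` enum, and `chunk_risk` are hypothetical names introduced for illustration, and a chunk is assumed to take the highest risk among the categories detected in it.

```python
from enum import IntEnum


class Risk(IntEnum):
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# Category -> risk level, following the three categories listed above.
CATEGORY_RISK = {
    "direct_identifier": Risk.HIGH,
    "indirect_identifier": Risk.MEDIUM,
    "sensitive_category": Risk.CRITICAL,
}


def chunk_risk(categories):
    """Return the highest risk among the detected categories, or None if there are none."""
    risks = [CATEGORY_RISK[c] for c in categories]
    return max(risks) if risks else None
```

A chunk tagged with both an indirect identifier and a sensitive category would be treated as Critical under this rule.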

Anonymisation vs. pseudonymisation

These are not interchangeable. Choosing the wrong approach creates either false compliance confidence or unnecessary loss of utility.

Anonymisation

• Irreversible: the individual cannot be re-identified
• Truly anonymised data is no longer personal data, so GDPR data subject rights no longer apply to it
• Risk: hard to achieve; re-identification is often possible with auxiliary data
Use when: the data will never need to be linked back to an individual

Pseudonymisation

• Reversible: the mapping key is held separately
• Still considered personal data under GDPR
• Reduces exposure risk without losing re-linkability
Use when: audit trails or re-identification may be legitimately needed
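The difference can be sketched in a few lines. This is an illustration under stated assumptions, not a compliance-grade implementation: `PseudonymVault`, `pseudonymise`, `reidentify`, and `anonymise` are hypothetical names, the token is a truncated keyed HMAC-SHA256 digest, and the vault is assumed to be stored separately from the pseudonymised data. Replacing one field with a placeholder does not by itself make a dataset anonymous, because indirect identifiers may remain.

```python
import hashlib
import hmac
import secrets


class PseudonymVault:
    """Holds the secret key and the token -> value mapping; keep it apart from the data."""

    def __init__(self, key=None):
        self._key = key or secrets.token_bytes(32)
        self._reverse = {}

    def pseudonymise(self, value: str) -> str:
        # Keyed hash: the same input always yields the same token under one key.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        token = f"psn_{digest[:16]}"
        self._reverse[token] = value
        return token

    def reidentify(self, token: str) -> str:
        # Only possible for whoever holds the vault.
        return self._reverse[token]


def anonymise(value: str) -> str:
    """Irreversible replacement: no key and no mapping are kept."""
    return "[REDACTED]"
```

Because the token is deterministic per key, records for the same person still join across documents, which is what makes audit trails possible and also why the output remains personal data.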

PII in retrieval and prompt context

Even when source documents are classified correctly, retrieval can pull PII into a prompt. Policy alone is not sufficient; active filtering and prompt-level controls are required.

→ PII detection should run on retrieved chunks before prompt assembly, not just at ingestion.
→ If PII is detected in a retrieved chunk, either redact it or exclude the chunk from the prompt.
→ System prompts must not include credentials, personal data, or user-specific PII from previous sessions.
→ Model output should be scanned for PII before logging; logs are a common PII exfiltration path.
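The redact-or-exclude step above can be sketched as a filter that runs between retrieval and prompt assembly. The two regular expressions below are illustrative assumptions and will miss most real PII; a production pipeline would call a dedicated detection service instead. `redact` and `filter_chunks` are hypothetical names.

```python
import re

# Illustrative patterns only; a real detector covers far more PII types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace each match with a typed placeholder; return the text and the types found."""
    found = set()
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.add(label)
    return text, found


def filter_chunks(chunks, mode="redact"):
    """Redact PII in retrieved chunks, or drop affected chunks when mode is 'exclude'."""
    safe = []
    for chunk in chunks:
        cleaned, found = redact(chunk)
        if found and mode == "exclude":
            continue
        safe.append(cleaned)
    return safe
```

Redaction keeps the surrounding context available to the model; exclusion is the more conservative choice when the detector's recall is uncertain.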

Warning

Logging full prompts and responses is a common observability practice that creates a hidden PII store. Minimise what is logged and apply the same classification controls to log storage as to source data.

Right-to-deletion and vector indexes

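Vector indexes do not support deletion the way relational databases do: embeddings cannot be looked up by person unless each chunk carries identifying metadata, and copies may persist in backups and caches. Meeting the GDPR right to erasure therefore requires a documented, tested deletion procedure, not an assumption that deletion works.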

Deletion strategy

→ Every chunk must store a source_id and data_subject_id (where applicable)
→ Maintain a deletion queue: erasure requests are processed on a defined SLA
→ After deletion from the index, confirm with a retrieval test that the data is no longer returned
→ Check backup indexes and cached results: deletion is not complete until all copies are handled
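The deletion steps above can be sketched against an in-memory stand-in for a vector index. `ToyIndex`, `Chunk`, and `process_erasures` are hypothetical names; a real index stores embeddings and exposes its own delete API, verification there would be a retrieval query rather than a metadata lookup, and backups and caches need the same treatment.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    source_id: str
    text: str
    data_subject_id: Optional[str] = None  # set only where applicable


class ToyIndex:
    """In-memory stand-in for a vector index (no embeddings stored)."""

    def __init__(self):
        self.chunks = {}

    def add(self, chunk: Chunk) -> None:
        self.chunks[chunk.chunk_id] = chunk

    def ids_for_subject(self, subject_id: str) -> list:
        return [c.chunk_id for c in self.chunks.values()
                if c.data_subject_id == subject_id]

    def delete(self, chunk_ids) -> None:
        for chunk_id in chunk_ids:
            self.chunks.pop(chunk_id, None)


def process_erasures(index: ToyIndex, queue: deque) -> dict:
    """Drain the erasure queue; return {subject_id: verified} per request."""
    results = {}
    while queue:
        subject_id = queue.popleft()
        index.delete(index.ids_for_subject(subject_id))
        # Verification: the subject's chunks must no longer be returned.
        results[subject_id] = index.ids_for_subject(subject_id) == []
    return results
```

Recording the per-request verification result gives the erasure process an auditable completion signal, which is what an SLA needs to be measured against.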

Retention enforcement

→ Assign a retention_expiry to every chunk at ingestion
→ Run automated expiry sweeps on a defined schedule
→ Treat index pruning as a regular operational task, not an exception
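A minimal sketch of the retention steps above, assuming chunks are plain dictionaries and the retention period is supplied per source at ingestion. `with_expiry` and `sweep` are hypothetical names; the scheduler that runs the sweep and the call that deletes expired chunks from the index are left out.

```python
from datetime import datetime, timedelta, timezone


def with_expiry(chunk: dict, retention_days: int, ingested_at: datetime) -> dict:
    """Return a copy of the chunk stamped with retention_expiry at ingestion."""
    return {**chunk, "retention_expiry": ingested_at + timedelta(days=retention_days)}


def sweep(chunks: list, now: datetime):
    """Split chunks into (kept, expired) as of `now`."""
    kept = [c for c in chunks if c["retention_expiry"] > now]
    expired = [c for c in chunks if c["retention_expiry"] <= now]
    return kept, expired


# Usage: stamp at ingestion, then sweep on a schedule.
ingested = datetime(2024, 1, 1, tzinfo=timezone.utc)
chunks = [
    with_expiry({"id": "a"}, retention_days=30, ingested_at=ingested),
    with_expiry({"id": "b"}, retention_days=365, ingested_at=ingested),
]
kept, expired = sweep(chunks, now=ingested + timedelta(days=31))
```

Passing `now` explicitly keeps the sweep deterministic and easy to test; timezone-aware timestamps avoid ambiguity when ingestion and sweeps run in different regions.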

GCP mapping

Illustrative. Each layer maps to equivalent services on AWS, Azure, or any cloud.

• PII detection: Cloud DLP (Data Loss Prevention)
• Redaction at scale: Cloud DLP + Dataflow pipeline
• Deletion tracking: BigQuery + Vertex AI Vector Search delete API
• Log PII control: Cloud Logging exclusion filters + DLP scan

Failure modes

! PII is only detected at ingestion; retrieval pulls it into prompts unchecked.
! Full prompts and responses are logged, creating a shadow PII store.
! Deletion from vector indexes is assumed possible but never tested.
! Anonymisation is claimed but the data remains re-identifiable with auxiliary sources.
! Right-to-erasure requests have no SLA or completion verification.

Checklist

□ PII detection runs at ingestion and again at retrieval before prompt assembly.
□ Anonymisation vs. pseudonymisation is a deliberate, documented choice per source.
□ Deletion procedure for vector indexes is documented and tested.
□ Log content is PII-minimised; full prompt/response logging is restricted.
□ Retention expiry is set at chunk level and enforced by automated sweeps.