PII in AI systems represents both a compliance and an operational risk. Sensitive information can leak through retrieval into prompts, appear in logs, or be stored indefinitely in vector indexes where deletion is complex. Governance must address detection, anonymisation, retrieval filtering, logging restrictions, and the right to deletion.
Key Takeaways
- • PII must be detected and handled at ingestion, then checked again at retrieval before prompt assembly.
- • Anonymisation and pseudonymisation serve different purposes; choose deliberately.
- • Deletion from vector indexes requires an explicit, tested procedure.
PII lifecycle in the context pipeline
Detecting and handling PII at ingestion limits how far it can propagate downstream.
PII categories in enterprise AI context
| Category | Examples | Risk |
| --- | --- | --- |
| Direct identifiers | Names, email addresses, phone numbers, national IDs | High: directly identifies an individual |
| Indirect identifiers | Job title + location + age range, device ID, IP address | Medium: identifiable in combination |
| Sensitive categories | Health data, financial data, union membership, biometrics | Critical: special category under GDPR and similar frameworks |
Anonymisation vs. pseudonymisation
These are not interchangeable. Choosing the wrong approach creates either false compliance confidence or unnecessary loss of utility.
Anonymisation
- • Irreversible: the individual cannot be re-identified
- • Truly anonymised data falls outside GDPR, so data subject rights no longer apply to it
- • Risk: hard to achieve; re-identification is often possible with auxiliary data
- Use when: the data will never need to be linked back to an individual
Pseudonymisation
- • Reversible: the mapping key is held separately
- • Still considered personal data under GDPR
- • Reduces exposure risk without losing re-linkability
- Use when: audit trails or re-identification may be legitimately needed
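A minimal sketch of pseudonymisation, assuming a keyed hash (HMAC-SHA256) as the token function and an in-memory mapping table. `PseudonymVault`, `pseudonymise`, and the `subj_` prefix are hypothetical names for illustration; in a real deployment the key and the mapping would be held in a separate, access-controlled store.

```python
import hashlib
import hmac
from typing import Dict, Optional


def pseudonymise(value: str, key: bytes) -> str:
    """Derive a stable token from an identifier with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"subj_{digest[:16]}"


class PseudonymVault:
    """Hypothetical mapping store, kept apart from the index that holds the tokens."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._token_to_value: Dict[str, str] = {}

    def tokenise(self, value: str) -> str:
        # Same input and key always yield the same token, so records stay linkable.
        token = pseudonymise(value, self._key)
        self._token_to_value[token] = value
        return token

    def reidentify(self, token: str) -> Optional[str]:
        # Re-linking is only possible for whoever can read the mapping.
        return self._token_to_value.get(token)

    def forget(self, token: str) -> None:
        # Removes the stored link for one data subject.
        self._token_to_value.pop(token, None)
```

Because the token is reproducible by anyone holding the key, the tokenised data is still personal data. Removing a mapping entry cuts the direct route to re-linking but does not by itself make the data anonymous.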
PII in retrieval and prompt context
Even when source documents are classified correctly, retrieval can pull PII into a prompt. Policy alone is not sufficient; active filtering and prompt-level controls are required.
- → PII detection should run on retrieved chunks before prompt assembly, not just at ingestion.
- → If PII is detected in a retrieved chunk, either redact it or exclude the chunk from the prompt.
- → System prompts must not include credentials, personal data, or user-specific PII from previous sessions.
- → Model output should be scanned for PII before logging; logs are a common PII exfiltration path.
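The redact-or-exclude rule above can be sketched as a filter that runs on retrieved chunks before prompt assembly. This is an illustration under stated assumptions: the two regular expressions cover only email addresses and phone-like digit runs, and `Chunk`, `redact`, `filter_chunks`, and the `max_hits` threshold are hypothetical names. A production system would use a dedicated PII detection service with far broader coverage.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative patterns only: they miss many identifier types and can
# produce false positives (for example on long numeric strings).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


@dataclass
class Chunk:
    chunk_id: str
    text: str


def redact(text: str) -> Tuple[str, int]:
    """Replace detected identifiers with placeholders; return text and hit count."""
    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[PHONE]", text)
    return text, n_email + n_phone


def filter_chunks(chunks: List[Chunk], max_hits: int = 2) -> List[Chunk]:
    """Redact chunks with few hits; exclude chunks with more than max_hits."""
    safe = []
    for chunk in chunks:
        clean, hits = redact(chunk.text)
        if hits > max_hits:
            continue  # too PII-dense to redact usefully: keep it out of the prompt
        safe.append(Chunk(chunk.chunk_id, clean))
    return safe
```

The same `redact` function could be applied to model output before it is written to logs, so that the log pipeline does not become a second copy of the PII.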
Right-to-deletion and vector indexes
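Vector indexes do not handle deletion the way relational databases do: embeddings are derived copies of the source text, and some indexes only mark entries as deleted until a compaction or rebuild runs. Meeting the GDPR right to erasure therefore requires a tested deletion procedure, not an assumption that deletion works.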
Deletion strategy
- → Every chunk must store a source_id and data_subject_id (where applicable)
- → Maintain a deletion queue: erasure requests are processed within a defined SLA
- → After deletion from the index, confirm with a retrieval test that the data is no longer returned
- → Check backup indexes and cached results: deletion is not complete until all copies are handled
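The strategy above can be sketched as a queue worker that deletes by `data_subject_id` from every copy and then verifies the result. `ToyIndex` is a hypothetical in-memory stand-in for a vector index, its backups, and its caches; `StoredChunk` and `process_erasures` are also illustrative names. The verification here is a metadata lookup, whereas a real retrieval test would also issue representative queries against the live index.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional


@dataclass
class StoredChunk:
    chunk_id: str
    source_id: str
    data_subject_id: Optional[str]  # None when the chunk has no data subject
    text: str


class ToyIndex:
    """Hypothetical in-memory stand-in for one copy of the index."""

    def __init__(self) -> None:
        self.chunks: Dict[str, StoredChunk] = {}

    def add(self, chunk: StoredChunk) -> None:
        self.chunks[chunk.chunk_id] = chunk

    def search_by_subject(self, subject_id: str) -> List[StoredChunk]:
        return [c for c in self.chunks.values() if c.data_subject_id == subject_id]

    def delete_subject(self, subject_id: str) -> int:
        doomed = [c.chunk_id for c in self.search_by_subject(subject_id)]
        for chunk_id in doomed:
            del self.chunks[chunk_id]
        return len(doomed)


def process_erasures(queue: Deque[str], stores: List[ToyIndex]) -> Dict[str, bool]:
    """Delete each queued subject from every store, then verify nothing is returned."""
    verified: Dict[str, bool] = {}
    while queue:
        subject_id = queue.popleft()
        for store in stores:
            store.delete_subject(subject_id)
        # An erasure only counts as complete when no copy still returns data.
        verified[subject_id] = all(
            not store.search_by_subject(subject_id) for store in stores
        )
    return verified
```

Passing the primary index, backups, and caches in one `stores` list makes it harder to forget a copy, and the returned map gives each erasure request a recorded verification result.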
Retention enforcement
- → Assign a retention_expiry to every chunk at ingestion
- → Run automated expiry sweeps on a defined schedule
- → Treat index pruning as a regular operational task, not an exception
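A minimal sketch of retention enforcement, assuming each chunk carries a timezone-aware `retention_expiry` and that the index can be modelled as a dictionary keyed by chunk ID. `RetainedChunk`, `assign_expiry`, and `sweep_expired` are hypothetical names; a scheduler (not shown) would call the sweep at the defined interval.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class RetainedChunk:
    chunk_id: str
    retention_expiry: datetime


def assign_expiry(ingested_at: datetime, retention_days: int) -> datetime:
    """Compute the expiry timestamp once, at ingestion."""
    return ingested_at + timedelta(days=retention_days)


def sweep_expired(index: Dict[str, RetainedChunk], now: datetime) -> List[str]:
    """Remove every chunk whose expiry has passed; return the removed IDs."""
    expired = [cid for cid, chunk in index.items() if chunk.retention_expiry <= now]
    for cid in expired:
        del index[cid]
    return expired
```

Returning the removed IDs lets the sweep feed an audit record and the same multi-copy cleanup used for erasure requests.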
GCP mapping
Illustrative. Each layer maps to equivalent services on AWS, Azure, or any cloud.
Failure modes
- ! PII is only detected at ingestion; retrieval pulls it into prompts unchecked.
- ! Full prompts and responses are logged, creating a shadow PII store.
- ! Deletion from vector indexes is assumed possible but never tested.
- ! Anonymisation is claimed but the data remains re-identifiable with auxiliary sources.
- ! Right-to-erasure requests have no SLA or completion verification.
Checklist
- □ PII detection runs at ingestion and again at retrieval before prompt assembly.
- □ Anonymisation vs. pseudonymisation is a deliberate, documented choice per source.
- □ Deletion procedure for vector indexes is documented and tested.
- □ Log content is PII-minimised; full prompt/response logging is restricted.
- □ Retention expiry is set at chunk level and enforced by automated sweeps.