An Engineer’s Runbook for Building Data Privacy Tools

Matt Zhou
10 min read, Feb 9, 2023

What I wish I knew in my first 90 days of privacy engineering work in a new organization

The emerging field of privacy engineering is evolving rapidly, with new data regulations pushing technologists to find safer and more scalable ways to protect user data.

After gaining experience designing privacy infrastructure and tooling at several different companies, I wanted to share lessons learned for anyone seeking to improve privacy culture and tooling within their technical organization. This runbook’s main focus is to describe key considerations in building privacy-by-design software and recommendations on effective privacy engineering within a technical organization.

Photo by Jason Dent on Unsplash

Given how data flows freely throughout an organization and is often transformed in unforeseen ways for analytics, operating functions, and monitoring purposes, the scope of data governance work can rapidly explode out into a vast cross-functional problem space.

Embedding privacy engineering into a tech team’s architecture and tooling is critical for three key reasons:

  • Compliance: regulators around the world can levy massive fines for failure to comply with privacy rules.
  • Brand safety: a business that protects personal information from exposure safeguards user trust in its services and avoids reputation damage.
  • Differentiation: a privacy-first posture can set a business apart as consumers grow more distrustful in the wake of data privacy violations.

Privacy Tool Design

Privacy engineering will involve a mix of techniques, technologies, and design patterns in order to solve the unique privacy problems within an organization.

We can view privacy software tool design as falling into four major buckets of functionality:

  1. Introspection: the ability to understand how your systems work.
  2. Detection: the ability to identify when privacy violations occur.
  3. Prevention: the ability to deter issues from arising in the first place.
  4. Verification: the ability to prove that you are compliant (e.g. through Records of Processing Activities (ROPAs) and Privacy Impact Assessments (PIAs)).

A valuable exercise when entering a new data ecosystem as a privacy engineer is to review the existing arsenal of privacy tools and infrastructure within the organization. Organizing these tools within the four buckets of functionalities can start to reveal organizational bottlenecks and weaknesses in internal compliance strategies.

Introspection

The ability to understand and observe the internal state of your systems is derived from proper metrics instrumentation, standardized schema management and evolution, and authoritative golden path tooling across a data organization.

All of these concepts drive towards a central idea: standardized infrastructure and metadata tagging that allow different teams to speak the same language, measure the same things, and share work in comprehensible ways.

Some general recommendations on system introspection:

  • Develop a shared definition of what constitutes PII for the technical organization. Generally, it helps to decentralize ownership of metadata tagging and PII identification to domain teams instead of making a single platform team a bottleneck in maintaining governance infrastructure. Decentralized ownership means that every team needs to reference a consistent and authoritative definition of PII and of how to identify it within data assets. Confidence in PII detection coverage across the organization’s datastores is a prerequisite for good system introspection.
A sample risk hierarchy from a DataHub blog post on PII tagging (copyright DataHub)
  • Build a privacy taxonomy for your organization’s datastores. This model should incorporate both privacy risk levels (critical, sensitive, public) and legal jurisdiction tags (GDPR, CCPA, HIPAA, COPPA, etc). Defining this taxonomy correctly makes it easier to answer key operations questions quickly, such as “which domain functions handle the most critical risk level data?” or “which systems will need a GDPR right to be forgotten compliance function applied to them?” Having a proper hierarchy for PII also allows teams to apply inheritance to compliance logic and account for PII evolution as data regulations change over time.
  • Parameterize Data Subject Request (DSR) infrastructure in a way that allows reuse and inheritance across different legal jurisdictions. A developer team should be able to specify geographic region as an input to a data deletion service and fulfill both GDPR and CCPA “right to be forgotten” requests without rewriting pipelines. Another example is data processing for a child under 13 in California, which would require the union of the regulatory requirements of both the CCPA and COPPA.
  • Schema registries provide an administrative framework for managing dataset metadata in structured ways. Without a schema registry, there is little confidence that teams are aligned on correct schema versions, consistent application of schema evolution, and a shared understanding of PII processing. In addition to schema management, a schema registry also centralizes artifacts like SerDes (serializers and deserializers) that help developer teams publish and consume datasets according to policy standards. SerDes are great places to hook in custom logic such as encryption-by-default for tagged PII fields or automated security policies.
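
To make the taxonomy and the parameterized DSR ideas more concrete, here is a minimal Python sketch. The risk levels and jurisdiction tags come from the list above; the dataset names, owning teams, and the `REQUIREMENTS` table are hypothetical placeholders that I made up for illustration, and the obligations are heavily simplified rather than legal guidance.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    PUBLIC = 1
    SENSITIVE = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Dataset:
    name: str
    owner: str                 # owning domain team
    risk: Risk
    jurisdictions: frozenset   # e.g. frozenset({"GDPR", "CCPA"})


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    Dataset("orders", "commerce", Risk.SENSITIVE, frozenset({"GDPR", "CCPA"})),
    Dataset("patient_notes", "clinical", Risk.CRITICAL, frozenset({"HIPAA"})),
    Dataset("kids_profiles", "growth", Risk.CRITICAL, frozenset({"CCPA", "COPPA"})),
    Dataset("press_releases", "marketing", Risk.PUBLIC, frozenset()),
]

# Hypothetical, heavily simplified obligations per regulation.
REQUIREMENTS = {
    "GDPR": {"delete_on_request", "export_on_request"},
    "CCPA": {"delete_on_request", "opt_out_of_sale"},
    "COPPA": {"verifiable_parental_consent"},
}


def datasets_under(jurisdiction):
    """Names of the datasets tagged with the given jurisdiction."""
    return [d.name for d in CATALOG if jurisdiction in d.jurisdictions]


def owners_of(risk):
    """Domain teams that own data at the given risk level."""
    return sorted({d.owner for d in CATALOG if d.risk is risk})


def obligations(jurisdictions):
    """Union of the obligations for every jurisdiction that applies."""
    combined = set()
    for jurisdiction in jurisdictions:
        combined |= REQUIREMENTS.get(jurisdiction, set())
    return combined
```

With this shape, `datasets_under("GDPR")` answers the "which systems need a right to be forgotten function" question, and `obligations({"CCPA", "COPPA"})` returns the combined rule set for the child-in-California case without a dedicated pipeline.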

Detection

Detecting privacy violations, pipeline failures, and anomalous behavior within your systems is a critical step in achieving compliance confidence. This encompasses timely alerting on known failure states, measuring observability coverage of your systems, and recovering from failure cases with mitigation runbooks.

Recommendations on designing detection tools include:

  • Setting SLAs and alerts on key operational metrics that drive north star business metrics.
  • Defining the correct shared suite of metrics that build your definition of privacy infrastructure health, such as production incident volume, secrets leaks, unprotected PII flags, and volumes, rates, and time-to-completion for DSRs. What metrics constitute privacy risk to the organization? What proxy metrics provide a lead-time window for mitigation of those privacy risks?
  • Outlining mitigation strategies for incidents that prescribe an incident command chain, runbooks for quickly mitigating known failure patterns, and critical stakeholder communication schedules. Privacy incidents (and data issues in general) can be uniquely complex because they cut across the different layers of an organization that share a dataset. Data systems can also experience two-fold problems, where infrastructure problems create knock-on data quality issues that also need to be resolved.
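
As one example of a detection metric, the sketch below checks DSR time-to-completion against an SLA and flags open requests that are getting close to it. The `DsrTicket` shape, the 30-day SLA, and the 80% warning threshold are all assumptions of mine; pick values that match your own regulatory deadlines.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DsrTicket:
    ticket_id: str
    opened_at: datetime
    completed_at: Optional[datetime] = None   # None means still open


def dsr_health(tickets, now, sla=timedelta(days=30), warn_fraction=0.8):
    """Split tickets into SLA breaches and early warnings.

    The 30-day SLA and the 80% warning threshold are assumed defaults.
    """
    breached, at_risk = [], []
    for ticket in tickets:
        # Open tickets are measured up to "now", closed ones up to completion.
        elapsed = (ticket.completed_at or now) - ticket.opened_at
        if elapsed > sla:
            breached.append(ticket.ticket_id)
        elif ticket.completed_at is None and elapsed >= sla * warn_fraction:
            at_risk.append(ticket.ticket_id)   # lead-time window for mitigation
    return {"breached": breached, "at_risk": at_risk}
```

The `at_risk` list is the proxy metric here: it gives the team a window to act before a request actually breaches the SLA.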

Prevention

There is a rich body of existing engineering resources devoted to preventative privacy tooling — the privacy-by-design principles, best practices from the SecOps field, and innovative zero-knowledge proof data processing architectures. The trick is knowing which resources and software engineering skills to use for specific privacy engineering use cases.

In many cases, knowing the data access patterns, existing technology tool ecosystem, and user skillsets for a particular privacy use case will help in narrowing down the system design constraints and necessary product features. The following section describes general recommendations for different modalities of data access and domains.

Microservices

Notes and recommendations on microservices include:

  • Microservices handling API responses typically don’t treat long term data storage as a first-class concern, and commonly offload those responsibilities to data lakes and databases.
  • The central role that microservices play in consuming data from other data sources and in forwarding data to other services, queues, and data consumers means that they do require a well-structured schema management framework for reading and writing data to other data processors.
  • This means that microservices may prioritize SerDes, schema artifacts like Protocol Buffers, and other tools for reading and writing data vs. operational needs in maintaining and storing data. Microservice developers may prefer API service endpoints or importable SDKs as entrypoints into privacy platforms.
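
Here is a rough sketch of what an importable SerDes entrypoint could look like for a microservice team. The schema, the tag names, and the `PrivacySerializer` class are hypothetical; the default hook simply redacts tagged fields, and a real platform might plug in encryption there instead.

```python
import json

# Hypothetical schema artifact: field name -> privacy tag.
USER_EVENT_SCHEMA = {
    "event_id": "public",
    "email": "pii",
    "ip_address": "pii",
    "page": "public",
}


def redact(value):
    """Default protection hook; a real platform might encrypt instead."""
    return "[REDACTED]"


class PrivacySerializer:
    """Serializes records to JSON, protecting every field tagged as PII."""

    def __init__(self, schema, protect=redact):
        self.schema = schema
        self.protect = protect

    def serialize(self, record):
        unknown = set(record) - set(self.schema)
        if unknown:
            # Fail closed: fields without a tag never leave the service.
            raise ValueError(f"fields missing from schema: {sorted(unknown)}")
        safe = {
            key: self.protect(value) if self.schema[key] == "pii" else value
            for key, value in record.items()
        }
        return json.dumps(safe, sort_keys=True).encode("utf-8")

    def deserialize(self, payload):
        return json.loads(payload.decode("utf-8"))
```

Rejecting untagged fields is a design choice on my part: it forces new fields to be classified in the schema before they can be published.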

Streaming

Commonly, streaming pipelines function as a PubSub intermediary between data producer and consumer applications, and most use cases don’t involve long term storage of data within the queues. Streaming pipeline best practices typically encourage stateless processing in order to promote idempotency and support at-least-once delivery semantics. Similar to microservices, streaming pipelines can be effective leverage points for data transformation logic but less so for data storage.
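
A small sketch of a stateless, idempotent privacy transform is below. It is plain Python rather than any particular streaming framework, and the message shape, the `email` field, and the hard-coded key are assumptions for illustration only.

```python
import hashlib
import hmac

# Assumption: in a real deployment this key comes from a secrets manager.
PSEUDONYM_KEY = b"demo-key-do-not-use-in-production"


def pseudonymize(value, key=PSEUDONYM_KEY):
    """Deterministic keyed hash: the same input always maps to the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub(message):
    """Stateless transform: the output depends only on the input message."""
    cleaned = dict(message)   # never mutate the incoming message
    if "email" in cleaned:
        cleaned["email"] = pseudonymize(cleaned["email"])
    return cleaned


def process(stream, sink):
    """Write scrubbed messages keyed by message id, so a redelivered message
    overwrites its earlier copy instead of adding a duplicate."""
    for message in stream:
        sink[message["id"]] = scrub(message)
```

Because `scrub` keeps no state and is deterministic, a message delivered twice under at-least-once semantics produces the same record both times.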

Notes and recommendations on streaming pipelines include:

Databases

Notes and recommendations on databases include:

  • Databases are a powerful leverage point for privacy tooling because they have historically been a governance layer for data architectures. The operational workflows for maintaining dataset metadata around ownership, lineage, and structured schema evolution are a natural place to apply CRUD operations to data for compliance purposes.
  • Many databases offer whole-table encryption at rest and some even offer row-level access controls (Snowflake, BigQuery, SQL Server, etc). Proper configuration of IAM user groups can offer an additional layer of access protection for sensitive data.
  • The query interfaces and indexed search capabilities of databases make DSRs relatively easy to execute as long as the correct user identifier fields for a dataset are known. This “search and destroy” method can iterate over known database tables and apply DELETE scripts for a particular user with relative ease, provided data asset owners onboard new tables with user identifier annotations in a consistent and timely way.
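
The “search and destroy” method can be sketched with Python’s built-in `sqlite3` module. The `USER_ID_COLUMNS` registry and the table names are hypothetical stand-ins for whatever annotation system your data asset owners use.

```python
import sqlite3

# Hypothetical registry: table name -> column holding the user identifier.
# Names come from this trusted registry, never from user input.
USER_ID_COLUMNS = {
    "orders": "customer_id",
    "support_tickets": "user_id",
}


def delete_user(conn, user_id, registry=USER_ID_COLUMNS):
    """Delete one user's rows from every registered table; return row counts."""
    deleted = {}
    with conn:   # commits on success, rolls back if any DELETE fails
        for table, column in registry.items():
            cursor = conn.execute(
                f'DELETE FROM "{table}" WHERE "{column}" = ?', (user_id,)
            )
            deleted[table] = cursor.rowcount
    return deleted
```

The returned counts per table are worth keeping, since they can feed an audit record for the request. A table that is missing from the registry is silently skipped, which is exactly why consistent onboarding matters.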
Rent the Runway’s implementation of a crypto shredding architecture (copyright Rent the Runway)
  • An alternative approach to DSR compliance for databases is a crypto shredding architecture that utilizes encryption-by-default at a per-user level and the deletion of encryption keys to achieve an exactly-once data deletion paradigm. An implementation of this for databases could look like the “Lost Key” pattern, where a UDF applies encryption logic to PII at a per-user level by default. A federated view of the PII using the UDF can link back to the table of per-user encryption keys to decrypt data on demand.
  • One caveat: the join key identifier between tables should be a newly generated UUID that stays consistent over time. This is important because GDPR treats reference IDs that enable re-identifying joins as personal data that must be protected. A newly generated UUID would only link to the encryption key table, which should be secured by stricter IAM policies.
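
The crypto shredding idea can be illustrated with an in-memory sketch. This is not the Rent the Runway implementation and not a database UDF; the `CryptoShredder` class is hypothetical, and the cipher is a toy built from the standard library so the example runs on its own. A real system should use a vetted authenticated cipher from a proper cryptography library.

```python
import hashlib
import secrets
import uuid


def _keystream_xor(key, nonce, data):
    """Toy stream cipher (SHA-256 in counter mode). NOT for production use;
    it only stands in for a real cipher so the sketch is self-contained."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class CryptoShredder:
    def __init__(self):
        self.keys = {}   # key table: key_id -> per-user key (lock down with IAM)
        self.rows = []   # data table: (key_id, nonce, ciphertext)

    def register_user(self):
        """Create a fresh UUID join key that carries no personal data itself."""
        key_id = str(uuid.uuid4())
        self.keys[key_id] = secrets.token_bytes(32)
        return key_id

    def write(self, key_id, plaintext):
        nonce = secrets.token_bytes(16)
        data = plaintext.encode("utf-8")
        ciphertext = _keystream_xor(self.keys[key_id], nonce, data)
        self.rows.append((key_id, nonce, ciphertext))

    def read(self, key_id):
        """Decrypt on demand; returns nothing once the key has been deleted."""
        key = self.keys.get(key_id)
        if key is None:
            return []
        return [
            _keystream_xor(key, nonce, ciphertext).decode("utf-8")
            for row_key_id, nonce, ciphertext in self.rows
            if row_key_id == key_id
        ]

    def forget(self, key_id):
        """Fulfil a deletion request by destroying the key, not the rows."""
        self.keys.pop(key_id, None)
```

After `forget`, the encrypted rows still exist in every copy and backup of the data table, but they can no longer be decrypted, which is the property that makes key deletion behave like data deletion.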

Data lakes

Data lakes have their own unique characteristics as a datastore that should inform privacy engineering designs:

  • Data is often unstructured and may not have search indexes to filter by user identifiers.
  • Storage is cheap, so high volumes of data get published in compression and serialization formats that are often not human-readable.
  • Assets are copied frequently and in large quantities across workspaces for diverse reasons: testing data for development environments, training data for machine learning models, and backups for disaster recovery.
  • Assets are most commonly used by large-scale batch jobs that perform enrichment and aggregation, so data engineers have the most interactions with this kind of storage.

Notes and recommendations on data lakes include:

  • Be careful with exporting pseudonymized or anonymized data from production environments to testing environments — the difference between encryption, pseudonymization, and anonymization can be crucial in reducing security risks and re-identification risks for sensitive data. This is especially critical given re-identification incidents stemming from machine learning capabilities and the ability to merge anonymized data with public datasets to recover user identities.
  • Re-publishing data lake assets can be extremely expensive at scale, so privacy tooling should incorporate flexibility and merge-on-read capabilities that can guarantee fresh dependencies when displaying data via a presentation layer. The idea of “data lakehouses” has recently been in the spotlight, with frameworks like Apache Hudi and Apache Iceberg offering effective options for incremental data processing and federated merge-on-read views of data lake assets.
  • Since data lakes often drive large-scale batch data processing pipelines using distributed processing frameworks like Apache Spark, offering SerDes that can be configured into the Spark workflow is an intuitive way to standardize shared-logic needs such as reading metadata annotations or anonymizing specific fields without developer teams needing their own custom code.
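
To show the merge-on-read idea in miniature, the sketch below filters deleted users at read time instead of rewriting the published data. It is a conceptual illustration in plain Python, not the Apache Hudi or Apache Iceberg API; the `user_id` field and the tombstone list are assumptions.

```python
def read_view(base_records, tombstones):
    """Merge-on-read: hide deleted users at read time; base data is untouched."""
    deleted = set(tombstones)
    return [row for row in base_records if row["user_id"] not in deleted]


def compact(base_records, tombstones):
    """Occasional batch rewrite that physically drops the deleted rows.

    Returns the new base data and an emptied tombstone list.
    """
    return read_view(base_records, tombstones), []
```

The read-time filter keeps the presentation layer fresh at low cost, but the rows still exist in storage until a compaction runs, so a deletion request is only physically complete after `compact`.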

Verification

Building verification tooling is a crucial step in privacy infrastructure because it holds other compliance tooling accountable for showing their work when completing DSRs. If privacy detection tooling is about identifying failure states, verification is about capturing success states — through artifacts like audit trails that describe the transactional log of data processing activities. This gives an organization the ability to provide evidence of its data protection practices in proving compliance.
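
One way to make an audit trail harder to quietly edit is to chain entries together with hashes, so each entry commits to the one before it. The sketch below is my own illustration of that idea; the `AuditTrail` class and the event fields are hypothetical, and hash chaining is just one possible design for a transactional log.

```python
import hashlib
import json


class AuditTrail:
    """Append-only log in which each entry commits to the previous one."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(prev_hash, event):
        payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, event):
        """Append a JSON-serializable event describing a processing activity."""
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry = {
            "event": event,
            "prev": prev_hash,
            "hash": self._digest(prev_hash, event),
        }
        self.entries.append(entry)
        return entry

    def verify(self):
        """Recompute every hash; False means an entry was altered or the
        chain between entries was broken."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != self._digest(prev_hash, entry["event"]):
                return False
            prev_hash = entry["hash"]
        return True
```

A deletion job could call `record` with the request id, the table, and the number of rows removed, and a compliance reviewer could later call `verify` before relying on the log as evidence.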

Recommendations on designing verification tools include:

Additional Resources

During my exploration of privacy engineering resources, I came across these amazing blogs, articles, and presentations that helped guide my journey. I’ve listed them here to facilitate any others looking to dive deeper into advanced concepts or case studies.

Feel free to connect with me on LinkedIn, Twitter, GitHub, or Medium!


Engineering Manager @VillageMD, previously Data@newyorktimes
