...
The first step in preparing a DLP scan is to choose a relevant set of standard infoTypes for your use case. The Centre has used the following standard global infoTypes in its scans: AGE, CREDIT_CARD_NUMBER, DATE, DATE_OF_BIRTH, EMAIL_ADDRESS, ETHNIC_GROUP, GENDER, GENERIC_ID, ICD10_CODE, ICD9_CODE, IMEI_HARDWARE_ID, LOCATION, MEDICAL_TERM, ORGANIZATION_NAME, PERSON_NAME, PHONE_NUMBER, STREET_ADDRESS, and URL.
The Google DLP team controls the standard infoType detectors, meaning they may update them or add new ones periodically. In order to monitor these externally unavailable changes, the Centre recommends that organisations create a benchmark file to test at regular intervals.
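One way to act on the benchmark recommendation above is to scan the same benchmark file at regular intervals and compare each result with a stored baseline. The sketch below is a minimal illustration only: the `BenchmarkHit` type, the `diff_benchmark` helper, and the sample values are hypothetical and not part of DLP or of the Centre's tooling.

```python
from typing import Dict, Iterable, List, NamedTuple


class BenchmarkHit(NamedTuple):
    """One detection on the benchmark file (hypothetical structure)."""
    info_type: str  # e.g. "EMAIL_ADDRESS"
    quote: str      # the text that was matched


def diff_benchmark(
    baseline: Iterable[BenchmarkHit], current: Iterable[BenchmarkHit]
) -> Dict[str, List[BenchmarkHit]]:
    """Report detections that appeared or disappeared since the baseline scan."""
    before, after = set(baseline), set(current)
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
    }


# Hypothetical results from scanning the same benchmark file on two dates.
baseline = [
    BenchmarkHit("EMAIL_ADDRESS", "a@example.org"),
    BenchmarkHit("PERSON_NAME", "Amina"),
]
current = [
    BenchmarkHit("EMAIL_ADDRESS", "a@example.org"),
    BenchmarkHit("LOCATION", "Amina"),
]

changes = diff_benchmark(baseline, current)
```

A non-empty "added" or "removed" list suggests that a standard detector may have changed and that the affected infoTypes are worth re-testing.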
...
Typically, custom infoTypes are dictionaries (e.g. text files containing lists of words or phrases, with each new line treated as its own unit). DLP only matches alphanumeric characters, so all special characters are treated as whitespace. For example, “household size” will match “household size,” “household-size,” and “household_size.” Dictionary words are also case-insensitive.
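The matching behaviour described above can be approximated locally when drafting a dictionary. The sketch below is not DLP's implementation; it assumes that matching is equivalent to lower-casing the text and replacing every run of non-alphanumeric characters with a single space. The function names are hypothetical.

```python
import re
from typing import Iterable


def normalise(text: str) -> str:
    """Lower-case text and collapse each non-alphanumeric run into one space."""
    # \W covers punctuation and whitespace; "_" is added because Python
    # counts the underscore as a word character.
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def matches_dictionary(cell: str, dictionary_lines: Iterable[str]) -> bool:
    """Return True if any dictionary entry occurs as a whole phrase in the cell."""
    padded_cell = f" {normalise(cell)} "
    for entry in dictionary_lines:
        phrase = normalise(entry)
        if phrase and f" {phrase} " in padded_cell:
            return True
    return False


# Hypothetical one-entry dictionary, one phrase per line.
hh_dictionary = ["household size"]
```

Under these assumptions, "household-size" and "Household_Size" both match the entry "household size", while "households" does not.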
Custom infoTypes may also be regular expressions, enabling DLP to detect matches based on regular expression patterns.
The Centre has created and currently uses a set of twelve custom infoTypes:
Custom infoType | Format | Description |
DISABILITY_GROUP | Dictionary | A list of disabilities / disabled 'groups' or groups with limited 'functioning' per standard classification |
EDUCATION_LEVEL | Dictionary | A list of different indicators for level of education (e.g. ‘OSY’ for out of school youth) |
GEO_COOR | Regular expression | Latitude and longitude coordinates |
HDX_HEADERS | Dictionary | A set of commonly seen column names that may indicate presence of sensitive data (e.g. ‘Key Informant’) |
HH_ATTRIBUTES | Dictionary | A list of words indicating specific attributes of a household (e.g. ‘Child_Headed_Household’) |
HXL_TAGS | Regular expression | A subset of all existing HXL tags that have been associated with (potentially) sensitive data |
MARITAL_STATUS | Dictionary | A list of marital statuses |
OCCUPATION | Dictionary | A list of employment statuses and common occupations |
PROTECTION_GROUP | Dictionary | A list of different indicators for populations of concern (e.g. ‘pregnant’ or ‘unaccompanied child’) |
RELIGIOUS_GROUP | Dictionary | A list of religions / religious groups in different languages |
SEXUALITY | Dictionary | A list of sexual orientations |
SPOKEN_LANGUAGE | Dictionary | A list of spoken languages |
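To illustrate how the dictionary and regular expression formats in the table might be declared, the sketch below builds an inspection configuration as a plain Python dictionary. The field names follow the shape used in Google's published Python samples for Cloud DLP, but they should be checked against the current API reference. The word list and the coordinate pattern are hypothetical; they are not the Centre's actual MARITAL_STATUS dictionary or GEO_COOR expression.

```python
import re

# Hypothetical pattern for decimal latitude/longitude pairs such as
# "-1.2921, 36.8219". It does not check that values fall in a valid range.
GEO_COOR_PATTERN = r"-?\d{1,2}\.\d+\s*,\s*-?\d{1,3}\.\d+"

inspect_config = {
    # Standard infoTypes are referenced by name only.
    "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
    "custom_info_types": [
        {
            # Dictionary-based custom infoType with an inline word list.
            "info_type": {"name": "MARITAL_STATUS"},
            "dictionary": {
                "word_list": {"words": ["married", "divorced", "widowed"]}
            },
        },
        {
            # Regular-expression custom infoType.
            "info_type": {"name": "GEO_COOR"},
            "regex": {"pattern": GEO_COOR_PATTERN},
        },
    ],
    # Ask for the matched text so that results can be reviewed.
    "include_quote": True,
}

# Quick local check of the pattern before using it in a scan.
sample_match = re.search(GEO_COOR_PATTERN, "Nairobi -1.2921, 36.8219")
```

Testing the pattern locally first is only a convenience; DLP evaluates regular expressions with its own engine, so behaviour should be confirmed in an actual scan.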
...
Unlike the standard infoTypes, the custom infoTypes must be maintained by your organisation. Some custom infoTypes are essentially complete from the outset; others may require updates over time as your organisation learns and adapts.
The Centre has organized its custom infoTypes into three main categories, as described in the table below: comprehensive, comprehensive in context, and not comprehensive.
...
Category | Description |
(1) Comprehensive: GEO_COOR (regex), HXL_TAGS (regex), PROTECTION_GROUP, RELIGIOUS_GROUP, SEXUALITY, SPOKEN_LANGUAGE | Static. No updates needed unless errors or omissions are found. Example: SPOKEN_LANGUAGE will not need to be updated unless certain rare or dying languages appear to be missing. |
(2) Comprehensive in context: DISABILITY_GROUP, EDUCATION_LEVEL, MARITAL_STATUS | Functionality will be dependent on the correct context of key terms. Example: “single” is not exclusively a marital status, just as “primary” is not always an education level. |
(3) Not comprehensive: OCCUPATION, HH_ATTRIBUTES, HDX_HEADERS | Difficult to capture all possibilities upfront; may need updates as more datasets are scanned. Example: “child_headed”, “families headed by children”, and “hohh child” all express the same household attribute; different data contributors may have their own versions. |
Over time, the Centre will refine its use of DLP by adding, updating, or removing custom infoTypes across these three categories to improve the detection of different forms of sensitive data.
...
Once an organisation has selected standard infoTypes and created custom infoTypes to use in their scans, it is time to test and refine DLP to make sure the output meets their requirements.
Given data input, DLP returns details about 1) the infoTypes detected in the text; 2) a likelihood, ranging from VERY_UNLIKELY to VERY_LIKELY with default POSSIBLE, that indicates how likely it is that the data matches the given infoType; and 3) a quote, which is the actual string of data identified as the infoType.
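The three elements of the output described above can be modelled as a small record, which makes it easier to filter results during testing. The sketch below is an illustration only: the `Finding` class and `at_least` helper are hypothetical, and the numeric ranks are simply an ordering from VERY_UNLIKELY to VERY_LIKELY rather than values taken from the API.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Likelihood(IntEnum):
    """Likelihood labels, ordered from least to most likely."""
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


@dataclass(frozen=True)
class Finding:
    info_type: str          # 1) the infoType detected
    likelihood: Likelihood  # 2) how likely the match is
    quote: str              # 3) the string that was matched


def at_least(
    findings: Iterable[Finding], minimum: Likelihood = Likelihood.POSSIBLE
) -> List[Finding]:
    """Keep only findings at or above the given likelihood."""
    return [f for f in findings if f.likelihood >= minimum]


# Hypothetical findings for a single dataset.
sample = [
    Finding("PERSON_NAME", Likelihood.VERY_LIKELY, "Amina"),
    Finding("LOCATION", Likelihood.UNLIKELY, "Camp A"),
]
kept = at_least(sample)
```

Reading the quotes of the findings that remain is what reveals whether an infoType is catching the intended variable.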
Because organisations will focus on the DLP capabilities most relevant to their existing data management process, there is no one-size-fits-all approach to testing. However, we recommend answering two key questions before deploying DLP:
Can DLP detect all or most types of sensitive data that we encounter in our existing data management process?
How accurately do the detected infoTypes from a scan match the PII and key variables we are trying to catch?
Over the course of 4 months, the Centre conducted 6 rounds of testing on a set of 70 sensitive datasets and 70 placebo datasets to assess these questions. While the answers may seem obvious from an initial glance at the infoType descriptions, we found it crucial to observe DLP’s detection mechanisms in detail. For example, when using the LAST_NAME infoType in early stages of testing, we looked at the quotes and discovered it was flagging refugee camp names along with actual surnames. The infoType was not accurately matching the variable we intended to catch. Because microdata uploaded to HDX frequently contains camp names, we decided that the LAST_NAME infoType was not particularly helpful to include in the Centre’s scans. Additionally, we expected the LOCATION infoType to detect GPS coordinates, but found that it did not do so in practice. In other words, DLP did not initially detect a type of sensitive data we were looking for. In response, we adjusted our initial assumptions and created a custom infoType to detect longitude and latitude coordinates.
Organisations may ultimately differ from the Centre in their findings, but these types of contextual observations and decisions in light of the two key questions are what define a robust testing process for DLP. Accordingly, the process should draw upon cross-functional expertise from across the organisation, not just the data scientists. At the Centre, the Development team managed the technical details of DLP while both the Data Partnerships and the Data Responsibility teams analyzed the outputs of the scans.
...
Even once an organisation is confident that DLP accurately detects the types of sensitive data present in their context, the output of a DLP scan alone does not determine whether a given dataset is sensitive. Ultimately, each organisation needs to define their own criteria for interpreting the DLP output (e.g. does the presence of a single instance of an infoType mean a dataset is sensitive?).
Depending on the number of standard and custom infoTypes included in the inspection, the raw output of a DLP scan may comprise anywhere from zero to millions of rows of detected matches. On average, it took our team 48 hours to review the full results of each test scan, which included 70 sensitive datasets and 70 placebo datasets. While reviewing the results of a scan for a single dataset would take much less time, this process still proved onerous and was not conducive to our use case of reaching a binary decision about a dataset’s sensitivity. Based on this difficulty, we proceeded to explore whether we could create a complementary tool or algorithm to classify a dataset as sensitive using the training data generated through DLP testing.
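As an illustration of what such criteria could look like, the sketch below flags a dataset when any infoType is detected at least a minimum number of times at or above a chosen likelihood. The rule, the thresholds, and the `flag_dataset` helper are hypothetical; they are not the criteria used by the Centre.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Ordering of likelihood labels, lowest to highest.
LIKELIHOOD_RANK = {
    "VERY_UNLIKELY": 1,
    "UNLIKELY": 2,
    "POSSIBLE": 3,
    "LIKELY": 4,
    "VERY_LIKELY": 5,
}


def flag_dataset(
    findings: Iterable[Tuple[str, str]],
    min_likelihood: str = "LIKELY",
    min_count: int = 1,
) -> Dict[str, int]:
    """Return the infoTypes (with counts) that meet both thresholds.

    `findings` is an iterable of (info_type, likelihood) pairs.
    An empty result means the rule does not flag the dataset.
    """
    floor = LIKELIHOOD_RANK[min_likelihood]
    counts = Counter(
        info_type
        for info_type, likelihood in findings
        if LIKELIHOOD_RANK[likelihood] >= floor
    )
    return {t: n for t, n in counts.items() if n >= min_count}


# Hypothetical findings for one dataset.
findings = [
    ("EMAIL_ADDRESS", "VERY_LIKELY"),
    ("EMAIL_ADDRESS", "LIKELY"),
    ("AGE", "POSSIBLE"),
]
flagged = flag_dataset(findings)
```

With the defaults, a single confident match is enough to flag the dataset; raising `min_count` expresses a stricter interpretation of the same output.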
A machine learning approach
To interpret the raw output of a DLP inspection scan, the Centre has developed a robust model that averages the results of a random forest model, a generalized linear model, and a gradient boosting model to predict the probability that a dataset is sensitive. We generated the training data for this model from our most recent round of testing on 70 sensitive datasets and 70 placebo datasets. The model uses detected infoTypes, likelihoods, and quotes as the main elements in its analysis.
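The averaging step can be sketched in a few lines. The example below assumes that each of the three fitted models has already produced a probability that a given dataset is sensitive; the scores, the 0.5 decision threshold, and the function names are hypothetical and do not describe the Centre's actual implementation or training procedure.

```python
from statistics import mean
from typing import Dict, Tuple


def ensemble_probability(model_probabilities: Dict[str, float]) -> float:
    """Average the per-model probabilities that one dataset is sensitive."""
    return mean(model_probabilities.values())


def classify(
    model_probabilities: Dict[str, float], threshold: float = 0.5
) -> Tuple[str, float]:
    """Turn the averaged probability into a binary label (threshold is assumed)."""
    probability = ensemble_probability(model_probabilities)
    label = "sensitive" if probability >= threshold else "non-sensitive"
    return label, probability


# Hypothetical outputs of the three models for a single dataset.
scores = {
    "random_forest": 0.9,
    "generalized_linear": 0.6,
    "gradient_boosting": 0.75,
}
label, probability = classify(scores)
```

Averaging probabilities in this way is often called soft voting; it tends to dampen the errors of any single model, at the cost of making the final score harder to trace back to one set of rules.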
...
As the model is used to classify more and more datasets, its prediction values should become more and more accurate. If we start to see higher levels of model error – e.g. if the QA officer often disagrees with the model’s classification of sensitivity – we will revisit our DLP testing process, may reevaluate our use of certain standard and custom infoTypes, and will retrain the model accordingly.
In this way, we will refine our use of DLP over time based on its performance. We will continue to update this document based on what we learn.
Annex: Navigating the official Google DLP documentation
Google maintains extensive documentation of Cloud DLP, including quickstart guides, references, and code samples. The links below provide a simple way to navigate that documentation and assess which capabilities are relevant in your organisation’s context.
...