
Introducing DLP: a tool to support the responsible management of humanitarian microdata

As HDX continues to scale and mature, it remains an ongoing challenge to ensure that sensitive or personal data is not exposed publicly on the platform. The HDX Terms of Service prohibit the sharing of personally identifiable information (PII) on HDX, and the HDX team manually reviews every uploaded resource as part of a standard quality assurance (QA) process that flags sensitive, high-risk data to our contributors. Sensitive data of this kind typically appears in survey or needs assessment data, otherwise known as ‘microdata’. Microdata is integral to crisis response, but it may present a re-identification risk to certain individuals and groups depending on the key variables present in the data. For example, the combination of age, marital status, and location could allow the re-identification of a specific individual in a camp. Similarly, information about disabilities or child-headed households could be exposed in the free-text field of a needs assessment. Assessing the disclosure risk presented by microdata uploaded to the platform is a key component of how the Centre supports our partners in managing sensitive data more responsibly.

With support from the Directorate-General for European Civil Protection and Humanitarian Aid Operations (ECHO) and the Foreign, Commonwealth and Development Office (FCDO) COVIDAction programme, the Centre has developed an improved technical infrastructure for the management of sensitive data on HDX. We now automatically screen all data for personal or sensitive information at the point of upload rather than after it is made public. Datasets flagged as sensitive based on our criteria are marked ‘under review’ in the public interface of HDX and made inaccessible until the HDX team completes a manual review of the data. In order to automate this screening process for sensitive data, the Centre uses a detection tool from Google called Cloud Data Loss Prevention (DLP).

What is DLP and what is it designed to do?

Google Cloud Data Loss Prevention (DLP) is a service designed to discover, classify, and protect sensitive information. 

DLP provides: 

  • The ability to use over 120 built-in information type detectors, known as ‘infoTypes’, to identify sensitive data 

  • The ability to define custom infoTypes using dictionaries, regular expressions, and contextual rules

  • The ability to detect sensitive data in streams of text, structured text, storage repositories, and even images

  • The ability to apply de-identification techniques and re-identification risk analyses to the data (although the Centre does not currently utilize this capability)

Given data input, DLP returns details about any infoTypes found in the text, the likelihood that the infoType was correctly identified, and the actual pieces of sensitive text from the dataset. 
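As a minimal sketch of what this looks like in practice, the Python client can be called as follows (assuming the google-cloud-dlp package and a hypothetical project id "my-project"; the example text is invented):

```python
# Minimal sketch: inspect a block of text for two built-in infoTypes.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical GCP project

inspect_config = {
    "info_types": [{"name": "PERSON_NAME"}, {"name": "PHONE_NUMBER"}],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,  # return the matched text itself
}
item = {"value": "Contact Aisha Diallo on +254 712 345678."}  # invented example

response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)

for finding in response.result.findings:
    # Each finding carries the infoType, the matched quote, and DLP's
    # confidence that the match is correct.
    print(finding.info_type.name, finding.quote, finding.likelihood.name)
```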

Purpose of this document 

The goal of this document is to provide an overview of DLP as used by the Centre and model how it might be included in a responsible data management process. 

Google maintains extensive documentation of DLP that serves as a helpful technical resource when getting started. Our resource supplements and contextualizes that documentation by presenting a humanitarian use case of the tool. We hope that the Centre’s experience will help our partners decide whether to use DLP in their own management processes. 

Customizing DLP: choose and create infoTypes for humanitarian contexts 

Standard infoTypes: built-in detection mechanisms

Cloud DLP provides a set of over 120 built-in information types, or ‘infoTypes’, that define the sensitive data it can detect in a resource. There are both global and country-specific infoTypes. For example, LOCATION and GENDER are global infoTypes, while FRANCE_PASSPORT is a country-specific infoType for France. Google maintains a list of all of its built-in infoTypes in its documentation.
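The current list can also be retrieved programmatically; a short sketch using the google-cloud-dlp Python package:

```python
# Sketch: enumerate the built-in infoType detectors via the API.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
response = client.list_info_types(request={})

for info_type in response.info_types:
    print(info_type.name, "-", info_type.display_name)
```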

The DLP team updates standard infoType detectors and releases new ones periodically. Because these changes are not announced externally, the Centre recommends creating a benchmark file and testing it at regular intervals to catch shifts in detection behavior.
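One possible shape for such a benchmark check, assuming the findings come from an inspect_content call like the sketch above and that baseline counts are kept in a JSON file (file names are hypothetical):

```python
# Sketch of the benchmark idea: scan the same reference file at regular
# intervals and diff per-infoType counts against a saved baseline, to
# spot silent changes in Google's detectors.
import json
from collections import Counter

def finding_counts(findings):
    """Tally DLP findings per infoType name."""
    return Counter(f.info_type.name for f in findings)

with open("benchmark_baseline.json") as f:  # hypothetical baseline file
    baseline = json.load(f)  # e.g. {"PERSON_NAME": 12, "LOCATION": 40}

# `findings` would come from today's inspect_content scan of the
# benchmark file, as in the earlier sketch.
current = finding_counts(findings)

for name in sorted(set(baseline) | set(current)):
    before, now = baseline.get(name, 0), current.get(name, 0)
    if before != now:
        print(f"{name}: baseline={before}, now={now}")
```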

Custom infoTypes: personalized detection mechanisms

While standard infoTypes are useful in certain contexts, custom infoTypes allow humanitarian organizations to specify and detect potentially sensitive keywords associated with affected people, humanitarian actors and/or a response. 

Typically, custom infoTypes are dictionaries: text files containing lists of words or phrases, with one word or phrase per line. DLP matches only alphanumeric characters, so all special characters are treated as whitespace. For example, “household size” will match “household size,” “household-size,” and “household_size.” Dictionary words are also case-insensitive.

Custom infoTypes may also be regular expressions, enabling DLP to detect matches based on regular expression patterns.  
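As a sketch of how these two formats are expressed in a DLP request (the word list and pattern below are illustrative, not the Centre's actual definitions):

```python
# Two illustrative custom infoTypes: a dictionary for MARITAL_STATUS
# and a regular expression for GEO_COOR.
custom_info_types = [
    {
        "info_type": {"name": "MARITAL_STATUS"},
        "dictionary": {
            "word_list": {"words": ["marital status", "married", "widowed", "divorced"]}
        },
    },
    {
        "info_type": {"name": "GEO_COOR"},
        # Illustrative pattern for decimal "lat, lon" pairs only.
        "regex": {"pattern": r"-?\d{1,3}\.\d+,\s*-?\d{1,3}\.\d+"},
    },
]

# These plug into the inspect_config of an inspect_content request:
inspect_config = {"custom_info_types": custom_info_types, "include_quote": True}
```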

The Centre currently maintains twelve custom infoTypes:

| Custom infoType | Format | Description |
|---|---|---|
| DISABILITY_GROUP | Dictionary | A list of disabilities, disabled 'groups', or groups with limited 'functioning', per standard classification |
| EDUCATION_LEVEL | Dictionary | Indicators for level of education |
| GEO_COOR | Regular expression | Latitude and longitude coordinates |
| HDX_HEADERS | Dictionary | Commonly seen column names that may indicate the presence of sensitive data, e.g. Key Informant |
| HH_ATTRIBUTES | Dictionary | Words indicating specific attributes of a household, e.g. Child_Headed_Household |
| HXL_TAGS | Regular expression | A subset of existing HXL tags that have been associated with (potentially) sensitive data |
| MARITAL_STATUS | Dictionary | A list of marital statuses |
| OCCUPATION | Dictionary | A list of employment statuses and common occupations |
| PROTECTION_GROUP | Dictionary | Indicators for populations of concern |
| RELIGIOUS_GROUP | Dictionary | A list of religions and religious groups in different languages |
| SEXUALITY | Dictionary | A list of sexual orientations |
| SPOKEN_LANGUAGE | Dictionary | A list of spoken languages |

The dictionary infoTypes are currently set up to match two types of information: the column names of key variables and the values of each key variable. For example, SPOKEN_LANGUAGE matches “mother tongue” and “language” as well as specific language names. Similarly, MARITAL_STATUS matches the term “marital status” as well as “married” and “widowed.”

Updating custom infoTypes over time 

Because custom infoTypes are inherently static, the Centre treats them as living artifacts, cycling through updates until detection is consistently strong. To support this, we have developed a category-based system for monitoring and updating the custom infoTypes over time.

(1) Comprehensive: GEO_COOR (regex), HXL_TAGS (regex), PROTECTION_GROUP, RELIGIOUS_GROUP, SEXUALITY, SPOKEN_LANGUAGE

  • Description: May have an occasional error or omission. For example, SPOKEN_LANGUAGE may be missing certain rare or dying languages.

  • Action: Make occasional edits or additions as needed. Check the list of key variables from the statistical disclosure control (SDC) process; if any key variables were missed in the results, update the specific dictionary accordingly.

(2) Comprehensive in context: DISABILITY_GROUP, EDUCATION_LEVEL, MARITAL_STATUS

  • Description: It is difficult to ensure the correct context of key terms. For example, “single” is not exclusively a marital status, just as “primary” is not always an education level.

  • Action: Watch for low accuracy levels. If the majority of an infoType’s results are incorrect in the context of each dataset, consider eliminating certain terms to narrow its scope. For example, even if the words “single” and “separated” were deleted, MARITAL_STATUS could still capture most marriage-specific terms.

(3) Not comprehensive: OCCUPATION, HH_ATTRIBUTES, HDX_HEADERS

  • Description: It is difficult to capture all possibilities upfront; these dictionaries are highly dependent on the test datasets seen thus far. For example, “child_headed”, “families headed by children”, and “hohh child” all express the same household attribute, and different organizations may have their own versions.

  • Action: Continually update as new values and/or permutations of headers are found. Check each dictionary against the list of key variables from the SDC process to make sure all variations of variable names and values are represented. For example, if a dataset has an “occupation” column, ensure that all of its values are included in the OCCUPATION dictionary.
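The dictionary checks described in the Action items above can be partly automated. A minimal sketch, assuming a hypothetical dictionary file and a dataset with an "occupation" column:

```python
# Sketch: flag dataset values that are missing from a custom dictionary,
# so the dictionary can be updated as part of the review cycle.
import csv

with open("occupation_dictionary.txt") as f:  # hypothetical dictionary file
    dictionary = {line.strip().lower() for line in f if line.strip()}

with open("dataset.csv") as f:  # hypothetical dataset
    values = {row["occupation"].strip().lower() for row in csv.DictReader(f)}

missing = sorted(values - dictionary)
if missing:
    print("Consider adding to the OCCUPATION dictionary:", missing)
```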

Navigating the official Google DLP documentation 

Google maintains extensive documentation of Cloud DLP, including quickstart guides, references, and code samples. The links below provide a condensed way to navigate that documentation and assess which capabilities are relevant in your organization’s context. 

Cloud DLP is capable of:

Inspection

  • Detect sensitive data, matching built-in or custom infoTypes, in streams of text, structured text, storage repositories, and images

Redaction

  • Replace any sensitive data (any chosen infoTypes) with placeholder text

  • Replace any sensitive data (any chosen infoTypes) with placeholder text 

De-identification (a code sketch of one option follows this list)

  • Techniques include: 

    • Masking sensitive data by partially or fully replacing characters with a symbol, such as an asterisk (*) or hash (#)

    • Replacing each instance of sensitive data with a token, or surrogate, string

    • Encrypting and replacing sensitive data using a randomly generated or pre-determined key

  • Within structured text (tabular data): 

    • Scan a single column for a certain data type instead of the entire table structure

    • Transform a single column

      • e.g. bucketing a column of scores into increments of 10

    • Transform a column based on the value of another

      • e.g. redacting a score for all beneficiaries over age 89 

    • Anonymize all instances of an infoType in a column 

    • Remove/suppress a row entirely based on the content that appears in any column

      • e.g. hide all rows pertaining to beneficiaries over age 89

    • Transform findings only when specific conditions are met on another field 

      • e.g. redact PERSON_NAME if the value in AGE column > 89

    • Transform findings using a cryptographic hash transformation 
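Although the Centre does not currently use de-identification, a minimal sketch of one of the options above (character masking) may help in assessing it; the project id and example text are hypothetical:

```python
# Sketch: mask every character of a matched finding with "#".
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical GCP project

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                # With no infoTypes listed here, the transformation applies
                # to all infoTypes named in inspect_config below.
                "primitive_transformation": {
                    "character_mask_config": {"masking_character": "#"}
                }
            }
        ]
    }
}

response = client.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "PHONE_NUMBER"}]},
        "item": {"value": "Call +254 712 345678 for assistance."},  # invented
    }
)
print(response.item.value)  # the phone number is replaced by '#' characters
```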

Risk analysis

  • Four techniques to quantify the level of risk associated with a dataset (a conceptual sketch follows this list): 

    • k-anonymity: A property of a dataset that indicates the re-identifiability of its records. A dataset is k-anonymous if the quasi-identifiers for each person in the dataset are identical to those of at least k – 1 other people also in the dataset.

    • l-diversity: An extension of k-anonymity that additionally measures the diversity of sensitive values for each column in which they occur. A dataset has l-diversity if, for every set of rows with identical quasi-identifiers, there are at least l distinct values for each sensitive attribute.

    • k-map: Computes re-identifiability risk by comparing a given de-identified dataset of subjects with a larger re-identification (or "attack") dataset. Cloud DLP does not know the attack dataset, but it statistically models it by using publicly available data (such as the US Census), by using a custom statistical model (indicated as one or more BigQuery tables), or by extrapolating from the distribution of values in the input dataset. The two datasets (the sample dataset and the re-identification dataset) share one or more quasi-identifier columns.

    • Delta-presence (δ-presence): Estimates the probability that a given user in a larger population is present in the dataset. This is used when membership in the dataset is itself sensitive information. As with k-map, Cloud DLP does not know the attack dataset, but statistically models it using publicly available data, user-specified distributions, or extrapolation from the input dataset.
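In Cloud DLP these metrics run as risk-analysis jobs over BigQuery tables. Purely as a conceptual illustration of the first metric, k-anonymity can be computed directly over a small table (the quasi-identifier columns and values below are hypothetical):

```python
# Conceptual sketch of k-anonymity, computed with pandas rather than DLP.
import pandas as pd

df = pd.DataFrame(
    {
        "age_group": ["18-25", "18-25", "26-35", "26-35", "26-35"],
        "location": ["Camp A", "Camp A", "Camp B", "Camp B", "Camp B"],
    }
)

quasi_identifiers = ["age_group", "location"]
# k is the size of the smallest group of records sharing identical
# quasi-identifier values: every record is indistinguishable from at
# least k - 1 others on those columns.
k = df.groupby(quasi_identifiers).size().min()
print(f"k-anonymity: {k}")  # 2 for this toy table
```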

Testing DLP: assess the accuracy of the scan in context

Testing detection mechanisms

In testing DLP, we set out to answer several key questions: 

  • Can DLP detect all or most types of sensitive data that we have encountered thus far? 

  • How well does DLP work on our microdata set compared to the placebo set? 

  • How accurately do DLP infoTypes match the PII and key variables we are trying to catch? 

  • Combining the different infoTypes based on their level of accuracy, can we establish a set of criteria to classify any given dataset as sensitive or non-sensitive? 

To explore these questions, the Centre conducted six rounds of testing on a microdata set – in other words, a ‘sensitive’ set – and a placebo set. Each set contained ~70 datasets to be scanned. We found that... [*include test 6 results analysis, which don’t currently exist but are technically most accurate? Include our test 4 results analysis?*] 

At a strategic level, we recommend that organizations flag the time investment required to configure and test DLP before committing to it, and assess early on whether the tool is able to meaningfully scan their data. This work should involve a cross-functional team that includes both technical specialists and contextual experts.

The results of our testing produced two new questions: 

  • Can we introduce a complementary tool to process the DLP outputs and indicate whether or not a dataset is sensitive? 

  • Can we train or create an alternative tool that uses the training data generated so far to scan and classify new data as sensitive?

Interpreting DLP: develop actionable criteria for classifying data as sensitive

Limitations of the raw output of DLP 

Depending on the number of infoTypes scanned for and the amount of sensitive data present in the resource, the raw output of a DLP scan may comprise anywhere from thousands to millions of rows of matches. We found that the sheer number of results for microdata files made close analysis impractical.
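A first step we found useful is condensing the raw findings into per-infoType counts before any closer review. A sketch, assuming a simplified JSON structure for the scan output (the real .dlp.json file described below may differ):

```python
# Sketch: summarize raw scan output as counts per (infoType, likelihood).
import json
from collections import Counter

with open("example.dlp.json") as f:  # hypothetical scan output path
    findings = json.load(f)

# Assumed simplified structure: a list of findings, each carrying an
# infoType name and a likelihood string.
counts = Counter((f["infoType"]["name"], f["likelihood"]) for f in findings)

for (info_type, likelihood), n in counts.most_common(10):
    print(f"{info_type:20s} {likelihood:12s} {n}")
```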

A prediction-based approach

  • The output from a DLP scan needs further processing to arrive at an answer to whether or not a given dataset is sensitive (i.e. a binary yes/no)

  • We developed a model that averages over random forest, generalized linear, and gradient boosting models to predict the sensitivity of datasets (a sketch follows this list)

  • The model is built on the output from DLP scans of datasets specifically prepared for training; infoTypes, quotes, and likelihoods are the main elements used in the analysis

  • The model is then tested on the DLP scan output of new datasets, and its accuracy in predicting their sensitivity is recorded

  • The QA officer manually reviews the new datasets for sensitivity and accepts or rejects the model’s prediction
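For readers who want the shape of such a model, here is a minimal scikit-learn sketch of probability-averaging across the three model families named above; the feature matrix is a toy stand-in for features derived from DLP output, not the Centre's actual pipeline:

```python
# Sketch: soft-voting (probability-averaging) over a random forest, a
# generalized linear model (logistic regression), and gradient boosting.
import numpy as np
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((40, 5))     # toy: 40 training datasets, 5 features each
y_train = rng.integers(0, 2, 40)  # 1 = sensitive, 0 = non-sensitive (from manual QA)

model = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("glm", LogisticRegression(max_iter=1000)),
        ("gbm", GradientBoostingClassifier()),
    ],
    voting="soft",  # average the predicted probabilities of the three models
)
model.fit(X_train, y_train)

X_new = rng.random((3, 5))  # toy features for three newly scanned datasets
print(model.predict_proba(X_new)[:, 1])  # probability each dataset is sensitive
```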

***

Main.json file

  • Contains the name of the resource, the file size, when it was scanned, links to the other two files, and information about the scan (how many infoTypes were detected and what was used during detection)

  • Overwritten with the probability prediction from the algorithm 

  • Overwritten again when the QA officer reviews the prediction and makes a decision

DLP output file

  • .dlp.json: contains all terms and infoTypes detected 

Debug file

  • Kept as a log of how the scan went; consulted if questions arise

All three files are stored in an Amazon S3 bucket separate from the resources themselves. Each file name is prefixed with the time (the date, hour, and minute) when the resource was scanned, as sketched below.
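A sketch of this storage layout using boto3; the bucket name, key scheme, and example payloads are assumptions for illustration only:

```python
# Sketch: write the three scan artifacts to S3 under a time prefix.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
bucket = "hdx-dlp-scan-output"  # hypothetical bucket name
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M")
resource_name = "example_survey"  # hypothetical resource

files = {
    "main.json": json.dumps({"resource": resource_name, "scanned_at": stamp}),
    "dlp.json": json.dumps([{"infoType": "PERSON_NAME", "likelihood": "LIKELY"}]),
    "debug.log": "scan completed without errors",
}

for suffix, body in files.items():
    # Time-prefixed keys keep each scan's three files grouped together.
    s3.put_object(Bucket=bucket, Key=f"{stamp}/{resource_name}.{suffix}", Body=body)
```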

Using DLP: model the Centre’s workflow

Current process diagrams: 

