By: Nafi Pouye and Mety Sahlu
Humanitarian organizations regularly collect individual survey data to assess people’s needs and respond to crises. Such datasets, called microdata, are shared on HDX only after removing Personally Identifiable Information (PII) in accordance with the HDX Terms of Service. Nevertheless, by combining what are called ‘key variables’ in the microdata, it is still possible either to narrowly estimate a survey respondent’s confidential information or to disclose it exactly. This ‘disclosure risk’, or ‘re-identification risk’, is a major concern when working with humanitarian data.
Not all organizations are aware that even if PII (e.g., names and contact information) and Demographically Identifiable Information (e.g., GPS coordinates) are removed from individual survey data, a combination of key variables, such as age, marital status, and location, could still point to a specific individual (e.g., a 14-year-old widow in a given camp). Over several months, the Centre team has checked all individual survey data shared on HDX. We have also reached out to organizations to inform them of the potential risk their microdata may present.
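To make this risk concrete, here is a minimal Python sketch that counts how often each combination of key variables occurs in a small table. The records, field names, and the choice of key variables are invented for illustration and are not drawn from any dataset on HDX. A combination that occurs only once marks a record that could be singled out.

```python
from collections import Counter

# Hypothetical survey records with direct identifiers already removed.
records = [
    {"age": 14, "marital_status": "widowed", "camp": "Camp A"},
    {"age": 31, "marital_status": "married", "camp": "Camp A"},
    {"age": 31, "marital_status": "married", "camp": "Camp A"},
    {"age": 45, "marital_status": "single", "camp": "Camp B"},
    {"age": 45, "marital_status": "single", "camp": "Camp B"},
]

# Assumed key variables (quasi-identifiers) for this illustration.
KEY_VARIABLES = ("age", "marital_status", "camp")


def key_of(record, key_variables=KEY_VARIABLES):
    """Return the tuple of key-variable values for one record."""
    return tuple(record[name] for name in key_variables)


def unique_combinations(rows, key_variables=KEY_VARIABLES):
    """Return the key-variable combinations that occur exactly once."""
    counts = Counter(key_of(row, key_variables) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


uniques = unique_combinations(records)
print(uniques)  # [(14, 'widowed', 'Camp A')]
```

No single column identifies anyone here: "Camp A" alone matches three records. Only the combination of the three variables isolates one respondent.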
Our approach to handling microdata
To handle the microdata shared on HDX, we use an open-source software package for Statistical Disclosure Control (SDC) called sdcMicro. The tool was developed by Statistics Austria, the Vienna University of Technology, the International Household Survey Network (IHSN), PARIS21 (OECD), and the World Bank.
The SDC process in sdcMicro is divided into three steps:
Perform disclosure risk assessment by identifying the key variables.
Apply SDC methods to reduce the risk of disclosing information on individuals.
Re-measure the risk and quantify the information loss.
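The three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not sdcMicro's implementation or the workflow used on HDX: the records are hypothetical, the risk measure is simply the number of records whose key combination is shared by fewer than k records, the SDC method shown is global recoding of age into ten-year bands, and information loss is approximated by the drop in distinct age values.

```python
from collections import Counter

# Hypothetical records; "age" and "camp" are the assumed key variables.
records = [
    {"age": 14, "camp": "A"},
    {"age": 17, "camp": "A"},
    {"age": 31, "camp": "A"},
    {"age": 33, "camp": "A"},
    {"age": 45, "camp": "B"},
    {"age": 46, "camp": "B"},
]
KEYS = ("age", "camp")


def records_at_risk(rows, keys=KEYS, k=2):
    """Count records whose key combination is shared by fewer than k records."""
    counts = Counter(tuple(r[name] for name in keys) for r in rows)
    return sum(1 for r in rows if counts[tuple(r[name] for name in keys)] < k)


def recode_age(rows, width=10):
    """Global recoding: replace each exact age with a band such as '10-19'."""
    recoded = []
    for r in rows:
        low = (r["age"] // width) * width
        recoded.append({**r, "age": f"{low}-{low + width - 1}"})
    return recoded


def distinct_values(rows, name):
    """Number of distinct values of one variable (a crude utility proxy)."""
    return len({r[name] for r in rows})


# Step 1: assess the disclosure risk on the chosen key variables.
risk_before = records_at_risk(records)

# Step 2: apply an SDC method (here, global recoding of age).
protected = recode_age(records)

# Step 3: re-measure the risk and quantify the information loss.
risk_after = records_at_risk(protected)
ages_before = distinct_values(records, "age")
ages_after = distinct_values(protected, "age")

print(risk_before, risk_after)   # 6 0
print(ages_before, ages_after)   # 6 3
```

In this toy example every record is unique before recoding and none is afterwards, at the cost of reducing six exact ages to three bands.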
When new microdata is shared on HDX, the HDX team follows the process described in the workflow below.
...
The sdcMicro tool is a useful start, but we have observed its limitations. Only three SDC methods are available for categorical variables, that is, variables that take values from a finite set (e.g., gender). The process can also be slow for large microdata. Finally, the risk-utility trade-off between lowering the disclosure risk and limiting information loss is difficult to manage for microdata that have a high risk of disclosure.
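The risk-utility trade-off can be illustrated with a small Python sketch. It is a simplified illustration, not a measure that sdcMicro computes: the ages are hypothetical, risk is the share of records whose age band holds fewer than k records, and utility is approximated by the number of distinct bands that remain.

```python
from collections import Counter

# Hypothetical ages of ten respondents.
ages = [14, 17, 22, 25, 31, 33, 38, 45, 46, 52]


def band(age, width):
    """Return the (low, high) band of the given width that contains age."""
    low = (age // width) * width
    return (low, low + width - 1)


def trade_off(values, width, k=2):
    """Return (share of records at risk, number of distinct bands kept)."""
    banded = [band(v, width) for v in values]
    counts = Counter(banded)
    at_risk = sum(1 for b in banded if counts[b] < k)
    return at_risk / len(values), len(counts)


# Wider bands lower the risk but keep fewer distinct categories.
for width in (1, 5, 10, 20):
    risk, categories = trade_off(ages, width)
    print(width, risk, categories)
# 1 1.0 10
# 5 0.6 8
# 10 0.1 5
# 20 0.0 3
```

Each step towards lower risk removes detail from the data, which is why choosing the level of protection requires judgment about how the data will be used.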
Exploring other open-source tools
Different research institutions and statistical offices have developed generic or specifically tailored SDC tools and made them openly available to the public. To find alternatives that mitigate the shortcomings we identified in sdcMicro, we explored the ARX Data Anonymization Tool, developed by the Technical University of Munich, and the μ-ARGUS tool, developed by Statistics Netherlands.
To compare these tools in terms of computation time and scalability, we selected five microdata sets based on their size and complexity and assessed their disclosure risks before and after applying SDC. The SDC process in ARX is utility-focused. μ-ARGUS and sdcMicro have no advanced feature for assessing the risk-utility trade-off, so they rely more heavily on human expertise and effort. Unlike sdcMicro and μ-ARGUS, ARX can detect key variables automatically: it provides a method for detecting the HIPAA identifiers, that is, attributes that must be modified according to the Safe Harbor method of the US Health Insurance Portability and Accountability Act (HIPAA). For large microdata, risk assessment in ARX also takes less computation time than in μ-ARGUS or sdcMicro.
For the microdata selected in this test, μ-ARGUS ran fairly quickly, as it only required the key variables in the input file. Because it provides an automatic way of combining key variables based on their identification level, working with varied combinations was also simpler. However, it is limited in assessing disclosure risks when the number of key variables is greater than 10.
Glossary of Terms
Demographically Identifiable Information (DII) is defined as either individual and/or aggregated data points that allow inferences to be drawn that enable the classification, identification, and/or tracking of both named and/or unnamed individuals, groups of individuals, and/or multiple groups of individuals according to ethnicity, economic class, religion, gender, age, health condition, location, occupation, and/or other demographically defining factors.
Disclosure Risk/Re-Identification Risk occurs if an unacceptably narrow estimation of a respondent’s confidential information is possible or if exact disclosure is possible with a high level of confidence. Disclosure risk also refers to the probability that successful disclosure could occur.
Key Variables: Also called “quasi-identifiers”, key variables are a set of variables that, in combination, can be linked to external information to re-identify respondents in the released dataset.
Personally Identifiable Information (PII): Also called “direct identifiers”, PII are variables that reveal directly and unambiguously the identity of a respondent, e.g., names, social identity numbers.
Statistical Disclosure Control (SDC): Statistical Disclosure Control techniques are a set of methods to reduce the risk of disclosing information on individuals, businesses or other organizations.