Special thanks to Adena Bowden for her support in delivering this series and for co-authoring this blog post.
Last Friday, December 16, 2022, we hosted the eighth meeting of our workshop series, “Data Literacy for Data Stewards,” where we discussed practices for protecting privacy when sharing data with the public. The intended outcomes of our discussion were (1) understanding the risks and benefits that can come from public data sharing, (2) becoming familiar with frameworks for balancing risks and benefits, and (3) learning about methods to protect the people included in data.
Data privacy is a broad topic. Other areas related to data privacy that we didn’t cover include personal privacy protection; privacy issues created by extractive data collection, data sharing, surveillance, and data manipulation; risks resulting from data management, data security, and data storage practices; and decisions about collecting and publicly releasing data.
To begin thinking about privacy when sharing data, we talked about what kinds of information fall into the category of Personally Identifiable Information (PII). PII is data that can directly identify an individual, such as personal identifiers (e.g. name, Social Security number, credit card number), contact information (e.g. physical address, phone number), and medical or biometric data (e.g. insurance ID, fingerprints, signature), or data that can identify an individual in combination with other information, such as date of birth, gender, age, race, and communications data.
When considering whether to make data public, it is important to assess the potential benefits along with the potential risks or harms that the information may pose to the people included in the dataset. The following graphics from the Open Data Privacy guide by the Berkman Klein Center for Internet & Society at Harvard University provide a framework for data stewards to assess benefits and risks when sharing data:
When thinking about the benefits and risks of sharing data, we must ask the following questions:
- What are some ways that there could be personal, economic, or societal benefit from sharing this data?
- Who are potential users of this information?
- What might malicious actors, governments, corporations, and even other members of the community do with the data to harm people and communities that have little power?
We shared definitions from NIST of what constitutes low, moderate, and high impact risks:
- LOW impact means that an individual may face an inconvenience, such as having to change a phone number;
- MODERATE impacts can include monetary loss due to identity theft, discrimination, denial of benefits, and possible blackmail;
- HIGH impacts include “serious physical, social, or financial harm, resulting in potential loss of life, loss of livelihood, or inappropriate physical detention.”
Several options exist for managing the risks associated with sharing data. We used a case study from New York City, the public release of a dataset of taxi trips, to discuss steps that could have been taken to protect the identities of the people in the data:
- Records or fields can be removed to protect sensitive information;
- Data can be generalized, e.g. by decreasing the precision of geographic coordinates or coarsening dates and times (see the code sketch after this list);
- Data can be aggregated to protect individual records;
- Records can be suppressed where someone might be identified in the data;
- Anonymous identifiers can be created to prevent identification;
- Random noise can be inserted into a dataset for obfuscation;
- Other statistical techniques such as differential privacy and synthetic data can be applied to a dataset to address privacy while releasing data of value.
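To make a few of these techniques concrete, below is a minimal Python sketch of generalization and noise insertion, loosely modeled on the taxi-trip example. The coordinate precision, the hour-level truncation, and the epsilon value are illustrative assumptions, not parameters from the workshop or the NYC case study.

```python
import random
from datetime import datetime

def generalize_coordinates(lat, lon, decimals=2):
    """Round coordinates to roughly ~1 km precision so an exact
    pickup or drop-off address can no longer be recovered."""
    return round(lat, decimals), round(lon, decimals)

def generalize_timestamp(ts):
    """Truncate a timestamp to the hour, coarsening exact times
    that could be matched against outside records."""
    return ts.replace(minute=0, second=0, microsecond=0)

def add_laplace_noise(value, sensitivity=1.0, epsilon=0.5):
    """Perturb a numeric value with Laplace noise, the mechanism
    behind basic differential privacy; a smaller epsilon means
    more noise and stronger privacy."""
    scale = sensitivity / epsilon
    # The difference of two exponential draws is a Laplace sample
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return value + noise

# Blurring one hypothetical taxi-trip record
print(generalize_coordinates(40.761234, -73.978456))       # (40.76, -73.98)
print(generalize_timestamp(datetime(2013, 7, 4, 18, 47)))  # 2013-07-04 18:00:00
print(add_laplace_noise(12, sensitivity=1, epsilon=0.5))   # e.g. 13.7
```

Generalization trades precision for safety, while noise insertion keeps record-level detail but makes any individual value uncertain; choosing these parameters well requires more formal analysis than this sketch suggests.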
Participants worked together in breakout groups to apply these frameworks to the following scenario:
“You work at the County and you were asked to make a version of the 911 emergency response data publicly available to cut down on the number of Right to Know requests. The burden of responding to each one of those requests can be substantial.”
911 Data includes:
- Dispatch date and time;
- Response time;
- Address;
- Coordinates;
- Type of call (structure fire, cardiac arrest, domestic disturbance, overdose, etc.);
- Responding agency;
- Unstructured notes from the dispatcher.
We asked participants to think about what could be gained by sharing this data. They saw the following benefits in its use:
- Saving county staff time in responding to Right to Know requests;
- Promoting government transparency;
- Providing an understanding of community needs and demand for services;
- Informing public service budgets;
- Locating overdose prevention and other crisis services;
- Exploring potential alternatives to policing;
- Investigating response times.
Participants were concerned about the possible harms that could come from sharing this data. These include:
- Further stigmatizing or stereotyping communities and providing justification for disinvestment and discrimination;
- Providing justification for over-policing;
- Risks to domestic violence callers;
- Disclosure of health conditions;
- Monitoring emergency response to identify opportunities to commit crimes;
- Discouraging people from calling 911 due to fear of exposure;
- Disclosure of LGBTQ+ identities and safe spaces;
- Identifying individuals who placed the calls;
- Exposing people to extortion, phishing attacks, predatory practices, and solicitation of unwanted services; and
- Damaging someone’s reputation.
Participants assessed the risks and benefits of sharing the data as-is using the framework provided in the Open Data Privacy guide.
Participants proposed the following mitigation strategies to protect the privacy of the individuals in the dataset, then re-assessed risks and benefits:
- Generalizing or aggregating addresses/coordinates (city neighborhoods, blocks, etc.);
- Redacting names/identifiers from unstructured notes or removing unstructured notes altogether;
- Generalizing or aggregating date and time of the call;
- Generalizing the type of the call;
- Removing sensitive records, or suppressing records that could single someone out because they occur in small numbers (k-anonymity; see the sketch below).
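To illustrate that last strategy, here is a minimal Python sketch of small-cell suppression in the spirit of k-anonymity. The field names and the threshold are hypothetical; a real release would choose the quasi-identifiers and the value of k based on a careful re-identification analysis.

```python
from collections import Counter

def suppress_small_groups(records, quasi_ids, k=5):
    """Keep only records whose combination of quasi-identifiers
    (e.g. neighborhood + call type) appears at least k times;
    rarer combinations are suppressed because they could single
    someone out."""
    def key(record):
        return tuple(record[q] for q in quasi_ids)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]

# Hypothetical, already-generalized 911 records
calls = [
    {"neighborhood": "Bloomfield", "call_type": "Structure fire"},
    {"neighborhood": "Bloomfield", "call_type": "Structure fire"},
    {"neighborhood": "Shadyside", "call_type": "Domestic disturbance"},
]
# With k=2, the lone Shadyside record is dropped
print(suppress_small_groups(calls, ["neighborhood", "call_type"], k=2))
```

Suppression also interacts with generalization: coarser neighborhoods and call types produce larger groups, so fewer records have to be dropped.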
In closing, participants emphasized the critical need to protect the privacy of individuals represented in data. The discussion highlighted how important it is to think about who might be affected when data is made public, especially those who are most vulnerable to harm. The group suggested we all consider who is at the table (and who is not) when deciding which information to share and how, and that we carefully weigh the benefits against the potential for harm.
For more information about making data publicly available, we encourage you to check out the following resources:
- Open Data Privacy guide;
- NIST Guide to Protecting the Confidentiality of Personally Identifiable Information;
- WPRDC Data Guide page on Protecting Privacy;
- Data SF Open Data Release Toolkit.
Resources covering other aspects of privacy and data protection (just scratching the surface with these links):
- Our Data Bodies Digital Defense Playbook;
- Electronic Frontier Foundation;
- Future of Privacy Forum;
- Privacy resources from the Library Freedom Project;
- OCHA Data Responsibility Guidelines.
On Friday, January 6, 2023, we will continue our series with a workshop on algorithms and surveillance. Participants will learn techniques for evaluating technology and for preventing harm from technology that is already in use.
If you are interested in participating in the next cohort of our Data Literacy for Data Stewards peer learning series starting in the first quarter of 2023, email us at wprdc@pitt.edu and we will let you know when registration is open.