Special thanks to Adena Bowden for co-authoring this post.
On Friday, November 4, 2022, we met for the fourth installment of our twelve-week virtual data literacy series, “Data Literacy for Data Stewards,” where we talked about the importance of having contextual information about the data that we work with. Without context, it can be difficult to know who is creating data and how data is created. It can also be hard to identify how data may be biased, inaccurate, or incomplete. Additionally, a lack of context can reinforce the status quo.
To start the session with a fun activity, we presented a few datasets that were stripped of important context (titles, column names, and some fields) and asked participants to try to identify what the datasets were representing. Here’s an example – can you guess what it is?
This dataset is from the NYC Squirrel Census, a multimedia science, design, and storytelling project in New York City’s Central Park.
This activity, while fun, also pointed out the importance of providing context around data. Data is connected to the context of a place, and what might look like the same data can come from very different sources where people have very different ways of doing things. Context about data is needed if people are going to evaluate data quality and potential biases and use data to challenge power.
Some important questions that can be asked to uncover context about data include:
- Motivation: Why does this data exist, who created it, and what’s the history?
- Composition: What’s included or represented in the data? What isn’t? Is any of the data sensitive? Do any standards, laws, or policies apply to this data?
- Collection Process: How was the data collected or created? Who did the work? What tools were used? Was consent or fair compensation offered?
- Transformations: How was the data generated, processed, cleaned, and labeled? Were any algorithmic tools used?
- Uses: Who uses the data internally and externally? How is it used by people in different contexts? Are there other potential use cases?
- Sharing: Is the data shared? Is it publicly available? What rules exist around sharing?
- Maintenance: How are the data and data infrastructure funded, managed, and maintained?
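One way to keep track of the answers to these questions is to store them alongside the dataset in a small structured record. The sketch below is only an illustration and was not part of the workshop: the `DataContext` class, its field names, and the `missing` helper are hypothetical names chosen to mirror the seven categories above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataContext:
    """Hypothetical context record: one free-text field per question category."""
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    transformations: str = ""
    uses: str = ""
    sharing: str = ""
    maintenance: str = ""

    def missing(self) -> List[str]:
        """Return the categories that have not been documented yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Placeholder answers only; a real record would hold what the data steward told you.
record = DataContext(
    motivation="Placeholder: why the data exists and who created it.",
    sharing="Placeholder: whether and how the data is shared.",
)
print(record.missing())
# → ['composition', 'collection_process', 'transformations', 'uses', 'maintenance']
```

Listing the undocumented categories makes the missing context visible, so it can be raised with whoever stewards the data.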
After being introduced to these types of questions, participants worked together in breakout groups to practice developing interview questions for a data steward from one of the five scenarios presented in the workshop. An interview with someone who knows a lot about the context of a dataset can uncover many important details. We encouraged participants to think about questions whose answers could (1) highlight structural causes of an issue, (2) uncover data quality issues, biases, and limitations, (3) expose power, privilege, and hypocrisy, and (4) enable people to contest unjust systems and claim power.
Several of our breakout groups chose the following scenario tied to a 3-1-1 non-emergency service request system.
The City Manager in your community has proposed a new budgeting system that allocates a portion of the budget based on the number of service requests made to the new 3-1-1 system. 3-1-1 streamlines the reporting process for non-emergency service requests, which include pothole repair, vacant lot cleanup, snow removal, and playground maintenance. As someone who works to provide social services in many communities where people have low incomes, you’re worried about the biases that may result from using this data to determine budget allocations through a formula or algorithm. The city council representative in District 10 shares your concerns and has scheduled a call with you and the director of the 3-1-1 program.
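To make the concern in this scenario concrete, here is a minimal sketch of what a request-based formula might look like. It assumes a simple proportional split. The scenario does not specify the actual formula, and the function name, district labels, request counts, and budget figure are all invented for illustration.

```python
def allocate_by_requests(requests_by_district, budget):
    """Split `budget` across districts in proportion to their request counts."""
    total = sum(requests_by_district.values())
    if total == 0:
        raise ValueError("no requests recorded")
    return {d: budget * n / total for d, n in requests_by_district.items()}


# Hypothetical numbers: suppose districts A and B have the same underlying need,
# but residents of B report only half of their issues (for example, because of
# barriers to using 3-1-1).
reported = {"A": 100, "B": 50}
print(allocate_by_requests(reported, budget=300_000))
# → {'A': 200000.0, 'B': 100000.0}
```

Under these assumptions, district B receives half as much funding as district A despite having the same need, because the formula measures reporting rather than need.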
Here are some of the questions our participants felt were important to ask the 3-1-1 data steward:
- Why does this data exist? What is the purpose of 3-1-1 service requests?
- How do you currently allocate a budget to address complaints?
- How are you relating the number of 3-1-1 complaints to the budget?
- What do you think the number of service requests tells us about budget needs versus other types of data points?
- What data is in the 3-1-1 data set (e.g., location, whether the requester is a resident or business owner, other variables about the person)? What might be missing or overrepresented?
- How do people know about 3-1-1? Who uses 3-1-1?
- What are the most common types of complaints?
- Where are complaints addressed the fastest? (Is it in higher income neighborhoods?)
- How is the geography of the caller captured?
- How are duplicate requests handled in the system?
- What barriers may exist for people to access and use the 3-1-1 system?
- What percentage of tickets are submitted via phone versus online?
- What are the pre-defined categories? What happens with the unique calls and how do they get sorted and included in the data?
- Is there a difference in recording/reporting methods depending on who answers the 3-1-1 call that day?
- How does the data collection differ in well- and under-represented areas? What efforts might have been made to correct imbalances identified?
- Are you planning on making inferences to underrepresented areas based on other data?
- Are you able to attribute the number of complaints to a single source? Is this a neighborhood issue or a personal one?
- How are 3-1-1 requests distributed across the city? Have calls ever been mapped?
- Is the data publicly available?
- Are follow-up complaints reported in the data set too? Can they be separated out?
- Who can see 3-1-1 data? Are there any publicly facing dashboards or other sharing of the information?
- Are there reports issued on requests received and resolved?
- How many people use this for community development planning?
- How is personal information protected?
- How many complaints have been successfully addressed? And in what categories?
- How often is the data updated?
- Are data audited for quality? How often?
- What additional information will inform the budget allocations?
- How does the 3-1-1 data align or not align with actual investments in cleanups, physical infrastructure, etc.?
To end our session, we brainstormed practices we can build to capture important context about data. Here’s what participants shared:
Practices for people who have some power or control over data
- Consider how reports will affect different populations (especially those that experience oppression) and the unintended consequences that might result from sharing data
- Involve the community in data collection
- Pay people fairly to collect data and share information
- Practice “using” data before the data collection process is finalized
- Talk with people who will be affected by the decisions made based on the data collected
- Shift power to those who don’t historically have the power
Practices for people who don’t have power or control over data
- Ask about how the data is collected and the reason(s) behind the methodology
- Think of the media format used to capture the data and what biases or barriers that format might introduce
- Advocate for data transparency
- Ask hard questions of folks who do have power over data that can reveal inequities and bias
- Record limitations in data
- Ask if you can be involved in data analysis and dissemination
- When using data without context, ask for context; if you can’t get context, be transparent about the context that is missing and how it impacts the analysis/project
Reading list / sources:
- “Datasheets for Datasets” by Gebru et al.
- Data Feminism by D’Ignazio and Klein
- All Data Are Local by Loukissas
For examples of data guides to capture and share context, we encourage you to check out the following:
- Allegheny County Property Assessments – Property Assessment Data User Guide Allegheny County – WPRDC
- WPRDC Data Guides
- A Guide to Home Mortgage Disclosure Act Data | Urban Institute
- Crime and Punishment in Chicago (archive.org)
This week (November 18, 2022), we will continue our Data Literacy series with a discussion on classification and representation. We’ll explore the different classification systems we use for data, and the values, assumptions, and power hierarchies that are embedded within them. We’ll work together to explore how we can maximize value and minimize harm from the use of classification systems in data and create practices that accurately represent how people want to be represented in them.
If you are interested in participating in the next cohort of our Data Literacy for Data Stewards peer learning series starting in the first quarter of 2023, email us at email@example.com and we will let you know when registration is open.