911 Dispatch Data Cleaning

by Ellie Newman

February 4, 2019

Ever since WPRDC was launched as our Pittsburgh-area open data portal, 911 call and dispatch data has been one of the most requested datasets. And no wonder: by looking at the time and location of different types of emergencies, community members, researchers, nonprofits and others can answer important questions about their community. What areas have the most vacant building fires? Has the overdose crisis eased off over the past few years?

Allegheny County’s 911 center handles emergency dispatch services for 111 of the 130 municipalities in the county. Trained operators take information about the call, determine the location of the emergency, and dispatch emergency responders according to the priority level of the call. The value of the dataset is obvious, and we initially thought it would be a quick release. But a closer look revealed some thorny questions about privacy and re-identification.

The call description field was extremely detailed: so much so that it sometimes revealed the victim’s age and medical information. Combined with precise location information, someone could potentially re-identify people using their own knowledge about ambulance calls to an area. For example, here’s a snapshot of some actual call types found in the computer-aided dispatch (CAD) system:

Sample call descriptions in CAD

Our goal was to find a method of aggregating the data that kept the useful information people cared about without releasing enough detail to compromise anonymity. We knew that it wasn’t so much the detail in any one field, but rather the combination of fields in a record, that threatened privacy. The three most potentially identifying pieces of data in each record were the date/time, location, and nature of the emergency. Of course, those were also the most interesting and useful fields, so we needed a way to aggregate each field just enough to de-identify the data, but not so much as to whitewash it.

We tested various combinations of data aggregation, and ultimately settled on the following:

  • Geocode to census block groups. Remove address and lat/long.

Downtown Pittsburgh-area census block groups. Block groups generally have about 600-3,000 residents.

  • Keep only quarter and year. Remove exact date and time.
  • Shorten call types, in most cases taking the first couple of words before the dash in the above sample.
  • Group sexual assaults with assaults.
  • Have our data privacy team review the resulting “short” call types and mark those relating to protected health conditions, such as pregnancy, diabetes, and mental health issues.
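As a rough illustration, the date and call-type steps above might look something like the Python sketch below. It is not the code we ran: the field names (`call_time`, `block_group`, `description`), the regrouping table, and the example record are all invented for illustration, and it assumes each record has already been geocoded to a block group.

```python
from datetime import datetime

# Hypothetical regrouping table. The only regrouping described in this post
# is that sexual assaults were grouped with assaults.
REGROUP = {"SEXUAL ASSAULT": "ASSAULT"}


def short_call_type(description, max_words=2):
    """Keep at most `max_words` words from the text before the first dash."""
    head = description.split("-", 1)[0]
    short = " ".join(head.split()[:max_words]).upper()
    return REGROUP.get(short, short)


def coarsen(record):
    """Return a new record holding only the aggregated fields.

    Address, lat/long, and the exact timestamp are dropped simply by not
    copying them into the result.
    """
    ts = record["call_time"]
    return {
        "block_group": record["block_group"],
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,  # months 1-3 -> 1, 4-6 -> 2, ...
        "short_type": short_call_type(record["description"]),
    }


# Made-up example record; none of these values come from the real dataset.
example = {
    "call_time": datetime(2018, 5, 17, 22, 41),
    "block_group": "420030201001",  # made-up block group ID
    "address": "123 Example St",
    "description": "Sexual Assault - made-up detail text",
}
coarsened = coarsen(example)
```

A rule this simple would stumble on descriptions that contain a hyphen before the separating dash, which is one reason the shortened types still needed a manual review.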

Finally, records were tested for unique combinations of census block group, quarter, and year. If the number of calls in a block group for one quarter was fewer than five, and the short call type could be considered sensitive (i.e., not fire- or traffic-related), then the call type was removed.
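This final check is a form of small-cell suppression, and a minimal sketch of it might look like the following. It makes several assumptions: the count is taken over all calls in a block group for a given quarter and year, "sensitive" means anything outside a short list of non-sensitive types, and removing the call type means setting it to `None`. The `NON_SENSITIVE` set and the field names are hypothetical stand-ins, not our actual lists.

```python
from collections import Counter

# Hypothetical list of call types treated as non-sensitive.
NON_SENSITIVE = {"FIRE", "TRAFFIC"}
MIN_CELL_SIZE = 5  # cells with fewer calls than this are considered small


def suppress_small_cells(records):
    """Blank the call type of sensitive records that fall in small cells.

    A "cell" is one (block_group, year, quarter) combination. The input
    records are not modified; new dicts are returned where a change is made.
    """
    counts = Counter(
        (r["block_group"], r["year"], r["quarter"]) for r in records
    )
    result = []
    for r in records:
        cell = (r["block_group"], r["year"], r["quarter"])
        is_sensitive = r["short_type"] not in NON_SENSITIVE
        if counts[cell] < MIN_CELL_SIZE and is_sensitive:
            r = {**r, "short_type": None}  # keep the record, drop the type
        result.append(r)
    return result
```

Note that the record itself stays in the output, so call counts per block group remain accurate even where the call type has been removed.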

You can still make a detailed map with this data, or analyze trends over time. But if you’re looking up ambulance calls to your street to find out if your neighbor had a heart attack, you’ll be foiled: there will be too many records, each blurred just enough, for you to tell exactly what that call was for.