Privacy, accuracy, and the looming 2020 census
Data from the Census Bureau is used for everything from emergency planning to political redistricting. That’s why it needs to be accurate.
It also needs to be private. By law, census data needs to shield individual identities.
Why We Wrote This
While accuracy is important in a head count, so is individuals’ privacy. The Census Bureau has changed its process as it works to ensure identities are shielded – but some are concerned the data won’t be as useful.
But as computers get faster and hackers develop new tools, it’s getting harder for census officials to balance accuracy and privacy. For the 2020 census, they’re turning to a cutting-edge concept used by Apple and Facebook: “differential privacy.”
This is a framework that injects random inaccuracies, or “noise,” into census results, and then lets officials measure how much privacy that noise adds.
It’s like turning a dial. More noise equals more privacy, but less accuracy.
Census officials say they need this approach. They’ve long fuzzed data, but not formally. Old “disclosure avoidance methods” were “more art than science,” says chief scientist John Abowd.
But some users of census data are concerned results will become too inaccurate to use. Science, livelihoods, even lives might be at stake, they say.
“This is more broadly about trust in government. The most important thing for ensuring a fair and accurate count in 2020 is trust,” says Indivar Dutta-Gupta, of the Georgetown Center on Poverty and Inequality.
Ten seconds pass as Joe Salvo considers how New York City, his longtime home, could prepare an emergency response strategy for a hurricane – without using census data.
He draws a blank.
“Where do you go first? Where do you send resources that are by definition limited?” says the head of the population division at the city’s planning department. “We would try to come up with something to figure out where the resources should go, but it would always involve census data. There’s no other source like this.”
But soon that source might be less useful than it once was. Demographers, social scientists, and other data users like Mr. Salvo are concerned that their ability to draw upon 2020 census data for city planning and other routine purposes could be affected by a Census Bureau decision to adopt a more rigorous system to ensure and measure privacy for survey participants.
Census scientists say this framework – known as “differential privacy” – is necessary to help ensure hackers can’t take census data, mash it up with other public datasets, and identify individuals. But the shift has proved divisive. Critics argue that privileging the data’s privacy undermines its accuracy and utility and restricts the general public’s ability to access it. They say it could jeopardize everything from the data that cities use to prepare for natural disasters to the redistricting process that determines United States representatives.
“It’s not just the most important source in the social sciences,” says Steven Ruggles, a University of Minnesota history and population studies professor. “It’s one of the most widely used scientific resources in the world.”
The Census Bureau has long used ad hoc tweaks to balance accuracy and privacy in its data. Years of technological innovation and highly public data breaches have complicated that task. Switching to differential privacy allows the bureau to better see just how much privacy protection might be needed. While tightening up could rankle data users, a privacy breach could erode public trust – something in short supply since the Trump administration tried to defy the Supreme Court and add an unpopular citizenship question.
“The census is only as good as the public’s willingness to participate, and that participation hinges significantly on perceptions that the Census Bureau will keep personal information confidential,” says Terri Ann Lowenthal, a former congressional aide for a House subcommittee that oversees the census. “If public confidence in the confidentiality of data erodes, then participation in the census will likely decline.”
The case for differential privacy
By law, the U.S. Census Bureau is required to protect the privacy of individuals and establishments, ensuring that they can’t be identified from census-released data.
That used to just mean withholding names. But as computers became more powerful, it became theoretically possible to combine outside databases such as property records, credit reports, and voter rolls with census data tables on age, ethnicity, geography, and so forth to try to get a statistical picture of actual Americans.
To guard against this, the bureau has long injected inaccuracies, or “noise,” into the data. It has declined to release details of those efforts, lest the process be reverse engineered.
The bureau’s old “disclosure avoidance methods” were “more art than science,” says John Abowd, the chief scientist.
Meanwhile, computer science has continued to advance. The bureau’s interest in differential privacy has been driven in part by the development of the database reconstruction theorem, which holds that given enough published summary tables, researchers can reconstruct approximate records of the individuals behind them.
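To see why that matters, consider a toy sketch (every figure here is invented for illustration, not drawn from any actual bureau table): a handful of published summary statistics about a tiny block can be brute-forced back into exact individual ages.

```python
from itertools import combinations_with_replacement
from statistics import mean, median

# Hypothetical published table for one tiny census block (all numbers
# invented): three residents, mean age 44, median age 30, one resident
# under 18, and a mean adult (18+) age of 63.
published = dict(count=3, mean=44, median=30, under_18=1, adult_mean=63)

def consistent(ages):
    """True if a candidate set of ages matches every published statistic."""
    adults = [a for a in ages if a >= 18]
    return (
        mean(ages) == published["mean"]
        and median(ages) == published["median"]
        and sum(a < 18 for a in ages) == published["under_18"]
        and len(adults) > 0
        and mean(adults) == published["adult_mean"]
    )

# Brute-force every combination of ages 0-99. Exactly one candidate
# survives, so the "anonymous" table has revealed each resident's age.
candidates = combinations_with_replacement(range(100), published["count"])
print([ages for ages in candidates if consistent(ages)])  # -> [(6, 30, 96)]
```

Real attacks scale this idea up: with thousands of interlocking tables, a solver can recover approximate records for millions of people, which can then be matched against outside databases.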
The bureau has never had a data breach, but internal tests of the 2010 census showed that it could match the race and ethnicity of nearly 20% of the 308,745,538 people counted using publicly available information. Verifying those matches required access to confidential bureau records, which limits the attack’s utility and potential harm, but the very fact that the data was vulnerable alarmed researchers across the field.
Guarding against this is a big reason the Census Bureau has turned to differential privacy, which offers data scientists a tantalizing prospect: a way to confidently measure the extent of data’s confidentiality.
Popularized by tech companies like Apple and Facebook, differential privacy is a system that formally injects noise into data and then produces a numerical value that describes how much privacy loss a person will experience with a given noise amount. The term “epsilon” is used to symbolize this value.
It’s a trade-off: More noise means more privacy, but less accuracy. Less noise means less privacy, but information might be more usable to researchers.
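A minimal sketch of that dial, using the classic Laplace mechanism from the differential privacy literature (the bureau’s actual production system is far more elaborate; the counts and epsilons below are illustrative only):

```python
import random

def privatize_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism: add noise with scale sensitivity/epsilon.
    One person entering or leaving a block changes a population count
    by at most 1 (the sensitivity), so the noisy release satisfies
    epsilon-differential privacy for that count."""
    lam = epsilon / sensitivity
    # The difference of two independent exponentials is Laplace-distributed.
    noise = random.expovariate(lam) - random.expovariate(lam)
    return true_count + noise

# Turning the dial: small epsilon = strong privacy, noisy counts;
# large epsilon = weak privacy, counts close to the truth.
for epsilon in (0.1, 1.0, 10.0):
    noisy = [round(privatize_count(1000, epsilon)) for _ in range(5)]
    print(f"epsilon={epsilon}: {noisy}")
```

With epsilon at 0.1, a block of 1,000 people might be reported as 985 or 1,022; at 10, the reported count rarely strays more than a person or two from the truth.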
“The decision to balance accuracy with privacy is not a scientific or technical decision. It’s a political, moral, ethical decision,” says Indivar Dutta-Gupta, co-executive director of the Georgetown Center on Poverty and Inequality. “People whose lives, livelihoods, and well-being depend on the data will generally share both goals of ensuring that privacy is protected and that there is some accuracy. And we need to sort of think through where we strike that balance.”
Most people agree that the bureau needs stronger confidentiality protections, but there is still debate over how much noise to inject. Differential privacy will require the bureau to tune the trade-off between privacy and accuracy, much as one fiddles with a knob to adjust the temperature of a bath.
“When [the Census Bureau] puts out a public table, it can’t ignore the fact that someone could try to match things with outside data,” says Mr. Dutta-Gupta. “As far as I can tell, differential privacy is the only way to think this through. I think a lot of the concerns that people have about it could be addressed in part just by how you spend the privacy budget [and] where you set the epsilon.”
Why differential privacy may be an overreaction
Where the bureau sets the epsilon matters a lot to Andrew Beveridge, a sociology professor at Queens College and the Graduate Center of the City University of New York. He uses a lot of redistricting data in his research and worries that differential privacy will introduce so much noise as to make it unusable.
“It would be fine if it would behave like real data; I have no problem with that,” he says. “I just ... don’t know if it does [behave like that]. There still is some evidence that it doesn’t.”
Dr. Beveridge even suggests the bureau could be sued if the data is too fuzzy. Dr. Abowd downplays that concern and says redistricting data will be as robust as in previous censuses. What’s changed is that the noise injection is public.
Mr. Salvo of the New York planning department has accepted the inevitability of noise, but he wants the bureau to “empirically demonstrate that [the noise] will not damage what is the essence or the mission of the bureau: to give us data.”
Other experts call the shift a “radical reinterpretation” that too greatly privileges confidentiality.
“I think the Census Bureau chief decision-makers are underestimating the possibility that there could be a real crisis in confidence if at some point in the future, it’s discovered that differential privacy caused the Census Bureau data to be relied upon when it was in fact not accurate,” says Jane Bambauer, a law professor at the University of Arizona.
In late September, the bureau will release test products based on the 2010 census to show how different levels of noise might affect the data. The move has some data users cautiously optimistic, but the bureau has yet to say how differential privacy will apply to the more granular American Community Survey, which is even more widely used by data users and the public.
“It would have been helpful for the Census Bureau to convene its key stakeholders who are data users and other experts in the technology and data fields before it publicly rolled out its plan for differential privacy,” says Ms. Lowenthal, the former congressional aide. “It would have been better to have more buy-in before announcing a plan and I think would have gone a long way towards ensuring public confidence in whatever final method the bureau settles on.”
To his credit, Dr. Abowd largely agrees with that assessment.
“It took a while for all of us at the Census Bureau to understand how to message this to very diverse interest groups,” he says. “I think our willingness to continuously improve the way we’re doing that should be taken as evidence that we understand we haven’t always effectively communicated the message.”
What is decided here will have far-reaching consequences, says Mr. Dutta-Gupta.
“This is not just about the census, it’s about other surveys, it’s about future censuses. And this is more broadly about trust in government. The most important thing for ensuring a fair and accurate count in 2020 is trust.”