Census 2020 +/- 2: Census, Differential Privacy, and the Future of Data
by Justin Hong
The decennial census is mandated in the United States Constitution for the purpose of apportioning members of the House of Representatives among states (Article I, Section 2). Every 10 years, from the initial U.S. Census in 1790 to date, a census of the population has taken place. These data are an essential component of our representative government in the United States. Further, these data have been used for other purposes outside of the constitutional mandate. For example, in 1790, Census data were used to assess the country’s industrial and military potential, and more recently they have been used to distribute federal funds such as those for the Medical Assistance Program and Federal Pell Grant Program. Given the consequential nature of Census data, accuracy is of utmost importance.
However, with the development of new statistical techniques, advances in computing technology, and the increasing availability of data, the pursuit of accuracy now faces a new challenge, privacy, that is in direct tension with it. Throughout its history, the Census Bureau (or those authorized to carry out its mandate) has sought to collect and report accurate data; meanwhile, the call to protect the privacy of respondents has grown. The opposing goals of accuracy and privacy have led the Census toward a new disclosure avoidance method, differential privacy, that aims to protect privacy while still providing reasonably accurate data to users. This change signals a new phase of disclosure avoidance and data reporting at the U.S. Census Bureau. Indeed, this may be the beginning of a new world of privatized data in general.
A Brief History of Data Accuracy and Data Privacy
Obtaining a complete enumeration of the population has always been a primary concern of the Census Bureau. Accuracy has, therefore, been a pursuit of the Census since its beginning. The Census has gone to great lengths to ensure an accurate count of the population and has focused on improving all parts of the process, including training census takers, marketing to increase awareness, and conducting post-enumeration surveys to determine the extent of under-enumeration. The Census has also made efforts to count individuals who are often undercounted, such as those staying in homeless shelters, soup kitchens, bus and rail stations, and dormitories, as well as those living in hotels or motels on a permanent basis. The Census continues to work to provide an accurate enumeration of the population.
In contrast with data accuracy, privacy protection has a different history. For the original 1790 Census, there were no privacy protections. It wasn’t until 1840 that any assurance of confidentiality was given, and then only to businesses. That assurance was made into law 70 years later, in 1910. In 1940, another 30 years later, those disclosure protections were extended to people. In 1954, Census privacy laws were consolidated in Title 13 of the U.S. Code. The law provides that 1) the data are used for their intended purpose, 2) individuals in the data cannot be identified, and 3) no unauthorized individual can examine the raw data. To comply with the requirement that individuals responding to the Census cannot be identified, the Census Bureau has employed numerous techniques to mask the identity of “discoverable” individuals in the Census data. These techniques include data suppression, compression, rounding, top-coding, and swapping. The Census has determined, however, that these techniques are no longer sufficient to ensure the privacy of data.
Differential Privacy
The histories of data accuracy and data privacy have come together at the Census in the form of a mathematical formula that quantifies both privacy and accuracy simultaneously. The amount of privacy can be increased, but only at the expense of accuracy. The Census refers to this method of avoiding the disclosure of individual identities in the data as differential privacy.
Differential privacy as a method is not a single approach; rather it is a “criterion that many tools for analyzing sensitive personal information have been devised to satisfy.” To satisfy this criterion, a pre-specified amount of random noise is added to the data. The random injection of noise into the data means that 1) one cannot determine the exact value of an output dataset or statistic beforehand and 2) the results, after injecting noise, would be different were one to duplicate the process. In aggregate, the infusion of noise into a dataset often does not make a substantial difference to conclusions drawn from the data, particularly because the amount of noise added can be adjusted to limit its effect on those conclusions.
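To make the idea of adding a pre-specified amount of noise concrete, here is a minimal sketch of one textbook approach, the Laplace mechanism, applied to a single count. It is an illustration under stated assumptions, not the Census Bureau's disclosure avoidance system: the function names, the privacy parameter values (epsilon), and the count of 1,000 are all hypothetical. A smaller epsilon means more privacy and a wider spread of noise.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws, each with mean
    # `scale`, follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    # Assumption: adding or removing one person changes a count by at most 1,
    # so the textbook Laplace mechanism uses a noise scale of 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

def mean_abs_error(epsilon, trials=20000, seed=0):
    # Average size of the noise over many simulated releases.
    rng = random.Random(seed)
    total = sum(abs(laplace_noise(1.0 / epsilon, rng)) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    rng = random.Random(2020)
    true_count = 1000  # hypothetical count, for illustration only

    # Repeating the process gives a different published value each time.
    print(noisy_count(true_count, 1.0, rng))
    print(noisy_count(true_count, 1.0, rng))

    # Less privacy (larger epsilon) means less noise, and vice versa.
    for epsilon in (0.1, 1.0, 10.0):
        print(epsilon, round(mean_abs_error(epsilon), 2))
```

Running the sketch shows both properties described above: the published value cannot be known beforehand, and duplicating the process yields a different result.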
However, when analyzing smaller subsets of the data, there is a real possibility that our conclusions will differ significantly from those drawn from the original data. This is because picking a random number from a specified distribution carries a real probability that a number near the extremes (tails) of the distribution is selected, and an error of a given size matters far more to a small count than to a large one. Thus, differential privacy is limited in the amount of granularity it can provide for analytical purposes. This limitation has real consequences for a state like Hawaii.
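The effect on small subsets can be sketched with a little arithmetic. Assuming, purely for illustration, that the same Laplace noise is added to every count regardless of its size, the expected error is the same number of people for a large community as for a small subgroup, so it is a far larger share of the small one. The noise scale of 10 and the three counts below are hypothetical, and the Census Bureau's actual system is more involved than this.

```python
import math

def expected_relative_error(true_count, scale):
    # For Laplace noise, the expected absolute error equals the scale,
    # so the expected error as a share of the count is scale / count.
    return scale / true_count

def prob_error_at_least(threshold, scale):
    # Tail probability for Laplace noise: P(|noise| >= threshold).
    return math.exp(-threshold / scale)

SCALE = 10.0  # hypothetical noise scale, chosen only for illustration

# Hypothetical counts spanning a large community down to a small subgroup.
example_counts = {
    "large community": 40000,
    "small community": 2000,
    "small subgroup": 12,
}

for label, count in example_counts.items():
    share = expected_relative_error(count, SCALE)
    p_half = prob_error_at_least(0.5 * count, SCALE)
    print(f"{label}: expected error is {share:.2%} of the count; "
          f"chance the error is at least half the count: {p_half:.3f}")
```

Under these assumptions the noise is negligible for the large community but can easily overwhelm the small subgroup.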
Implications for Hawaii and the Future of Data
Hawaii is unique in terms of its history, culture, place, and, as it relates to Census data, people. Using the Office of Management and Budget definitions of race, Hawaii is the only state where Whites do not make up a majority of the population. In fact, over 40 percent of the population is Asian and more than 1 in 5 are Native Hawaiian. Further, people from all over the world, from Polynesia and Micronesia to Japan, Korea, and the Philippines, call Hawaii home. Geographically, Hawaii residents live on seven different islands. Each island is unique in its diversity and communities. For example, approximately 40,000 people live in Waipahu, nearly 80 percent of whom are Asian. By comparison, just over 2,000 people live in Hana, where 30 percent are Asian but nearly 60 percent are Native Hawaiian. Other areas, such as Hickam Housing (with a population of nearly 7,000), are more than three-quarters White due to the military population there.
The diversity of the people in Hawaii and the islands’ unique communities of various sizes mean that a significant number of individuals in smaller subpopulations (either demographic or geographic) may “disappear” or “move” according to differentially private Census data. For example, Pepeekeo on Hawaii Island had a population of 1,789 people according to the 2010 Census. However, the differentially private “count” of the population is 1,185, a loss of 604 individuals (one-third of its population). Further, the 12 individuals reporting their race as Black or African American alone and the six reporting American Indian and Alaska Native alone disappeared altogether and were reported as zero in the differentially private data. This is a more extreme example from a small geographic area, but issues may arise when analyzing data at even the county level.
A population pyramid, for instance, is useful for understanding the structure of a given population. When looking at the differentially private population pyramid of Native Hawaiians and Other Pacific Islanders alone in Kauai County, we find that the structure of the population is distorted even though the total population does not differ dramatically between the original and differentially private counts (see Figure 1). In the differentially private data, 117 males ages 5 to 9 disappeared, while 139 males ages 70 to 74 appeared. Other similar shifts appear in the differentially private data for the 2010 Census, resulting in an abnormal-looking population pyramid.
It is important to note that the results and examples given above are based on the Census Bureau’s 2010 Census Demonstration File, which is intended to elicit feedback from users to inform the ongoing development of the Census’s disclosure avoidance system. As such, these results are not final, and the Census has been responsive to user feedback. The hope is that in responding to such feedback, the Census will provide accurate enough results (from the new disclosure avoidance system) so that all subpopulations and communities are able to utilize the data to benefit their respective groups. It seems likely, however, that at least some subpopulations and communities will not be accurately reflected in the published Census data.
So, what should data users do when faced with the problem of inaccurate data? First, we should acknowledge that the data landscape is changing and will likely continue to change. Acknowledging this reality will allow us to prepare and devise strategies for accessing and utilizing the data appropriately. Before examining the data, there are several questions we should answer:
1) What is our purpose?
2) What specific question or questions do we need answered from the data?
3) How will we use the answer to inform a decision or action?
4) What level of accuracy is required of the data to be confident in our decision or action?
5) Do the data provide enough accuracy for us to be confident?
The answers to these questions help us understand 1) which data to analyze, 2) how the results ought to be used, 3) the level at which the data should be analyzed and reported, and 4) the level of confidence we can have in any interpretation of the data.
We are not accustomed to answering questions 4 and 5 above when looking at census data. These data are not reported with measures of reliability, as they are supposed to be a complete count of the population (in practice they are not, which will be discussed in a later post). As the new census data will be reported with error added to the original counts[1], it is helpful to think about the upcoming census data as “estimates” with an associated margin of error. In this way, we have a template for thinking about and treating new census data.[2]
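As a rough sketch of what treating a published count as an estimate could look like, the example below computes a margin of error and compares it with the accuracy a decision requires. It rests on assumptions that may not hold for the actual Census release: that the added error follows a Laplace distribution, that its scale is known to the data user, and that no further processing alters the counts. The function names, the scale of 10, and the 5 percent tolerance are hypothetical.

```python
import math

def laplace_margin_of_error(scale, confidence=0.90):
    # Half-width t such that P(|noise| <= t) equals the confidence level
    # for Laplace noise: t = scale * ln(1 / (1 - confidence)).
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return scale * math.log(1.0 / (1.0 - confidence))

def accurate_enough(published_count, scale, tolerance, confidence=0.90):
    # True when the margin of error is no larger than the share of the
    # published count that the decision at hand can tolerate.
    margin = laplace_margin_of_error(scale, confidence)
    return margin <= tolerance * published_count

if __name__ == "__main__":
    scale = 10.0      # hypothetical noise scale
    tolerance = 0.05  # hypothetical requirement: within 5 percent

    for published_count in (2000, 12):  # hypothetical published counts
        margin = laplace_margin_of_error(scale)
        ok = accurate_enough(published_count, scale, tolerance)
        print(f"count={published_count}, margin of error=+/-{margin:.1f}, "
              f"accurate enough: {ok}")
```

Whether such a calculation is possible in practice depends on whether measures of error are published alongside the statistics.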
While one cannot predict the future, it seems likely that this change at the Census is the beginning of a new trend in data privacy protection. If this is true, it means that data users will need to become more sophisticated in both planning for and interpreting the data, or rely more heavily on those familiar with these kinds of data.
[1] Not all, but most, of the census data will have error added to the counts. The total state population is an example of a count that does not have added error.
[2] As of the writing of this post, it is not clear whether the Census will be publishing measures of error associated with the reported statistics. However, the recommendation to do so has been made.
Justin Hong is a consultant to the Hawaii Data Collaborative.