Five Steps Toward Effective Data Deidentification In The Healthcare Industry
Healthcare institutions sit on vast amounts of data. When analyzed and used for research, this data can yield insights that greatly improve medical care. The Health Insurance Portability and Accountability Act (HIPAA) is a widely accepted data protection regulation in the United States and sets the standard for how the healthcare industry can handle data.
In this article, I focus on how best to address HIPAA's Privacy Rule, which includes requirements for the deidentification of protected health information (PHI). In principle, "deidentification" means the elimination of reidentification risk.
However, compared with other organizations, healthcare institutions hold data elements that are more intricately connected and that often depend on each other for meaning. For instance, you cannot mask the gender of a patient and expect zero reidentification risk in all cases. What if the patient is pregnant or is undergoing gender reassignment surgery? In such cases, the diagnosis itself reveals patient information. The same applies to a patient's age. If a patient has a disease that affects only children, what use is it to mask the age so that it reflects an adult's? Again, the diagnosis creates a chance of reidentification, and the relational integrity of the database collapses.
Research conducted by Latanya Sweeney in 2000 indicated that three data elements — the five-digit zip code, birth date and gender — formed a quasi-identifier that could uniquely identify 87% of the American population. You can't mask birth date and gender in the previous examples, so that leaves the zip code. But let's say you have a dataset that you're using for a research purpose like "the incidence of cancer within five miles of a cement factory." In such a case, you cannot drastically change the zip code either because you have to keep it within that five-mile range.
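To make the quasi-identifier idea concrete, here is a minimal sketch of how one might measure what share of a dataset is unique on zip code, birth date and gender. The records, the field names and the `unique_fraction` helper are hypothetical and exist only for illustration; they are not taken from Sweeney's study.

```python
from collections import Counter

# Hypothetical toy records; field names are assumptions for illustration.
records = [
    {"zip": "02138", "birth_date": "1961-07-14", "gender": "F"},
    {"zip": "02138", "birth_date": "1961-07-14", "gender": "F"},
    {"zip": "02139", "birth_date": "1975-03-02", "gender": "M"},
    {"zip": "02139", "birth_date": "1980-11-23", "gender": "M"},
]

QUASI_IDENTIFIER = ("zip", "birth_date", "gender")


def unique_fraction(rows, fields=QUASI_IDENTIFIER):
    """Return the share of rows whose combination of `fields` occurs exactly once."""
    if not rows:
        return 0.0
    # Count how many rows share each combination of quasi-identifier values.
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    unique_rows = sum(
        1 for row in rows if counts[tuple(row[f] for f in fields)] == 1
    )
    return unique_rows / len(rows)


fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on {QUASI_IDENTIFIER}")
```

A record that is unique on these three fields can potentially be matched against an outside source that lists the same fields alongside a name, which is why this share is a useful first indicator of reidentification risk.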
Consequently, one of the main challenges to deidentification in healthcare is preserving analytical integrity.
So, how can a healthcare institution maintain the lowest possible risk to privacy while ensuring it gets the maximum benefit out of its data?
First, implement a robust and comprehensive data discovery tool to capture all data elements that can potentially identify an individual.
Second, identify those data elements that you cannot change without compromising the analytical integrity of the data.
Third, come up with a list of all data elements that you need to preserve and the ones that you need to anonymize.
Fourth, choose methods that have a high degree of anonymization to eliminate reidentification risk.
Fifth, come up with a list of anonymization methods that work best for the analytical purpose for which the dataset was intended.
However, note that the same dataset can be used for different analytical purposes. For instance, the same dataset could serve two research questions: "the incidence of cancer within five miles of a cement factory" and "the incidence of mercury poisoning in Japanese women in the U.S." In the first case, as discussed, the zip code has to stay within the five-mile range. In the second case, you cannot mask the individual's nationality without compromising the analytical value of the research. As a result, you have to develop a different ruleset, i.e., an additional list of anonymization methods, for each of these research purposes.
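One way to picture per-purpose rulesets is as a lookup from research purpose to a method for each field. The sketch below is only an illustration: the purpose keys, field names, helper functions and the specific choices of what to keep or generalize are all assumptions, and nothing here is meant to show what satisfies HIPAA.

```python
# Hypothetical anonymization methods; real projects would use vetted tooling.
def keep(value):
    return value


def suppress(value):
    return None


def year_only(value):
    # Generalize an ISO date such as "1961-07-14" to its year, "1961".
    return value[:4]


def zip3(value):
    # Generalize a five-digit zip code to its first three digits.
    return value[:3] + "**"


# One ruleset per research purpose (purpose names are assumptions).
RULESETS = {
    "cancer_near_cement_factory": {
        "zip": keep,            # location must stay within the five-mile range
        "birth_date": year_only,
        "diagnosis": keep,
    },
    "mercury_poisoning_japanese_women": {
        "zip": zip3,            # exact location is not needed here
        "birth_date": year_only,
        "gender": keep,
        "nationality": keep,    # masking this would defeat the research
        "diagnosis": keep,
    },
}


def anonymize(record, purpose):
    """Apply the ruleset for `purpose`; fields with no rule are suppressed."""
    rules = RULESETS[purpose]
    return {
        field: rules.get(field, suppress)(value)
        for field, value in record.items()
    }


# Hypothetical patient record for illustration only.
patient = {
    "name": "Jane Doe",
    "zip": "02138",
    "birth_date": "1961-07-14",
    "gender": "F",
    "nationality": "Japanese",
    "diagnosis": "mercury poisoning",
}

for purpose in RULESETS:
    print(purpose, anonymize(patient, purpose))
```

Suppressing any field that has no explicit rule is a deliberate default in this sketch: a newly added column is hidden until someone decides it is needed for a given purpose.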
Also note that selecting the right anonymization methods is only one piece of the puzzle. The central focus should be ensuring that deidentified data stays deidentified, without the risk of reidentification, whether the data stays with you or you share it with someone else.
As Latanya Sweeney's study illustrates, given enough data elements, you can almost always reidentify someone to a high degree of certainty.
Therefore, the key to deidentification is to always keep the risk of reversibility in mind. I believe the correct approach is to maintain multiple datasets, each suited to a specific analytical purpose, and to anonymize each appropriately without ever compromising the spirit of what HIPAA is asking for: irreversibility.
To understand more about the different types of identifiers and about using a risk-based approach to anonymization that goes beyond the standard rules of masking, you can read parts one and two of my series on the reidentification risk of masked datasets.