THE SQL Server Blog Spot on the Web

John Paul Cook

De-identification of Personal Health Information

Many people have asked me various questions about Personal Health Information (PHI), which is covered by the Health Insurance Portability and Accountability Act (HIPAA, not HIPPA). The process of removing data that could violate someone’s privacy (de-identification or anonymization) is complex. This is particularly true when there is unstructured data (i.e., free text). The U.S. Department of Health and Human Services has detailed guidelines on proper de-identification techniques, which are found here or at Bing’s cached copy here.

I found the guidelines to be very informative. The discussion on ZIP codes was interesting. ZIP codes, particularly in areas that aren’t densely populated, have to be abbreviated to the first three digits. Even when you restrict a ZIP code to the first three digits, there is a list of 17 specific three-digit ZIP codes that you cannot use at all.
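
The ZIP code rule can be sketched in a few lines of Python. This is a minimal illustration, not an official implementation: the function name is hypothetical, the choice to emit "000" for a restricted or unparseable value is an assumption of this sketch, and the restricted set shown is a made-up placeholder. The real list has to be taken from the HHS guidance.

```python
import re

def generalize_zip(zip_code: str, restricted_prefixes: frozenset) -> str:
    """Reduce a ZIP or ZIP+4 code to its first three digits.

    Prefixes in restricted_prefixes, and anything that does not parse as a
    ZIP code, are replaced with "000" (an assumption of this sketch).
    """
    match = re.fullmatch(r"(\d{5})(?:-\d{4})?", zip_code.strip())
    if match is None:
        return "000"  # fail closed on input that cannot be parsed
    prefix = match.group(1)[:3]
    return "000" if prefix in restricted_prefixes else prefix

# Placeholder only: NOT the real restricted list from the HHS guidance.
HYPOTHETICAL_RESTRICTED = frozenset({"999"})

print(generalize_zip("77002", HYPOTHETICAL_RESTRICTED))
print(generalize_zip("99950-1234", HYPOTHETICAL_RESTRICTED))
```

Passing the restricted prefixes in as a parameter keeps the function usable when the list is revised.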

The document says that age must be removed from a patient’s record if the patient’s age is greater than 89. Can you imagine a patient summary beginning with “The patient is a 107-year-old man…”? For supercentenarians, age does provide a clue as to who they are. Changing a patient’s age or date of birth helps greatly in de-identification, but care must be taken. You don’t want to make an adult a minor or vice versa.
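
Both age concerns can be expressed as small checks. The sketch below assumes that ages above 89 are collapsed into a single "90+" bucket and that 18 is the adult threshold; the function names and the threshold constant are hypothetical choices for illustration, not taken from the guidelines.

```python
ADULT_AGE = 18   # assumed adult threshold; adjust for the jurisdiction
MAX_EXACT_AGE = 89

def generalize_age(age_years: int) -> str:
    """Return the exact age as text, or the bucket "90+" for ages above 89."""
    if age_years < 0:
        raise ValueError("age cannot be negative")
    return "90+" if age_years > MAX_EXACT_AGE else str(age_years)

def is_safe_shift(age_years: int, offset_years: int) -> bool:
    """True if shifting the age keeps it valid and on the same side of ADULT_AGE."""
    shifted = age_years + offset_years
    if shifted < 0:
        return False
    return (age_years >= ADULT_AGE) == (shifted >= ADULT_AGE)

print(generalize_age(107))
print(is_safe_shift(19, -2))
```

A shift that fails the check would need a different offset rather than being applied as-is.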

Consider this statement in the medical record: “The patient became ill after eating a [insertNameOfReligiousHolidayHere] meal.” One could argue that removing the name of a religious holiday makes for a neutral record. That might be the appropriate thing to do, but there could be clinical value in knowing the religious holiday or the religion of the patient. It could be useful to know if certain things would be eaten or definitely not eaten.

There are published algorithms for processing textual data and de-identifying it. You can download Perl regular expression scripts from PhysioNet for free. The download also includes several dictionaries that the scripts use. Notice there is a dictionary of medical terms and several dictionaries of people’s names. Obviously there is value in knowing if a word is a person’s name or a medical term. DeBakey appears in the SNOMED dictionary as a medical term. There is a DeBakey clamp, a DeBakey pump, and a DeBakey graft. But what if the patient’s name was DeBakey? DeBakey does not appear in the dictionary of common names. Would the scripts recognize DeBakey as a medical term and not remove what actually is the patient’s name?
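
The DeBakey problem is easy to reproduce with a toy dictionary lookup. The sketch below is not the PhysioNet algorithm; it is a hypothetical illustration with tiny made-up dictionaries and an assumed title heuristic ("Mr.", "Dr.", and so on) to show why a plain lookup is not enough and why context has to be considered.

```python
TITLES = {"mr", "mrs", "ms", "dr"}  # assumed cue words, for illustration only

def classify_token(tokens, index, names, medical_terms):
    """Label one token as 'name', 'medical', 'ambiguous', or 'unknown'."""
    key = tokens[index].lower().strip(".,")
    previous = tokens[index - 1].lower().strip(".,") if index > 0 else ""
    if previous in TITLES:
        return "name"  # context outranks the dictionaries
    in_names = key in names
    in_medical = key in medical_terms
    if in_names and in_medical:
        return "ambiguous"
    if in_names:
        return "name"
    if in_medical:
        return "medical"
    return "unknown"

# Toy dictionaries; real ones would contain many thousands of entries.
names = {"smith", "jones"}
medical_terms = {"debakey", "clamp", "graft"}

tokens = "Mr. DeBakey was treated with a DeBakey clamp".split()
for i, token in enumerate(tokens):
    print(token, classify_token(tokens, i, names, medical_terms))
```

Without the title check, both occurrences of DeBakey would be labeled as medical terms and the patient's name would survive in the output.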

A known weakness of processing textual data against dictionaries is misspelled words and names. Some names are particularly difficult to spell and will not always be found in the dictionaries because of the inevitable misspellings.
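
One common way to soften the misspelling problem is approximate matching. The sketch below uses Python's standard difflib module; the function name, the toy name list, and the 0.8 similarity cutoff are assumptions chosen for illustration, and the published scripts may handle misspellings differently or not at all.

```python
import difflib

def closest_known_name(word, known_names, cutoff=0.8):
    """Return the most similar known name, or None if nothing is close enough.

    known_names is expected to hold lowercase strings.
    """
    matches = difflib.get_close_matches(word.lower(), known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

known_names = ["nguyen", "smith"]  # toy list for illustration

print(closest_known_name("Nguyan", known_names))
print(closest_known_name("aspirin", known_names))
```

Lowering the cutoff catches more misspellings but also flags more ordinary words and medical terms as names, so the threshold has to be tuned against real text.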

Published Wednesday, July 10, 2013 11:25 PM by John Paul Cook


Alex Thomas said:


I think there may be an issue with the link to the US Dept of Health guidelines on de-identification techniques.

July 15, 2013 10:52 PM

John Paul Cook said:

If you can't reach the HHS site, you can use the link I provided above to Bing's cached copy.

July 15, 2013 11:23 PM

Virginia said:

Thanks, John, very informative article. To clarify the ZIP code requirement, please note: the number of ZIP codes that you "cannot use at all" varies from year to year, as it has to do with the population density in a given year. You can use these ZIP codes, but you will have to change the population statistics associated with them: in essence, you simply need to make sure ZIP codes can't be recognized based on the population data statistics.

The HIPAA rule described here is called "Safe Harbor", and it is one of the simplified implementations of the requirement for the safety of the solutions. HIPAA also allows an alternative in which a qualified professional assesses the safety of the solutions, even if they are implemented using different methods.

September 4, 2014 1:59 AM

About John Paul Cook

John Paul Cook is a database and Azure specialist in Houston. He previously worked as a Data Platform Solution Architect in Microsoft's Houston office. Prior to joining Microsoft, he was a SQL Server MVP. He is experienced in SQL Server and Oracle database application design, development, and implementation. He has spoken at many conferences, including Microsoft TechEd and the SQL PASS Summit. He has worked in the oil and gas, financial, manufacturing, and healthcare industries. John is also a Registered Nurse currently studying to be a psychiatric nurse practitioner. He is a contributing author to SQL Server MVP Deep Dives and SQL Server MVP Deep Dives Volume 2.
