ADSC Working to Analyze Biomedical Data without Risking Individuals' Privacy
ADSC is kicking off a new A*STAR-funded project that will eliminate significant privacy risks that currently impede researchers' ability to analyze biomedical data about individuals. Immense amounts of valuable data now exist that are unusable by the research community due to the lack of an effective method for concealing individuals' identities. The new ADSC work will generate new publication schemes for the results of data analyses, thus making detailed summaries of health data available that can offer unprecedented insight into a vast range of medical conditions and provide useful input for urban planners, public health officials, and researchers.
The S$2 million project, which is entitled "Enabling Mining of Medical Records through Differential Privacy," is led by ADSC Director Prof. Marianne Winslett. Her co-principal investigators include Prof. Xiaokui Xiao of Nanyang Technological University, Prof. Jiawei Han of the Department of Computer Science at the University of Illinois at Urbana-Champaign, Dr. See Kiong Ng of the Institute for Infocomm Research, and Prof. Nikita Borisov of the Department of Electrical & Computer Engineering at Illinois. The team also includes biomedical researchers Dr. E Shyong Tai from the National University of Singapore and Dr. Edison Liu from the Genome Institute of Singapore.
The widespread availability of biomedical data, ranging from reports of the locations of new cases of dengue fever to individuals' genomic variations, appears to offer researchers a tremendous opportunity. Statistical analysis of such data can help researchers and public health officials better understand a disease and its transmission patterns, gain new insights into the human body, and develop new treatments and services that can improve the quality of life of millions of people.
Unfortunately, privacy concerns make it infeasible to provide researchers with unlimited access to biomedical information. Previous attempts to solve this problem have tried to anonymize data by removing personally identifiable information from medical records, but this does not provide sufficient protection. The main problem is that external knowledge can be used to re-identify individuals whose data appear in supposedly anonymized data sets. Many ideas for mitigating the problem have been proposed, but all of them make the unrealistic assumption that adversaries have only limited prior knowledge.
"In fact, this has been shown to be a fundamental barrier," explains Winslett. "An anonymized database will either reveal private information, given certain external knowledge -- or will be useless for answering some questions."
To the extent that databases of patient information have already been made available, they have made many lifesaving discoveries possible. For example, a University of San Antonio study involving data collected from over 9,000 breast cancer patients showed that amplification of the HER-2 oncogene was a significant predictor of both overall survival and time to relapse in patients with breast cancer. This information subsequently led to the development of Herceptin (trastuzumab), a targeted therapy that is effective for many women with HER-2-positive breast cancer. Likewise, it was medical records research that led to the discovery that supplementing folic acid during pregnancy can prevent neural tube birth defects (NTDs), and population-based surveillance systems later showed that the number of NTDs decreased 31 percent after mandatory fortification of cereal grain food products. No one doubts that additional valuable findings would follow if a way to tackle the privacy limitations could be found, so that far more patient data could be made available to researchers.
To that end, medical studies funded by the National Institutes of Health (NIH) in the U.S. are required to make the data they collect, as well as summaries of analysis results, available to other researchers. Originally, the statistical summaries were freely available to other researchers via NIH's dbGaP database (http://www.ncbi.nlm.nih.gov/gap), while access to the detailed patient records required researchers to undergo a rigorous and fairly arduous approval process with their Institutional Review Boards (IRBs). Privacy concerns subsequently led NIH to restrict dbGaP access, so that today many of the statistical summaries cannot be viewed without IRB approval. The need for IRB approval is a significant hurdle for researchers who want to access the summary statistics from old studies to help them plan their future work.
To find a practical solution, the ADSC team is using the recently developed concept of "differential privacy." Differential privacy works by adding a small amount of noise to the results of statistical analyses of sensitive data sets. Under differential privacy, the contributions of any one individual's data towards the outcome of an analysis are negligible; analysis results will be essentially identical regardless of whether a particular person's data are included. This should not limit the usefulness of the results, since in a large and well-designed medical study, the history of a single individual should not have a significant impact on overall results. When analysis of a data set begins, its owners decide on a total "privacy budget" for the entire data set. Each published analysis result uses up a little bit of the privacy budget, and once the budget has been exhausted, no more results can be published, as they could open the possibility of at least one individual's data having a non-negligible impact on overall results.
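The noise-and-budget idea described above can be illustrated with a minimal sketch. It is not the project's method: it assumes the textbook Laplace mechanism for counting queries, uses the conventional parameter name epsilon for the privacy cost of each release, and treats the costs of successive releases as simply adding up. The names PrivacyBudget and noisy_count, the synthetic records, and the chosen budget values are all hypothetical and serve only as illustration.

```python
import random


class PrivacyBudget:
    """Hypothetical helper: tracks how much of a total epsilon budget is spent."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted; nothing more can be published")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with rate 1/scale
    # follows a Laplace distribution centred at 0 with the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, budget, rng):
    """Publish a count with Laplace noise of scale 1/epsilon.

    Adding or removing one person's record changes a count by at most 1,
    so a noise scale of 1/epsilon is the standard choice for a counting query.
    """
    budget.charge(epsilon)  # refuse to answer once the budget is used up
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Synthetic, made-up records (assumption: not real patient data).
records = [{"age": 34, "sex": "M"}, {"age": 61, "sex": "F"}, {"age": 37, "sex": "M"}]
rng = random.Random(0)
budget = PrivacyBudget(total_epsilon=1.0)

males_30s = noisy_count(
    records, lambda r: r["sex"] == "M" and 30 <= r["age"] <= 39, 0.5, budget, rng
)
females = noisy_count(records, lambda r: r["sex"] == "F", 0.5, budget, rng)
print(males_30s, females, budget.spent)
```

After these two releases the hypothetical budget of 1.0 is fully spent, so a third call to noisy_count raises an error instead of publishing another result. A smaller epsilon per release gives more noise and stronger protection, but lets the data owners publish more results before the budget runs out.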
"Differential privacy offers us the tantalizing possibility of being able to do privacy-preserving data analysis that is both useful and secure," says Winslett. "It's such a new concept, but the implications are immense. Whoever comes up with a practical approach to differentially private access to biomedical data -- which is what we aim to develop with this new project -- will set off a free-for-all. It will open up so many new opportunities to revolutionize treatments and reduce health care costs."
The new project is starting by analyzing privacy issues in the statistics released by Singapore's Ministry of Health (MOH). Due to the potential for privacy breaches, MOH currently publishes detailed statistics only for highly dangerous infectious diseases, and only very sketchy information, such as the total numbers of male and female patients in Singapore, for other types of diseases. For instance, one report says that there was exactly one male patient aged 30-39 with relapsed tuberculosis in 2010. The team's goal is to make it possible to publish detailed statistics for all diseases, but with strong privacy-preservation guarantees.
The project's next step will be to investigate ways to re-enable open access to the summary information in dbGaP by making the summary tables differentially private. The researchers will also target other custodians and users of health-related statistics in Singapore. That work is projected to include applications in pharmacoeconomics and in analysis of hospital records to reveal the effectiveness of different treatments for a disease.
Winslett is quick to point out that several fundamental research challenges remain before differentially private analyses will be practical, but she is optimistic that ADSC has advantages that make it an ideal location for this research. In particular, Singapore is unique in its close cooperation among the government, the medical fraternity, and research institutes. This will give the ADSC researchers exceptionally good access to the parties who have a vested interest in broader dissemination of health data summaries. This concerted effort to bring together medical researchers, computer scientists, and medical records could one day enable Singapore to be a world leader in technologies for analyzing sensitive data.
The Advanced Digital Sciences Center (ADSC) is a research center in Singapore for University of Illinois faculty that is funded by Singapore's Agency for Science, Technology and Research (A*STAR) to do research in the areas of interactive digital media and power grid information technology. The new data mining grant is part of a growing portfolio of research funds awarded to ADSC outside of its core A*STAR funding.