Vulnerability of anonymous gene databases to data breaches


A new study shows that anonymous genetic databases are vulnerable to identity theft and data breaches. Researchers warn of the consequences.


A study has raised concerns that a type of genetic database increasingly popular among researchers could be exploited to reveal participants' identities or link private health information to their public genetic profiles.

Single-cell datasets can contain information about gene expression in millions of cells collected from thousands of people. These data are often freely available and provide a valuable resource for researchers studying the effects of disease at the cellular level. The data are supposed to be anonymized, but a study published on October 2 in the journal Cell [1] shows how genetic data from one study “can be exploited to uncover private information about individuals in another study,” the authors write.

The results highlight the difficulty of balancing researchers' interests with donor privacy. "Our genomes are very identifying. They can say a lot about us, our characteristics and our susceptibilities to disease," says study co-author Gamze Gürsoy, a bioinformatics researcher at Columbia University in New York City. “You can change your credit card number if it becomes public, but you can’t change your genome.”

Sensitive data

Privacy concerns in genetic datasets have been raised before, but have focused primarily on “bulk data” of genetic profiles. These contain information about gene activity averaged across a large cell population rather than individual cells.

It was previously thought that single-cell datasets would be less vulnerable to data breaches because of the level of "noise," or variation in gene expression, between different cells. But Gürsoy and her team showed that this is not the case.

The team examined three publicly available single-cell datasets that included blood cells from people with lupus, a chronic autoimmune disease. The researchers found that they could use gene expression data to predict the structure of a person's genome by combining these values with information about expression quantitative trait loci (eQTLs). Details of eQTLs, which are genetic variants that correlate with gene expression, are also publicly available in single-cell datasets.

To test the reliability of their work, the researchers checked their genome predictions against a genome database that corresponded to the cells used. They were able to link most datasets to the corresponding genome, with an accuracy rate of over 80%.
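The linkage idea described above can be illustrated with a toy sketch in Python. This is not the authors' method: the gene names, expression values, eQTL table and reference panel are all invented, and the nearest-mean rule is a deliberately simple stand-in for a real statistical model. The sketch averages noisy per-cell values for one donor, guesses the allele count at each eQTL from the averaged expression, and then looks for the best-agreeing entry in a hypothetical genotype panel.

```python
# Toy linkage sketch. Every name and number here is hypothetical.

# Assumed eQTL table: for each gene, the expected mean expression for
# 0, 1 or 2 copies of the alternate allele at the linked variant.
EQTL_MEANS = {
    "GENE_A": (2.0, 4.0, 6.0),
    "GENE_B": (9.0, 6.0, 3.0),
    "GENE_C": (1.0, 1.5, 2.0),
    "GENE_D": (5.0, 7.0, 9.0),
}


def pseudobulk(cells):
    """Average noisy per-cell expression into one profile per donor."""
    genes = cells[0].keys()
    return {g: sum(c[g] for c in cells) / len(cells) for g in genes}


def predict_genotype(profile):
    """Pick, per gene, the allele count whose expected mean is closest."""
    return {
        gene: min(range(3), key=lambda k: abs(profile[gene] - means[k]))
        for gene, means in EQTL_MEANS.items()
    }


def best_match(predicted, panel):
    """Return the panel ID whose genotypes agree with the prediction most often."""
    def agreement(genotype):
        return sum(predicted[g] == genotype[g] for g in predicted)
    return max(panel, key=lambda pid: agreement(panel[pid]))


# Three noisy cells from one "anonymous" donor in a public dataset.
donor_cells = [
    {"GENE_A": 5.1, "GENE_B": 3.9, "GENE_C": 1.2, "GENE_D": 7.4},
    {"GENE_A": 6.8, "GENE_B": 2.2, "GENE_C": 0.7, "GENE_D": 6.5},
    {"GENE_A": 6.4, "GENE_B": 3.2, "GENE_C": 1.1, "GENE_D": 7.3},
]

# Hypothetical genotype panel, e.g. profiles shared on a genealogy site.
panel = {
    "person_1": {"GENE_A": 0, "GENE_B": 1, "GENE_C": 0, "GENE_D": 2},
    "person_2": {"GENE_A": 2, "GENE_B": 2, "GENE_C": 1, "GENE_D": 1},
    "person_3": {"GENE_A": 1, "GENE_B": 0, "GENE_C": 2, "GENE_D": 0},
}

predicted = predict_genotype(pseudobulk(donor_cells))
print(best_match(predicted, panel))  # person_2
```

In this toy case the prediction is wrong at one of the four sites, yet the donor is still linked to the right panel entry, which hints at why cell-to-cell noise alone may not protect participants.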

Unlike gene expression data and eQTLs, full genome databases can typically be viewed only by scientists, to protect donors' identifying information. However, the researchers note that a participant's genomic data could be publicly available elsewhere. For example, the participant might have uploaded their genome to a genealogy website where users submit DNA samples to learn more about their ancestry. In this case, an attacker could identify a person whose cells are in a single-cell dataset by analyzing their genome. This could reveal personal data associated with a sensitive characteristic such as a psychiatric disorder, as research participants are often selected to study the biology of these complex conditions.

Data breaches like this could have real consequences, such as discrimination in the workplace, says Gürsoy. She adds that leaks could even impact future generations because genetic traits can be passed on to offspring. “Everything that is known about us is passed down through generations,” she says.

Bradley Malin, who researches large-scale genomic data sharing at Vanderbilt University in Nashville, Tennessee, describes the study as a "novel addition and contribution to the literature." He adds that future research could explore whether genomic data could also be linked in larger datasets containing samples from thousands or millions of people.

Competing interests

Scientists are unsure how best to address privacy concerns. “There is a desire to protect individual privacy but also a desire to advance medical research collectively, and unfortunately these are at odds with each other,” says Mark Gerstein, who researches medical data science at Yale University in New Haven, Connecticut. The simplest solution would be to make genetic data more difficult to access, but that would negatively impact research, he says. “We need to share and aggregate large amounts of information,” he explains. “If we block everything and make it more private, it really hinders the whole process.”

In their study, Gürsoy and her colleagues call for greater transparency about the risks to participants who share their genomic data and suggest that researchers should ensure that donors consent to sharing their data. Another possible route could be to encrypt personal data if it is part of a public database. The authors acknowledge that this would complicate the process of creating and maintaining records, but believe it could help protect participants' privacy.

  1. Walker, C.R. et al. Cell https://doi.org/10.1016/j.cell.2024.09.012 (2024).
