Tucuxi-BLAST: Enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
Background. Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge. Methods. We present Tucuxi-BLAST, a versatile tool for probabilistic RL that uses a DNA-encoded approach to encrypt, analyze and link massive administrative databases. Tucuxi-BLAST encodes the identification records into DNA. The BLASTn algorithm is then used to align the sequences between databases. We tested and benchmarked the tool on a simulated database containing records for 300 million individuals and on four large administrative databases containing real data on Brazilian patients. Results. Our method was able to overcome misspellings and typographical errors in administrative databases. For the largest simulated dataset (200k records), the state-of-the-art method took 5 days and 7 h to perform the RL, while Tucuxi-BLAST took only 23 h. When compared with five existing RL tools applied to a gold-standard dataset from real health-related databases, Tucuxi-BLAST had the highest accuracy and speed. By repurposing genomic tools, Tucuxi-BLAST can improve data-driven medical research and provide a fast and accurate way to link individual information across several administrative databases.
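The idea of encoding identification records into DNA can be illustrated with a minimal sketch. The abstract does not give the tool's actual encoding table, so the 2-bit-per-base mapping below is a hypothetical stand-in. The function names `encode_to_dna` and `identity` are likewise illustrative and are not part of Tucuxi-BLAST. The ungapped identity score is only a crude proxy for a BLASTn alignment. It shows that a single typographical error changes only a small part of the encoded sequence.

```python
# Hypothetical mapping: 2 bits per nucleotide. This is NOT the encoding
# table used by Tucuxi-BLAST; it only illustrates the general idea.
BASES = "ACGT"


def encode_to_dna(text: str) -> str:
    """Encode each UTF-8 byte of the record as four nucleotides."""
    out = []
    for byte in text.upper().encode("utf-8"):
        # Walk the byte from its highest bit pair to its lowest.
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def identity(a: str, b: str) -> float:
    """Fraction of matching positions, relative to the longer sequence.

    Ungapped comparison only; a real aligner such as BLASTn also
    handles insertions and deletions.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


# Two made-up records that differ by one typed character.
record_a = encode_to_dna("MARIA SILVA 1980-01-02")
record_b = encode_to_dna("MARIA SILVA 1980-01-03")
score = identity(record_a, record_b)
```

Under these assumptions, the two encoded records stay nearly identical despite the typo, which is the property a sequence aligner can exploit when linking records.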
Authors
Araujo, Jose Deney; Santos-e-Silva, Juan Carlo; Costa-Martins, Andre Guilherme; Sampaio, Vanderson; de Castro, Daniel Barros; de Souza, Robson F; Giddaluru, Jeevan; Ramos, Pablo Ivan P; Pita, Robespierre; Barreto, Mauricio L;
Associated Project
User-friendly computational Tools
List of services
- Gene regulatory and signaling networks exhibit distinct topological distributions of motifs.
- Gene signatures of autopsy lungs from obese patients with COVID-19.
- Network Medicine: Methods and Applications
- ACE2 Expression Is Increased in the Lungs of Patients With Comorbidities Associated With Severe COVID-19.
- Drug repositioning for psychiatric and neurological disorders through a network medicine approach.
- Linking proteomic alterations in schizophrenia hippocampus to NMDAr hypofunction in human neurons and oligodendrocytes.
- In-depth analysis of laboratory parameters reveals the interplay between sex, age, and systemic inflammation in individuals with COVID-19.
- The evolution of knowledge on genes associated with human diseases
- Network vaccinology.
- Pyruvate kinase M2 mediates IL-17 signaling in keratinocytes driving psoriatic skin inflammation
- Transcriptome analysis of six tissues obtained post-mortem from sepsis patients
- Gene Signatures of Symptomatic and Asymptomatic Clinical-Immunological Profiles of Human Infection by Leishmania (L.) chagasi in Amazonian Brazil