Tucuxi-BLAST: Enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
Background. Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge.
Methods. We present Tucuxi-BLAST, a versatile tool for probabilistic RL that uses a DNA-encoded approach to encrypt, analyze and link massive administrative databases. Tucuxi-BLAST encodes the identification records into DNA. The BLASTn algorithm is then used to align the sequences between databases. We tested and benchmarked the tool on a simulated database containing records for 300 million individuals and on four large administrative databases containing real data on Brazilian patients.
Results. Our method was able to overcome misspellings and typographical errors in administrative databases. For the largest simulated dataset (200k records), the state-of-the-art method took 5 days and 7 h to perform the RL, while Tucuxi-BLAST took only 23 h. When compared with five existing RL tools applied to a gold-standard dataset from real health-related databases, Tucuxi-BLAST had the highest accuracy and speed. By repurposing genomic tools, Tucuxi-BLAST can improve data-driven medical research and provide a fast and accurate way to link individual information across several administrative databases.
Authors
Araujo, Jose Deney; Santos-e-Silva, Juan Carlo; Costa-Martins, Andre Guilherme; Sampaio, Vanderson; de Castro, Daniel Barros; de Souza, Robson F; Giddaluru, Jeevan; Ramos, Pablo Ivan P; Pita, Robespierre; Barreto, Mauricio L;
Associated Project
User-friendly computational Tools
List of services
- Genomic analyses reveal broad impact of miR-137 on genes associated with malignant transformation and neuronal differentiation in glioblastoma cells.
- RNA-Binding Protein Musashi1 Is a Central Regulator of Adhesion Pathways in Glioblastoma.
- MicroRNA Transcriptome Profiling in Heart of Trypanosoma cruzi-Infected Mice: Parasitological and Cardiological Outcomes.
- Genome mapping and expression analyses of human intronic noncoding RNAs reveal tissue-specific patterns and enrichment in genes related to regulation of transcription.
- Antimicrobial peptide LL-37 participates in the transcriptional regulation of melanoma cells.
- Down-regulation of 14q32-encoded miRNAs and tumor suppressor role for miR-654-3p in papillary thyroid cancer.
- Integration of miRNA and gene expression profiles suggest a role for miRNAs in the pathobiological processes of acute Trypanosoma cruzi infection.
- Integrative Biology Approaches Applied to Human Diseases
- Proteomics reveals disturbances in the immune response and energy metabolism of monocytes from patients with septic shock.
- Genomics, epigenomics and pharmacogenomics of Familial Hypercholesterolemia (FHBGEP): A study protocol.
- Melatonin-Index as a biomarker for predicting the distribution of presymptomatic and asymptomatic SARS-CoV-2 carriers
- Profiling plasma-extracellular vesicle proteins and microRNAs in diabetes onset in middle-aged male participants in the ELSA-Brasil study.
- Big Data and machine learning in cancer theranostics
- Genomic positional conservation identifies topological anchor point RNAs linked to developmental loci.
- Integrative systems immunology uncovers molecular networks of the cell cycle that stratify COVID-19 severity