Please use this identifier to cite or link to this item: http://hdl.handle.net/10495/12525
Title: Efficient Storage of Genomic Sequences in High Performance Computing Systems
Author: Guerra Soler, Aníbal José
Advisors: Isaza Ramírez, Sebastián
Aedo Cobo, José Edinson
Subjects: Genomic sequences
Parallel computing
Performance evaluation
Reads alignment
Reads compression
Referential compression
SIMD programming
Publication date: 2019
Citation: Guerra-Soler, A. J. (2019). Efficient Storage of Genomic Sequences in High Performance Computing Systems. (Doctoral thesis). Universidad de Antioquia. Medellín, Colombia.
Abstract: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance on genomic data, hence the need for specialized compression solutions. Two approaches have emerged to exploit the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still fall short of what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for compressing raw genomic data, which led us to identify the main needs and opportunities in this field. Based on these findings, we propose a novel compression workflow that aims to improve the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared with the best state-of-the-art compressors, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce their runtime. Through SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, suffix array construction.
Appears in collections: Doctorados en Ingeniería

Files in this item:
File: GuerraSolerAnibal_2019_EfficientStorageGenomic.pdf
Description: Doctoral thesis
Size: 6.88 MB
Format: Adobe PDF

This item is subject to a Creative Commons license.

