Efficient Storage of Genomic Sequences in High Performance Computing Systems

Guerra Soler, Aníbal José

Please use this identifier to cite or link to this item: https://hdl.handle.net/10495/12525

Title:	Efficient Storage of Genomic Sequences in High Performance Computing Systems
Author:	Guerra Soler, Aníbal José
Advisors:	Isaza Ramírez, Sebastián; Aedo Cobo, José Edinson
Subjects:	Performance - evaluation; Genomic sequences; Parallel computing; Reads alignment; Reads compression; Referential compression; SIMD programming; http://id.loc.gov/authorities/subjects/sh2010105499
Publication date:	2019
Citation:	Guerra-Soler, A. J. (2019). Efficient Storage of Genomic Sequences in High Performance Computing Systems. (Doctoral thesis). Universidad de Antioquia. Medellín, Colombia.
Abstract:	In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance on genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still fall below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims to improve the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture, in order to develop a parallel version of the UdeACompress algorithms that reduces the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress: suffix array construction.
Appears in collections:	Doctorados de la Facultad de Ingeniería

Files in this item:

File	Description	Size	Format
GuerraSolerAnibal_2019_EfficientStorageGenomic.pdf	Doctoral thesis	6.88 MB	Adobe PDF

This item is licensed under a Creative Commons license.