Please use this identifier to cite or link to this item: https://hdl.handle.net/10495/37581
Title: Automatic information extraction in business document
Author: Moreno Acevedo, Santiago Andres
Advisors: Orozco Arroyave, Juan Rafael
Vasquez Correa, Juan Camilo
Subjects: Natural Language Processing
Artificial Intelligence
Machine Learning
Deep Learning
Data Mining
Business Intelligence
Information Extraction
Publication date: 2023
Abstract: Information Extraction (IE) is an area of Natural Language Processing (NLP) that has gained interest in the research community because of its real-world applications, such as legal settings where document analysis is essential. So far, IE has mostly been studied in general contexts, using idealized data with many samples per class. Real-world contexts, however, rarely offer either large amounts of data or balanced classes, so models that can cope with these conditions are needed. This master's thesis investigates techniques and methods for handling limited and imbalanced data in NLP, with the goal of implementing them in a software tool that automatically extracts information from documents. To this end, two NLP tasks were studied: Named Entity Recognition (NER) and Relation Classification (RC). Both architectural and data-related methods were analyzed. To address class imbalance, several loss functions were explored in order to obtain a model that prioritizes samples that are hard to classify. Data augmentation strategies were also employed to address the limited-data problem. A methodology for NER was developed that integrates data augmentation strategies and the focal loss function into a benchmark model. For RC, a state-of-the-art architecture was identified that uses the focal loss function and performs well with limited data. The outcomes for both NER and RC were satisfactory. Finally, the NER methodology and the RC architecture were integrated into a software tool that performs NER and RC automatically on any given document. This work is the first stage in creating an automatic document analysis tool.
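The abstract names the focal loss as the function used to prioritize hard-to-classify samples. As a rough illustration only, the sketch below implements the standard multi-class focal loss, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), in plain Python. The function names, the example logits, and the values gamma = 2 and alpha = 1 are assumptions chosen for illustration; the record does not state the thesis's actual implementation or hyperparameters.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Standard focal loss for one sample.

    gamma and alpha are illustrative defaults (assumptions), not values
    taken from the thesis. With gamma = 0 and alpha = 1 this reduces to
    ordinary cross-entropy.
    """
    p_t = softmax(logits)[target]  # probability assigned to the true class
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(logits, target):
    """Cross-entropy, expressed as the gamma = 0 case of the focal loss."""
    return focal_loss(logits, target, gamma=0.0, alpha=1.0)


# Hypothetical samples whose true class is index 0.
easy = [4.0, 0.0, 0.0]  # model is confident and correct
hard = [0.0, 1.0, 0.0]  # model favors the wrong class

for name, logits in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    # fl / ce equals (1 - p_t) ** gamma: the factor by which the sample is down-weighted.
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f} weight={fl / ce:.4f}")
```

The modulating factor (1 - p_t)^gamma shrinks the loss of well-classified samples far more than that of misclassified ones, so the hard samples, which tend to come from minority classes, dominate the training signal.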
Appears in collections: Maestrías de la Facultad de Ingeniería

Files in this item:
File: MorenoSantiago_2023_InformationExtractionDeepLearningNaturalLanguageProcessing.pdf
Description: Master's thesis
Size: 2.39 MB
Format: Adobe PDF


This item is licensed under a Creative Commons license.