Automatic information extraction in business document

Moreno Acevedo, Santiago Andres

Please use this identifier to cite or link to this item: https://hdl.handle.net/10495/37581

Title :	Automatic information extraction in business document
Author :	Moreno Acevedo, Santiago Andres
Advisors :	Orozco Arroyave, Juan Rafael; Vasquez Correa, Juan Camilo
Subjects :	Natural Language Processing; Artificial Intelligence; Machine Learning; Deep Learning; Data Mining; Business Intelligence; Information Extraction
Publication date :	2023
Abstract :	Information Extraction (IE) is a topic of Natural Language Processing that has gained interest in the research community for its applications in real-world areas, such as legal settings, where the analysis of documents is very important. So far, IE has been extensively studied in general contexts with ideal data and many samples per class. However, real-world contexts offer neither large amounts of data nor balance among classes. Therefore, it is necessary to develop models that can handle real-world data problems. This master's thesis aims to investigate techniques and methods for handling limited and unbalanced data in Natural Language Processing (NLP) contexts. The goal is to implement these techniques and methods in a software tool that can automatically extract information from documents. With this aim, two NLP approaches were studied: Named Entity Recognition (NER) and Relation Classification (RC). Different methods were analyzed, including both architectural and data-related approaches. To address class imbalance, several loss functions were explored to create a model that prioritizes samples that are hard to classify. Additionally, data augmentation strategies were employed to address the limited-data problem. A methodology for NER was developed, integrating data augmentation strategies and the focal loss function into a benchmark model. For RC, we identified a state-of-the-art architecture that uses the focal loss function and performs well with limited data. The outcomes for NER and RC were satisfactory at the end of the work. Finally, both the NER methodology and the RC architecture were integrated into a software tool that enables automatic NER and RC tasks for any given document. This work is the first stage in creating an automatic document analysis tool.
Appears in collections:	Maestrías de la Facultad de Ingeniería

Files in this item:

File	Description	Size	Format
MorenoSantiago_2023_InformationExtractionDeepLearningNaturalLanguageProcessing.pdf	Master's thesis	2.39 MB	Adobe PDF

This item is licensed under a Creative Commons License.