Ontology-driven information extraction
Date
2017-07-20
Abstract
Information Extraction (IE) consists of obtaining structured information from unstructured
and semi-structured sources. Existing solutions use advanced methods from
the fields of Natural Language Processing and Artificial Intelligence, but they usually
aim at solving sub-problems of IE, such as entity recognition, relation extraction
or co-reference resolution. However, in practice, it is often necessary to build
on the results of several tasks and arrange them in an intelligent way. Moreover,
Information Extraction now faces new challenges related to large-scale
collections of documents in complex formats beyond plain text.
An apparent limitation of existing work is the lack of a uniform representation
of the document analysis from multiple perspectives, such as semantic annotation
of text, structural analysis of the document layout and processing of the integrated
knowledge. Recent proposals for ontology-based Information Extraction do
not fully exploit the possibilities of ontologies, using them only as a reference
model for a single extraction method, such as semantic annotation, or for defining
the target schema for the extraction process.
In this thesis, we address the problem of Information Extraction from homogeneous
collections of documents, i.e., sets of files that share some common properties
with respect to the content or layout. We observe that interleaving semantic
and structural analysis can benefit the results of the IE process and propose an
ontology-driven approach that integrates and extends existing solutions.
The contributions of this thesis are both theoretical and practical. On the
theoretical side, we propose a model and a process of Semantic Information
Extraction that integrates techniques from semantic annotation of text, document
layout analysis, object-oriented modeling and rule-based reasoning. We adapt
existing solutions to enable their integration under a common ontological view
and advance the state of the art in the fields of semantic annotation and document
layout analysis. In particular, we propose a novel method for automatic lexicon
generation for semantic annotators, and an original approach to layout analysis,
based on common label identification and structure recognition. We design and
implement a framework named KnowRex that realizes the proposed methodology
and integrates the elaborated solutions.
Description
PhD Program in Mathematics and Computer Science (Dottorato di Ricerca in Matematica ed Informatica), Cycle XXIX
Keywords
Computer science, Information extraction, Ontologies