Research & Innovation

Roman Prokofyev, Djellel Difallah, Michael Luggen, Philippe Cudre-Mauroux

Webpages are an abundant source of textual information with manually annotated entity links, and are often used as a source of training data for a wide variety of machine learning NLP tasks. However, manual annotations such as those found on Wikipedia are sparse, noisy, and biased towards popular entities. Existing entity linking systems deal with these issues by relying on simple statistics extracted from the data. While such statistics can effectively handle noisy annotations, they introduce a bias towards head entities and are ineffective for long-tail (i.e., unpopular) entities. In this work, we first analyze the statistical properties of manual annotations by studying a large annotated corpus composed of all English Wikipedia webpages, together with all pages from the CommonCrawl containing English Wikipedia annotations. We then propose and evaluate a series of entity linking approaches, with the explicit goal of creating highly accurate (precision > 95%) and broad annotated corpora for machine learning tasks. Our results show that our best approach achieves maximal precision at usable recall levels, and outperforms both state-of-the-art entity linking systems and human annotators.
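To make the link-statistics baseline mentioned in the abstract concrete, here is a minimal Python sketch of precision-oriented entity linking based on anchor-text priors; the mention counts, entity names, and thresholding heuristic are illustrative assumptions, not the authors' actual system.

from collections import Counter

# Hypothetical anchor-text statistics (mention -> entity counts), of the
# kind aggregated from annotated Wikipedia/CommonCrawl pages; the numbers
# below are made up for illustration.
anchor_counts = {
    "paris":  Counter({"Paris": 950, "Paris_(mythology)": 10, "Paris,_Texas": 6}),
    "jaguar": Counter({"Jaguar_Cars": 510, "Jaguar": 388}),
}

def link_mention(mention, min_precision=0.95):
    """Link a mention to its most frequent target entity, but abstain
    whenever the link prior falls below the precision threshold
    (trading recall for >95% precision, as the abstract targets)."""
    counts = anchor_counts.get(mention.lower())
    if not counts:
        return None
    entity, freq = counts.most_common(1)[0]
    prior = freq / sum(counts.values())
    return entity if prior >= min_precision else None

print(link_mention("Paris"))   # 'Paris'  (prior ~0.98, above threshold)
print(link_mention("jaguar"))  # None     (ambiguous mention, abstain)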

Kris McGlinn, Christophe Debruyne, Lorraine McNerney, Declan O’Sullivan
Isaiah Onando Mulang', Kuldeep Singh, Fabrizio Orlandi

Research has seen considerable achievements concerning the translation of natural language patterns into formal queries for Question Answering over Knowledge Graphs (KG). The main challenge lies in identifying which property within the Knowledge Graph matches the predicate found in a Natural Language (NL) relation. Current approaches for formal query generation attempt to resolve this problem mainly by first retrieving the named entity from the KG together with a list of its predicates, and then filtering one out of all the predicates of the entity. We instead propose an approach that directly matches an NL predicate to KG properties and can be employed within QA pipelines. In this paper, we specify a systematic approach and provide a tool that can be employed to solve this task. Our approach models KG relations with their underlying parts of speech, and enhances this representation with extra attributes obtained from WordNet and dependency parsing characteristics. From a question, we build a similar representation of the query relation. We then define distance measures between the query relation and the property representations from the KG to identify which property is referred to by the relation within the query. Our evaluation reports substantive recall values and considerable accuracy.
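As a rough illustration of matching an NL relation to KG properties with lexical resources, the following sketch ranks candidate properties by WordNet path similarity; the property list and the reliance on path similarity alone are assumptions for illustration (the paper additionally uses parts of speech and dependency features).

# Requires NLTK with the WordNet corpus downloaded: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Hypothetical KG property labels
kg_properties = ["spouse", "birthplace", "author", "employer"]

def best_similarity(word_a, word_b):
    """Maximum WordNet path similarity over all synset pairs (0 if none)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(word_a)
              for s2 in wn.synsets(word_b)]
    return max(scores, default=0.0)

def match_property(nl_predicate):
    """Pick the KG property closest to the NL relation word."""
    return max(kg_properties, key=lambda p: best_similarity(nl_predicate, p))

print(match_property("wife"))    # 'spouse' -- wife is a hyponym of spouse
print(match_property("writer"))  # 'author' -- shared synset in WordNet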

Najmeh Mousavi Nejad, Simon Scerri, Sören Auer

With the omnipresent availability and use of cloud services, software tools, and Web portals or services, the legal contracts regulating their use, in the form of license agreements or terms and conditions, are of paramount importance. The textual documents describing these regulations often comprise many pages and cannot reasonably be assumed to be read and understood by humans. In this work, we describe a method for extracting and clustering the relevant parts of such documents, including permissions, obligations, and prohibitions. The clustering is based on semantic similarity, employing a distributional semantics approach over a large database of word embeddings. An evaluation shows that our method can significantly improve human comprehension, and that improved feature-based clustering has the potential to further reduce the time required for EULA digestion. Our implementation is available as a Web service, which can be used directly to process and prepare legal usage contracts.
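The clustering pipeline implied by the abstract can be sketched as: embed each extracted clause by averaging its word vectors, then cluster the embeddings. In the sketch below, a toy random vocabulary stands in for the large word-embeddings database and the clause texts are invented, so the grouping is only structural, not meaningful.

import numpy as np
from sklearn.cluster import KMeans

# Toy random vectors standing in for a large embeddings database
# (in practice, e.g. word2vec or GloVe vectors would be loaded).
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in
         "may copy backup must not redistribute licensee shall pay fees".split()}

clauses = [
    "you may copy the software for backup purposes",   # permission
    "you must not redistribute the software",          # prohibition
    "the licensee shall pay all applicable fees",      # obligation
]

def embed(text):
    """Average the vectors of in-vocabulary tokens (zero vector if none)."""
    vecs = [vocab[w] for w in text.split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

X = np.vstack([embed(c) for c in clauses])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)  # one cluster id per clause, e.g. [0 1 2]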

Sebastian Bader, Jan Oevermann

Service technicians in the domain of industrial maintenance require extensive technical knowledge and experience to complete their tasks. Some of the needed knowledge is available as document-based technical manuals or as service reports from previous deployments. Unfortunately, due to the large amount of data, service technicians spend a considerable share of their working time searching for the correct information. A further challenge is that valuable insights from operation reports remain unexploited due to insufficient textual quality and content-wise ambiguity. In this work, we propose a framework to annotate and integrate these heterogeneous data sources and make them available as information units using Linked Data technologies. We use machine learning to modularize and classify information from technical manuals, together with ontology-based autocompletion to enrich service reports with clearly defined concepts. By combining these two approaches, we can provide a unified and structured interface for both manual and automated querying. We verify our approach by measuring precision and recall on typical retrieval tasks for service technicians, and show that our framework can provide substantial improvements for service and maintenance processes.
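A minimal sketch of the ontology-based autocompletion step for service reports is shown below; the concept identifiers, labels, and prefix-matching heuristic are illustrative assumptions rather than the framework's actual implementation.

# Hypothetical concept labels drawn from a maintenance ontology;
# autocompletion maps free text typed by a technician onto clearly
# defined concepts, reducing ambiguity in service reports.
concepts = {
    "HydraulicPump": ["hydraulic pump", "pump, hydraulic"],
    "PressureValve": ["pressure valve", "relief valve"],
    "OilFilter":     ["oil filter"],
}

def autocomplete(prefix, limit=5):
    """Suggest concepts whose labels start with the typed prefix."""
    prefix = prefix.lower()
    hits = [(concept, label)
            for concept, labels in concepts.items()
            for label in labels
            if label.startswith(prefix)]
    return hits[:limit]

print(autocomplete("pres"))  # [('PressureValve', 'pressure valve')]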

Jan Voskuil

Linked Data and the Semantic Web have generated interest in the Netherlands from the very beginning. Sporting several renowned research centers and some widely published early application projects, the Netherlands is home to Platform Linked Data Nederland, a grass-roots movement that promotes Linked Data technologies and functions as a marketplace for exchanging ideas and experiences.

Georgios Santipantakis, George Vouros, Christos Doulkeridis, Akrivi Vlachou, Gennady Andrienko, Natalia Andrienko, Jose Manuel Cordero, Miguel Garcia Martinez

Motivated by real-life emerging needs in critical domains, this paper proposes a coherent and generic ontology for the representation of semantic trajectories, in association with related events and contextual information. The main contribution of the proposed ontology is the representation of semantic trajectories at varying, interlinked levels of spatio-temporal analysis. The paper presents the ontology in detail, also in connection with other well-known ontologies, and demonstrates how exploiting data at varying levels of granularity enables data transformations that support visual analytics tasks in the air-traffic management domain.
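To illustrate the idea of interlinked levels of analysis, here is a small rdflib sketch that connects a raw timestamped position to a higher-level trajectory segment associated with a domain event; the namespace and all terms are illustrative placeholders, not the paper's actual ontology.

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/trajectory#")
g = Graph()
g.bind("ex", EX)

# Low level: a raw, timestamped position that is part of a trajectory
g.add((EX.traj1, RDF.type, EX.Trajectory))
g.add((EX.p1, RDF.type, EX.RawPosition))
g.add((EX.p1, EX.hasTime, Literal("2018-04-01T10:00:00")))
g.add((EX.p1, EX.partOf, EX.traj1))

# Higher level: a semantic segment that aggregates raw positions and
# is linked to an event (e.g. a holding pattern in air traffic)
g.add((EX.seg1, RDF.type, EX.TrajectorySegment))
g.add((EX.seg1, EX.aggregates, EX.p1))
g.add((EX.seg1, EX.partOf, EX.traj1))
g.add((EX.seg1, EX.associatedEvent, EX.holdingPattern1))

print(g.serialize(format="turtle"))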

Owen Sacco

Creating video games that are competitive in the market costs time, effort, and resources that small and medium enterprises, especially independent game development studios, often cannot afford. As most of the tasks involved in developing games are labour- and creativity-intensive, our vision is to reduce software development effort and enhance design creativity by automatically generating novel and semantically enriched game content from Web sources. In particular, this paper presents a vocabulary that defines detailed properties for describing video game character information extracted from sources such as fansites, in order to create game character models. These character models can then be reused or merged to create new, unconventional game characters.
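The reuse-and-merge idea can be sketched with rdflib: build a small RDF model per extracted character, then take the union of two graphs as the starting point for a hybrid character. The namespace and property names below are illustrative stand-ins for the paper's vocabulary.

from rdflib import Graph, Literal, Namespace, RDF

GC = Namespace("http://example.org/gamecharacter#")

def character_model(name, fighting_style):
    """Build a minimal RDF model for one extracted game character."""
    g = Graph()
    node = GC[name.lower()]
    g.add((node, RDF.type, GC.Character))
    g.add((node, GC.name, Literal(name)))
    g.add((node, GC.fightingStyle, Literal(fighting_style)))
    return g

# Merging two character models yields a single graph from which a new,
# unconventional character could be derived.
merged = character_model("Ryu", "Ansatsuken") + character_model("Ken", "Ansatsuken")
print(merged.serialize(format="turtle"))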

Alex Olieman, Kaspar Beelen, Jaap Kamps, Milan van Lange

We investigate a Digital Humanities use case in which scholars spend a considerable amount of time selecting relevant source texts. We developed WideNet, a semantically enhanced search tool that leverages the strengths of (imperfect) entity linking (EL) without getting in the way of its expert users. We evaluate this tool in two historical case studies aimed at collecting references to historical periods in parliamentary debates from the last two decades; the first targeted the Dutch Golden Age, and the second World War II. The case studies conclude with a critical reflection on the utility of WideNet for this kind of research, after which we outline how such a real-world application can help to improve EL technology in general.
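A simplified sketch of the period-centred retrieval WideNet enables: given a seed set of entities associated with a period, collect debate passages whose (imperfect) entity links intersect that set, leaving verification to the expert user. The seed entities and passages below are invented for illustration.

# Hypothetical seed entities for the Dutch Golden Age
golden_age = {"Rembrandt", "Dutch_East_India_Company", "Michiel_de_Ruyter"}

# Parliamentary debate passages with (possibly imperfect) entity links
debates = [
    {"text": "The spirit of Rembrandt lives on in our cultural policy.",
     "links": {"Rembrandt"}},
    {"text": "We now turn to the 2003 budget discussion.",
     "links": {"Budget"}},
]

def passages_for_period(seed_entities, passages):
    """Return passages that refer to the period through at least one
    linked entity; an expert user then confirms or rejects each hit."""
    return [p for p in passages if p["links"] & seed_entities]

for hit in passages_for_period(golden_age, debates):
    print(hit["text"])  # only the Rembrandt passage matches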

