Wouter Beek, Javier D. Fernández, Ruben Verborgh
Many Data Scientists make use of Linked Open Data. However, most scientists restrict their analyses to one or two datasets (often DBpedia). One reason for this lack of variety has been the complexity and cost of running large-scale triple stores, graph stores, or property graphs. With Header Dictionary Triples (HDT) and Linked Data Fragments (LDF), the cost of Linked Data publishing has been significantly reduced. Still, Data Scientists who wish to run large-scale analyses need to query many LDF endpoints and integrate the results. Using recent innovations in data storage, compression, and dissemination, we are able to compress (a large subset of) the LOD Cloud into a single file, which we call LOD-a-lot. Because it is just one file, LOD-a-lot can be easily downloaded and shared, and it can be queried locally or through an LDF endpoint. In this paper we identify several categories of use cases that previously required an expensive and complicated setup, but that can now be run over a cheap and simple LOD-a-lot file. LOD-a-lot does not expose the same functionality as a full-blown database suite, mainly offering Triple Pattern Fragments. Despite these limitations, this paper shows that a surprisingly wide range of Data Science use cases can be performed over a LOD-a-lot file. For these use cases, LOD-a-lot significantly reduces the cost and complexity of doing Data Science.
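A minimal sketch of how such a single file could be queried locally with triple pattern lookups, assuming the pyHDT Python bindings (PyPI package "hdt") and a locally downloaded file named "lod-a-lot.hdt" (the file name is illustrative):

```python
# Sketch: triple pattern queries over a local HDT file.
# Assumes the pyHDT package (pip install hdt) and that the
# LOD-a-lot file has been downloaded as "lod-a-lot.hdt".
from hdt import HDTDocument

doc = HDTDocument("lod-a-lot.hdt")

# Triple pattern: every triple whose predicate is rdfs:label.
# Empty strings act as wildcards.
triples, cardinality = doc.search_triples(
    "", "http://www.w3.org/2000/01/rdf-schema#label", "")
print(f"{cardinality} matching triples")

for s, p, o in triples:
    print(s, o)
    break  # just show the first match
```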
Alessandro Adamou, Mathieu D'Aquin, Carlo Allocca, Enrico Motta
In virtual data integration, the data reside at their original sources and are not copied to and transformed on a single platform, as in warehousing. Integration must therefore be performed at query execution time and relies on transforming the original query for the many target endpoints.
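To illustrate the mediator pattern described here, a minimal sketch that fans one (already rewritten) query out to several SPARQL endpoints at query time and merges the bindings; the endpoint URLs and the SPARQLWrapper dependency are assumptions, not part of the paper:

```python
# Sketch: virtual integration by sending one query to many
# endpoints at execution time and merging results, instead of
# copying data into a warehouse. Endpoint URLs are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINTS = [
    "https://dbpedia.org/sparql",
    "https://query.wikidata.org/sparql",
]

QUERY = "SELECT ?s WHERE { ?s a ?type } LIMIT 5"

def federated_select(query):
    results = []
    for url in ENDPOINTS:
        sparql = SPARQLWrapper(url)
        sparql.setQuery(query)        # a real mediator would first
        sparql.setReturnFormat(JSON)  # rewrite the query per target
        bindings = sparql.query().convert()["results"]["bindings"]
        results.extend(bindings)
    return results

print(len(federated_select(QUERY)))
```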
Iker Esnaola-Gonzalez, Jesús Bermúdez, Izaskun Fernandez, Santiago Fernandez, Aitor Arnaiz
Outlier detection in the preprocessing phase of Knowledge Discovery in Databases (KDD) has been a widely researched topic for many years. However, identifying the potential cause of an outlier remains an unsolved challenge, even though knowing the cause would be very helpful for deciding which actions to take after detection. Moreover, conventional outlier detection methods may still overlook outliers in certain complex contexts. In this article, Semantic Technologies are used to help overcome these problems through the proposed SemOD (Semantic Outlier Detection) Framework. This framework guides the data scientist towards the detection of certain types of outliers in Wireless Sensor Networks (WSNs). The feasibility of the approach has been tested on outdoor temperature sensors, and the results show that it is generic enough to be applied to different sensors, while improving the accuracy of outlier detection and spotting potential causes.
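For contrast with the semantics-aware approach proposed here, a minimal sketch of a conventional, purely statistical detector of the kind the abstract says may overlook contextual outliers; the readings and threshold are illustrative, and this is not the SemOD method itself:

```python
# Sketch: conventional z-score outlier detection on a stream of
# outdoor temperature readings. A contextually anomalous but
# statistically unremarkable value (e.g. a mild reading during a
# winter night) passes this test, which is the gap a semantic
# approach targets. Data and threshold are illustrative.
from statistics import mean, stdev

readings = [14.2, 13.8, 14.5, 14.1, 35.0, 13.9, 14.3]  # degrees C

def zscore_outliers(values, threshold=3.0):
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(zscore_outliers(readings, threshold=2.0))  # -> [35.0]
```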
Henning Petzka, Claus Stadler, Georgios Katsimpras, Bastian Haarmann, Jens Lehmann
The increasing availability of large amounts of Linked Data creates a need for software that allows for its efficient exploration. Systems that enable Faceted Browsing constitute a user-friendly solution, but they need to combine suitable choices for front end and back end. Since a generic solution must be adjustable to the dataset at hand, the underlying ontology and the characteristics of the knowledge graph raise several challenges and heavily influence the browsing experience. Consequently, understanding these challenges becomes an important matter of study. We present a benchmark on Faceted Browsing that allows systems to test their performance on specific choke points in the back end. Further, we address additional issues in Faceted Browsing that may be caused by problematic modelling choices within the underlying ontology.
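To make the back-end choke points concrete, a sketch of the kind of facet-count query such a system must answer at interactive speed; the class and property IRIs and the selection filter are illustrative, not taken from the benchmark:

```python
# Sketch: a typical facet-count query a Faceted Browsing back end
# must evaluate quickly. The pattern (count instances per facet
# value under the user's current selection) is one of the choke
# points such a benchmark exercises. IRIs are illustrative.
FACET_COUNT_QUERY = """
SELECT ?genre (COUNT(?film) AS ?n)
WHERE {
  ?film a <http://example.org/Film> ;
        <http://example.org/genre> ?genre ;
        <http://example.org/year>  ?year .
  FILTER (?year >= 2000)          # current user selection
}
GROUP BY ?genre
ORDER BY DESC(?n)
"""
```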
Harsh Thakkar, Yashwant Keswani, Mohnish Dubey, Jens Lehmann, Sören Auer
Knowledge graphs, usually modelled via RDF or property graphs, have gained importance over the past decade. To decide which Data Management Solution (DMS) performs best for a specific query load over a knowledge graph, benchmarks must be performed. Benchmarking is an extremely tedious task demanding repetitive manual effort; it is therefore advantageous to automate the whole process.
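A minimal sketch of the repetitive loop such automation replaces: replaying a fixed query load against a DMS endpoint over several runs and recording latencies. The endpoint URL and query are illustrative; a full harness would also manage data loading, warm-up, and result validation:

```python
# Sketch: automating one benchmark run against a DMS endpoint.
# URL and queries are illustrative.
import statistics
import time

import requests

ENDPOINT = "http://localhost:8890/sparql"   # illustrative endpoint
QUERIES = ["SELECT * WHERE { ?s ?p ?o } LIMIT 100"]
RUNS = 5

latencies = []
for _ in range(RUNS):
    for q in QUERIES:
        start = time.perf_counter()
        requests.get(ENDPOINT, params={"query": q},
                     headers={"Accept": "application/sparql-results+json"})
        latencies.append(time.perf_counter() - start)

print(f"mean latency: {statistics.mean(latencies):.3f}s, "
      f"max: {max(latencies):.3f}s")
```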
Ciro Baron Neto, Dimitris Kontokostas, Gustavo Publio, Diego Esteves, Amit Kirschenbaum, Sebastian Hellmann
Over the last decade, we have observed a steadily increasing number of RDF datasets made available on the Web of Data. The decentralized nature of the Web, however, makes it hard to identify all of these datasets. Moreover, even when downloadable data distributions are discovered, only insufficient metadata is available to describe them properly, posing barriers to their usefulness and reuse.
In this paper, we describe an attempt to exhaustively identify the whole Linked Open Data cloud by harvesting metadata from multiple sources, providing insights into duplicated data and the general quality of the available metadata. This was only made possible by using a probabilistic data structure called a Bloom filter. Finally, we enrich existing dataset metadata with our approach and republish it through a SPARQL endpoint.
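A minimal sketch of how a Bloom filter supports this kind of duplicate detection without storing every triple; the filter size, hash count, and sample triples are illustrative:

```python
# Sketch: a Bloom filter answering, in constant memory, whether a
# triple from one dataset was (probably) already seen in another.
# False positives are possible; false negatives are not.
# Size M and hash count K are illustrative.
import hashlib

M = 1 << 20   # number of bits
K = 4         # number of hash functions
bits = bytearray(M // 8)

def _positions(item):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

def add(item):
    for pos in _positions(item):
        bits[pos // 8] |= 1 << (pos % 8)

def probably_contains(item):
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(item))

add("<s> <p> <o> .")
print(probably_contains("<s> <p> <o> ."))   # True
print(probably_contains("<s> <p> <o2> ."))  # almost surely False
```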
Elisa Margareth Sibarani, Simon Scerri, Camilo Morales, Sören Auer, Diego Collarana
The rapid changes in the job market and the dramatic growth of Web usage have triggered the need to analyze online job adverts. This paper presents a quantitative method to infer employers' skill demand using co-word analysis based on skill keywords. These keywords are extracted automatically by an Ontology-Based Information Extraction (OBIE) method. An ontology called the Skills and Recruitment Ontology (SARO) has been developed to represent job postings in terms of the skills and competencies needed to fill a job role. During keyword extraction and annotation, we focus on job posting attributes and job-specific skills (Tool, Product, Topic). We present our system, in which the cross-sectional study is decoupled into two phases: (1) a customized pipeline for extracting information, producing matrices of co-occurrences and correlations; and (2) content analysis to visualize the structure and network of keywords. This method reveals the technical skills in demand together with their structure, exposing significant linkages. The evaluation of the OBIE method indicates promising results for automatic keyword indexing, with an overall strict F-measure of 79%. Using an ontology and reusing semantic categories enables other research groups to reproduce this method and its results.
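A minimal sketch of the co-occurrence step at the core of such a co-word analysis, counting how often two skill keywords appear in the same job posting; the postings and keywords below are illustrative, whereas in the paper the keywords come from OBIE over real job adverts:

```python
# Sketch: building a skill co-occurrence matrix from job postings,
# the core counting step of a co-word analysis. Data is illustrative.
from collections import Counter
from itertools import combinations

postings = [
    {"Python", "SPARQL", "RDF"},
    {"Python", "RDF"},
    {"Java", "SPARQL"},
]

cooc = Counter()
for skills in postings:
    for a, b in combinations(sorted(skills), 2):
        cooc[(a, b)] += 1

for (a, b), n in cooc.most_common(3):
    print(f"{a} + {b}: {n}")
```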
Tirthankar Dasgupta, Lipika Dey, Abir Naskar, Rupsa Saha
Guangyuan Piao
User modeling for individual users on the Social Web plays an important role and is a fundamental step towards personalization and recommendation. Recent studies have proposed different user modeling strategies, considering dimensions such as the temporal dynamics and semantics of user interests.
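One common way to fold the temporal dimension mentioned here into a user model is exponential decay of interest weights; a minimal sketch, with the half-life and the interaction log purely illustrative and not taken from the paper:

```python
# Sketch: a user-interest profile with exponential time decay, one
# simple instance of the "temporal dynamics" dimension in user
# modeling. Half-life and data are illustrative.
import math

HALF_LIFE_DAYS = 30.0

def decayed_profile(interactions, now_day):
    """interactions: list of (concept, day_of_interaction) pairs."""
    profile = {}
    for concept, day in interactions:
        weight = math.pow(2.0, -(now_day - day) / HALF_LIFE_DAYS)
        profile[concept] = profile.get(concept, 0.0) + weight
    return profile

log = [("SemanticWeb", 0), ("SemanticWeb", 80), ("Football", 85)]
print(decayed_profile(log, now_day=90))
```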
Michael Boniface
"The “Rio+20” United Nations Conference on Sustainable Development (UNCSD) focused on the ""Green economy"" as the main concept to fight poverty and achieve a sustainable way to feed the planet. For coastal countries, this concept translates into ""Blue economy"", the sustainable exploitation of marine environments to fulfill humanity needs for resources, energy, and food.