Scalable Semantic Version Control for Linked Data Management
Linked Data is the Semantic Web’s established standard for publishing and interlinking data. When authors collaborate on a data set, distributing and tracking changes are crucial. Several approaches to version control have been presented in this area, each focusing on different aspects and subject to different limitations. In this project we work on a solution for semantic version control of dynamic Linked Data based on a delta-based storage strategy and on-demand reconstruction of historic versions. The approach is intended for large data sets and should support targeted and cross-version queries.
Linked Data provides essential mechanisms to efficiently interlink and integrate data using RDF as its base model. RDF stores information as a directed graph. Edges are defined by triples consisting of subject, predicate and object; nodes are defined implicitly through the edges and are referenced by URIs. Edges can be grouped into named graphs to facilitate administration or to store additional information by assigning a context, which turns triples into quads. The SPARQL 1.1 Query Language queries Linked Data using pattern matching, filtering, aggregation and even distributed query execution to query several data sources at once. SPARQL 1.1 Update can be used to manipulate the triples inside a triple store.
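The triple and quad model described above can be illustrated with a minimal in-memory sketch. It is not a real triple store and not SPARQL: the `Quad` type, the `match` helper and the `ex:` and `foaf:` example terms are assumptions chosen for illustration, and `None` plays the role of a query variable.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Quad(NamedTuple):
    """A triple (subject, predicate, object) plus its context (named graph)."""
    subject: str
    predicate: str
    obj: str
    graph: str


def match(quads: Iterable[Quad],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None,
          graph: Optional[str] = None) -> Iterator[Quad]:
    """Yield all quads that fit the pattern; None acts as a wildcard."""
    for q in quads:
        if subject is not None and q.subject != subject:
            continue
        if predicate is not None and q.predicate != predicate:
            continue
        if obj is not None and q.obj != obj:
            continue
        if graph is not None and q.graph != graph:
            continue
        yield q


# Nodes are never declared on their own; they exist only through the edges.
store = [
    Quad("ex:alice", "foaf:knows", "ex:bob", "ex:graph1"),
    Quad("ex:bob", "foaf:knows", "ex:carol", "ex:graph1"),
    Quad("ex:alice", "foaf:name", '"Alice"', "ex:graph2"),
]

# Pattern: ?s foaf:knows ?o, restricted to the named graph ex:graph1.
knows = list(match(store, predicate="foaf:knows", graph="ex:graph1"))

# The set of nodes is derived implicitly from the edges.
nodes = {q.subject for q in store} | {q.obj for q in store}
```

Restricting a pattern to one context is what makes named graphs useful for administration: the same triple pattern can be evaluated per graph without changing the data itself.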
A feature not yet covered by the Linked Data standard is version control. Especially when several authors are involved (which is typically the case for the data volumes addressed by Linked Data), tracking and distributing changes and rolling back to previous revisions are crucial aspects of any kind of data management. Recent research projects have produced several approaches to version control of Linked Data, each with a focus on different aspects. They cover versioning of OWL ontologies, lightweight RDFS ontologies and Linked Data, support different workflows and enable knowledge workers to run different query types on versioned data. Some are limited in scalability with respect to the number of triples, the number of versions, or space efficiency. Some solutions hide the version information from the Linked Data layer, which prevents SPARQL queries from accessing it.
Since Linked Data is designed to handle and publish large data sets, we focus on scalable semantic version control. This comes with new limitations, especially regarding the variety of query types that can be handled. Because of the amounts of data we want to handle, we use a delta-based strategy. For querying, the triples of historic versions that are needed for query evaluation are reconstructed on demand. The cost is that global queries (which require the whole data set of a specific version to be constructed) can hardly be supported using a purely delta-based storage. Versions are constructed very frequently, so version construction performance is the critical factor.
Since the latest versions are likely to be used more often within queries, some of these versions could be materialised. This hybrid storage could increase efficiency, since for newer versions only a small part of the history has to be evaluated backwards instead of walking through the whole history and looking for the first or last change made to a specific triple.
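The difference between the two storage strategies can be sketched for a single targeted lookup ("does this triple exist in version v?"). The sketch below assumes a linear history, an empty initial state, and minimal deltas (a delta only adds triples that were absent and only deletes triples that were present); `Delta`, `exists_from_deltas` and `exists_from_head` are hypothetical names, not part of any existing system.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    """Changes introduced by one version relative to its predecessor."""
    added: set = field(default_factory=set)
    deleted: set = field(default_factory=set)


def exists_from_deltas(history, version, triple):
    """Pure delta storage: walk back from `version` to the last change."""
    for delta in reversed(history[: version + 1]):
        if triple in delta.added:
            return True
        if triple in delta.deleted:
            return False
    return False  # never added up to this version


def exists_from_head(history, head, version, triple):
    """Hybrid storage: start at the materialised head, undo newer deltas."""
    present = triple in head
    for delta in reversed(history[version + 1:]):
        if triple in delta.added:
            present = False   # undoing an add: absent before this delta
        elif triple in delta.deleted:
            present = True    # undoing a delete: present before this delta
    return present


a = ("ex:s", "ex:p", "ex:a")
b = ("ex:s", "ex:p", "ex:b")
c = ("ex:s", "ex:p", "ex:c")

history = [
    Delta(added={a, b}),   # version 0
    Delta(deleted={a}),    # version 1
    Delta(added={c}),      # version 2
    Delta(added={a}),      # version 3 (latest)
]
head = {a, b, c}           # materialised state of version 3

# Version 3 is answered from the head without touching any delta,
# while version 0 requires undoing three deltas.
latest = exists_from_head(history, head, 3, a)
oldest = exists_from_head(history, head, 0, c)
```

The backward walk is only correct if the deltas are minimal: undoing an "add" of a triple that was already present would wrongly report it as absent in the older version.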
The goal of this research project is to propose an approach for scalable semantic version control of dynamic Linked Data, based on partial on-demand reconstruction of historic versions for queries and on update mechanisms that store only the deltas for new versions. We focus on query optimization for targeted queries on arbitrary versions stored in a delta-based, distributed storage. The version control information itself should be accessible by SPARQL queries to support cross-version queries.
Methodology and Progress
We analyze existing approaches and their performance bottlenecks for large data sets. We discuss these bottlenecks and develop possible solutions. We use the open source triple store BlazeGraph and add a prototypical implementation of our approach to its query engine. The performance of the implemented approach will be evaluated with several data sets and cluster configurations on the Amazon Elastic Compute Cloud, which lets us easily evaluate our approach with many different hardware configurations and many different sizes of the triple store cluster.
First results were already presented at the 2nd Workshop on Linked Data Quality, co-located with the 12th Extended Semantic Web Conference (ESWC 2015).
Problems and Solution Ideas
We handle versioning at the level of complete graphs and model the delta sets as Linked Data. Each version of a graph is referenced by a commit and, except for the first one, each commit references (ref) a previous commit (prev). Since several commits can reference the same previous commit, it is possible to work on branches in parallel. Merging two branches is done by creating a commit that references two previous commits. All edges are defined by triples, and as these triples belong to the same graph, we can store them as quads and use the context information as the identifier of the triple. A commit adds or removes triples by means of edges that reference these triples and have a predicate of type add or delete. Merge commits store only those triple modifications that are ambiguous and would lead to conflicts. We store branches and tags as edges that link a branch or tag URI to its current commit. The branches and tags are in turn referenced by an edge connecting them to a graph. The following figure shows an example instance of this model.
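A small instance of such a commit graph can be sketched as plain edges. The predicate names `vc:prev`, `vc:add`, `vc:delete` and `vc:head`, the identifiers `c:1` to `c:4`, `t:1` to `t:3`, `b:master` and `b:feature`, and the helper functions are all assumptions made for illustration; the actual vocabulary of the model is not given in the text. Merge commits are left out, because resolving two previous commits needs conflict handling beyond this sketch.

```python
# Data triples; each is stored in its own context, which serves as its ID.
data = {
    "t:1": ("ex:alice", "foaf:knows", "ex:bob"),
    "t:2": ("ex:alice", "foaf:knows", "ex:carol"),
    "t:3": ("ex:bob", "foaf:knows", "ex:carol"),
}

# Version control information, itself expressed as edges (triples).
vc = [
    ("c:1", "vc:add", "t:1"),            # first commit, no predecessor
    ("c:2", "vc:prev", "c:1"),
    ("c:2", "vc:add", "t:2"),
    ("c:3", "vc:prev", "c:2"),
    ("c:3", "vc:delete", "t:1"),
    ("c:4", "vc:prev", "c:2"),           # second successor of c:2: a branch
    ("c:4", "vc:add", "t:3"),
    ("b:master", "vc:head", "c:3"),      # branches point to their commit
    ("b:feature", "vc:head", "c:4"),
]


def objects(subject, predicate):
    """All objects of edges with the given subject and predicate."""
    return [o for s, p, o in vc if s == subject and p == predicate]


def reconstruct(commit):
    """Rebuild the full triple set of a (non-merge) commit."""
    chain = []
    while commit is not None:
        chain.append(commit)
        prev = objects(commit, "vc:prev")
        commit = prev[0] if prev else None
    ids = set()
    for c in reversed(chain):            # apply deltas oldest first
        ids |= set(objects(c, "vc:add"))
        ids -= set(objects(c, "vc:delete"))
    return {data[i] for i in ids}


master = reconstruct(objects("b:master", "vc:head")[0])
feature = reconstruct(objects("b:feature", "vc:head")[0])
```

Because the commits, deltas and branch pointers are ordinary edges, the same pattern matching used for the data can also be applied to the version history, which is what makes cross-version queries possible in principle.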
Update queries have to be modified, since only those triples that describe the delta to the previous version are added to the triple store. If we add a triple, we have to verify that it does not already exist in the current version; if we delete a triple, we have to verify that it exists. Otherwise we would add unnecessary triples to the history. Moreover, if we use a hybrid storage strategy and go back in history to check whether a triple exists in a specific version, such unnecessary entries would lead to incorrect results. The queries therefore have to be rewritten, and the rewritten queries have to be executed efficiently for the approach to be scalable.
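The check described above amounts to filtering a requested update against the current version before anything is written to the history. The following is a minimal sketch over in-memory sets; `effective_delta` is a hypothetical helper, and treating a triple that is both deleted and inserted in the same request as "present afterwards" is an assumption that follows the delete-before-insert order of SPARQL 1.1 Update.

```python
def effective_delta(current, to_add, to_delete):
    """Reduce a requested update to the changes that alter `current`.

    Returns (added, deleted): only triples that are really new and only
    triples that are really removed, so the history stays minimal.
    """
    to_add = set(to_add)
    to_delete = set(to_delete)
    # Adding a triple that already exists changes nothing.
    added = to_add - current
    # Deleting a missing triple changes nothing; a triple that is deleted
    # and re-inserted in the same request is still present afterwards.
    deleted = (to_delete & current) - to_add
    return added, deleted


a = ("ex:s", "ex:p", "ex:a")
b = ("ex:s", "ex:p", "ex:b")
c = ("ex:s", "ex:p", "ex:c")
d = ("ex:s", "ex:p", "ex:d")

current = {a, b}
added, deleted = effective_delta(current, to_add={b, c}, to_delete={a, d})

# Applying the filtered delta gives the same state as the raw request.
new_version = (current - deleted) | added
```

In a real triple store this filtering would have to be expressed by rewriting the update query itself, and the existence check against the current version is exactly the step whose cost determines whether the approach scales.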
For the evaluation we use multiple implementations based on two kinds of delta-based storage and one hybrid storage.
We use multiple test data sets with different numbers of triples, commits, branches and changes per triple. The number of queries that can run in parallel is also evaluated.
As metrics for the evaluation we use the average response time of queries and the number of triples that are stored additionally in the triple store.
As benchmarks we use version control solutions that are semantic or scalable, and a triple store without versioning support.
- Hauptmann, C., Brocco, M., Wörndl, W.: Scalable Semantic Version Control for Linked Data Management. In: Rula, A., Zaveri, A., Knuth, M., Kontokostas, D. (eds.) Proceedings of the 2nd Workshop on Linked Data Quality co-located with 12th Extended Semantic Web Conference (ESWC 2015). CEUR Workshop Proceedings, Portorož, Slovenia (2015)