About the project

Project Name: PROV4ITDaTa

Provenance-aware Querying and Generation for Interoperable and Transparent Data Transfer

Team: Ben De Meester, Pieter Heyvaert, Ruben Verborgh

Data portability enables users to control their data on the Web. Initiatives such as the Data Transfer Project take a step in this direction, providing an open-source, service-to-service data portability platform allowing individuals to move their data whenever they want.

However, because such efforts are hard-coded, they lack transparency and interoperability. On the one hand, when data is transferred between services, users have no way to verify that the transfer was of high quality: data provenance is unavailable, so assessment requires reviewing the source code. On the other hand, complying with this hard-coded platform requires development effort to create custom data models, and existing interoperability standards are neglected.

In this project, we propose an improved solution that is fully transparent and has fine-grained configuration to improve interoperability with other data models. To achieve this, we will exploit and advance the existing open-source tools RML.io and Comunica, and show the solution's extensibility by applying it directly to the Solid ecosystem. RML.io is a declarative and generic toolset to transparently generate Linked Data from (semi-)structured heterogeneous data, with automatic data provenance and fine-grained configuration to extract and transform (parts of) the source data. Using Comunica, we can query multiple intermediate datasets to transfer to a new service, generating a provenance trail of where the resulting data came from.
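The idea of a declarative mapping followed by a provenance-aware query can be illustrated with a minimal sketch. It does not use the RML.io or Comunica APIs; it is plain Python, and every name in it (`Triple`, `map_records`, `query`, the services `service-a` and `service-b`, the `example.org` URLs, and the sample records) is a hypothetical assumption chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: the (hypothetical) service this statement came from


def map_records(records, source, id_field, mapping, base):
    """Turn records into triples using a declarative field-to-predicate mapping.

    Only fields listed in `mapping` are extracted; each triple records its source.
    """
    triples = []
    for record in records:
        subject = f"{base}{record[id_field]}"
        for field, predicate in mapping.items():
            if field in record:
                triples.append(Triple(subject, predicate, str(record[field]), source))
    return triples


def query(datasets, predicate):
    """Collect (subject, value, source) for a predicate across several datasets."""
    return [
        (t.subject, t.obj, t.source)
        for dataset in datasets
        for t in dataset
        if t.predicate == predicate
    ]


NAME = "https://schema.org/name"

# Two intermediate datasets generated from two hypothetical services.
photos = map_records(
    [{"id": "1", "title": "Beach"}],
    source="service-a",
    id_field="id",
    mapping={"title": NAME},
    base="https://example.org/photo/",
)
tracks = map_records(
    [{"id": "7", "title": "Intro", "length": 90}],  # "length" is not mapped, so it is skipped
    source="service-b",
    id_field="id",
    mapping={"title": NAME},
    base="https://example.org/track/",
)

results = query([photos, tracks], NAME)
for subject, value, source in results:
    print(f"{subject} name={value!r} (from {source})")
```

Each result carries the service it originated from, which is the kind of information a provenance trail would let a user or auditor inspect before the data is transferred. In the actual project, the mapping and querying steps are handled by RML.io and Comunica respectively, with provenance expressed as structured Linked Data rather than a plain string.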

We improve current data portability approaches by combining RML.io and Comunica for a fully transparent transfer process, where assessing the provenance trail can add trust. The transfer of personal data can be assessed before the data is accessed, and legal audits can be performed automatically based on the structured and semantically sound provenance trail. To show its applicability and extensibility for decentralized environments, we will connect it to the Solid ecosystem, giving users full control over their data.


Within the first phase of the DAPSI program, we were able to bootstrap our PROV4ITDaTa project, not only on a technical level, by improving the technology stack and creating a demonstrator at https://prov4itdata.ilabt.imec.be, but also, more importantly, by designing a business model and reflecting on product-market fit.


During the second phase of the DAPSI program, we streamlined the PROV4ITDaTa demonstrator and made it feature-complete. We completed a first set of supported web services and finalized the architecture so that the configurations can easily be extended to other web services.

We promoted our platform not only via academic publications, but also by participating in multiple industry-oriented events (Knowledge Graph Conference and NLnet), in addition to the public DAPSI events.

Being part of the DAPSI program helped us discover new leads and build stronger connections between data portability communities and companies across Europe. We continue having fruitful discussions with multiple data portability stakeholders to this day. Additionally, we started the Comunica Association as a dedicated income stream for our open-source projects to enable sustainable development and maintenance in the future.



Overall, we really value our participation in the first DAPSI phase. The concurrent race with the other contestants provided a constant feedback loop and allowed us to more consciously create a product from state-of-the-art technology. DAPSI allowed us to zoom out and take the bigger picture into account, whilst also providing the means to technically advance our product.

Ben De Meester


Ben De Meester

Postdoctoral Researcher at imec, researching high-quality Linked Data generation from heterogeneous data sources, and one of the leads of the https://rml.io/ toolchain.

Pieter Heyvaert

Development lead and developer advocate at imec, with expertise in high-quality Linked Data generation from heterogeneous data sources as one of the leads of the https://rml.io toolchain, and in Linked Data querying and publishing with Comunica and Solid.

Ruben Verborgh

Professor of Decentralized Web technology at imec, and a research affiliate at the Decentralized Information Group of CSAIL at MIT, acting as a technology advocate for Inrupt and the Solid Ecosystem, and aiming to build a more intelligent generation of clients for a decentralized Web at the intersection of Linked Data and hypermedia-driven Web APIs.


IMEC is an R&D hub for nano- and digital technologies. IMEC believes in the combination of extremely talented people and a world-class infrastructure to enable a prosperous and sustainable future for all.