Command: integrate

rpt integrate is the command for mixed RDF data and SPARQL statement processing. The name stems from the various SPARQL extensions that make it possible to reference and process non-RDF data inside SPARQL statements.

Basic Usage

Example 1: Simple Processing

rpt integrate file1.ttl 'INSERT DATA { eg:s eg:p eg:o }' spo.rq

The command above does the following:

  • It loads file1.ttl (into the default graph)
  • It runs the given SPARQL update statement which adds a triple. For convenience, RPT includes a static copy of prefixes from prefix.cc. The prefix eg is defined as http://www.example.org/.
  • It executes the query in the file spo.rq, which is CONSTRUCT WHERE { ?s ?p ?o }, and prints the result. To be precise, spo.rq is a file provided in the JAR bundle (a class path resource). RPT ships with several predefined queries for common use cases.
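With the prefix eg expanded to http://www.example.org/, the update from the example is equivalent to:

INSERT DATA { <http://www.example.org/s> <http://www.example.org/p> <http://www.example.org/o> }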

Notes

  • If you want RPT to print out the result of a query then you need to provide a query! If you omit spo.rq in the example above, rpt will only run the loading and the update.
  • As alternatives for spo.rq, you can use gspo.rq to print out all quads and spogspo.rq to print out the union of triples and quads.
  • The file extension .rq stands for RDF query. Likewise .ru stands for RDF update.
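For instance, the predefined queries can be combined with quad-based input (data.trig here is a placeholder file name):

rpt integrate data.trig gspo.rq

This loads the TriG file and prints out all quads.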

Example 2: Starting a server

rpt integrate --server

This command starts a SPARQL server, by default on port 8642. Use e.g. --port 8003 to pick a different one. You can mix this with the arguments from the first example.
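Once the server is running, it can be queried like any SPARQL endpoint, e.g. with curl. The endpoint path /sparql is an assumption here; adjust it to your setup:

curl 'http://localhost:8642/sparql' --data-urlencode 'query=SELECT * { ?s ?p ?o }'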

Example 3: Loading RDF into Named Graphs

The option graph= (no whitespace before the =) sets the target graph for all subsequent triple-based RDF files. To switch back to the default graph, use graph= followed by whitespace, or simply graph.

rpt integrate graph=urn:foo file1.nt file2.ttl 'graph=http://www.example.org/' file3.nt.bz2 graph file4.ttl.gz

For quad-based data, the SPARQL MOVE statement can be used to post-process the data after loading.

rpt integrate data.trig 'MOVE <urn:foo> TO <urn:bar>'
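For illustration, suppose data.trig contains a single quad in graph urn:foo (hypothetical sample data):

<urn:foo> { <urn:s> <urn:p> <urn:o> . }

After the MOVE statement, this triple resides in graph urn:bar and urn:foo is empty. Note that, per SPARQL 1.1 Update semantics, MOVE also discards any previous contents of the target graph.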

Example 4: Using a different RDF Database Engine

RPT can run the RDF loading and SPARQL query execution on different (embedded) engines.

rpt integrate --db-engine tdb2 --db-loc mystores/mydata --db-keep file.ttl spo.rq

rpt integrate -e tdb2 --loc mystores/mydata --db-keep file.ttl spo.rq

By default, rpt integrate uses the in-memory engine. The --db-engine option (short -e) allows choosing a different RDF engine. For engines that require a file or a database folder, the location can be uniformly specified with --db-loc (short --loc). By default, RPT deletes data it created itself, but it will never delete existing data. The flag --db-keep instructs RPT to keep a database it created after termination.

Example 5: SPARQL Proxy

You can quickly launch a SPARQL proxy with the combination of -e and --server:

rpt integrate -e remote --loc https://dbpedia.org/sparql --server

The proxy gives you a Yasgui frontend and the Linked Data Viewer.

Endpoints protected by basic authentication can be proxied by supplying the credentials with the URL:

rpt integrate -e remote --loc https://USER:PWD@dbpedia.org/sparql --server

Note that embedding credentials in the URL is unsafe and should be avoided in production, but it can be useful during development.

Example 6: Indexing Spatial Data

Data that follows the GeoSPARQL standard can be indexed by providing the --geoindex option. After any data modification, the index is automatically updated on the first query or update request that needs to access it. Statements that intrinsically do not rely on the spatial index, namely LOAD, INSERT DATA and DELETE DATA, mark the spatial index as potentially dirty but do not trigger immediate index recreation.

rpt integrate --server --geoindex spatial-data.ttl
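A sketch of a query that makes use of the spatial index, assuming the data models geometries via geo:hasGeometry and geo:asWKT as defined by GeoSPARQL (the polygon is an arbitrary example):

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
SELECT ?feature WHERE {
  ?feature geo:hasGeometry/geo:asWKT ?wkt .
  FILTER(geof:sfIntersects(?wkt, "POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))"^^geo:wktLiteral))
}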

Embedded SPARQL Engines

Embedded SPARQL engines are built into RPT and thus readily available. The following engines are currently available:

  • mem: The default in-memory engine based on Apache Jena. Data is discarded once the RPT process terminates.
  • tdb2: Apache Jena's TDB2 persistent engine. Use --loc to specify the database folder.
  • binsearch: Binary search engine that operates directly on sorted N-Triples files. Use --loc to specify the file path or HTTP(S) URL of the N-Triples file. For URLs, HTTP range requests must be supported!
  • remote: A pseudo engine that forwards all processing to the SPARQL endpoint whose URL is specified with --loc.
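For example, the binsearch engine can answer queries directly from a sorted N-Triples file (data.sorted.nt is a placeholder name):

rpt integrate -e binsearch --loc data.sorted.nt spo.rq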

(ARQ) Engine Configuration

The engines mem, tdb2 and binsearch build on Jena’s query engine ARQ and thus respect its configuration.

rpt integrate --set 'arq:queryTimeout=60000' myQuery.rq

