Dataset Dependencies

Datasets in Maven can be published either as conventional JAR archives or as artifacts with a specific type, such as csv.bz2 or ttl.gz.

If a dataset is packaged as a conventional JAR file, a conventional dependency declaration is enough to place the contained file(s) on the classpath.
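
As a minimal sketch of the JAR case, the fragment below declares such a dependency. The coordinates are hypothetical: they reuse the example model's coordinates and assume the dataset is also published as a JAR, which may not be the case for a real artifact.

```xml
<!-- Hypothetical coordinates: assumes the dataset is available as a jar artifact -->
<dependencies>
  <dependency>
    <groupId>org.example.ml.models</groupId>
    <artifactId>resilience</artifactId>
    <version>2022-05-02.1-SNAPSHOT</version>
    <!-- 'jar' is the default dependency type, so no <type> element is needed -->
  </dependency>
</dependencies>
```

With a declaration like this, the files inside the JAR can be loaded as ordinary classpath resources, without any additional plugin configuration.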

Otherwise, the maven-dependency-plugin:copy goal can be used to place a set of typed artifacts in the right location. Typically, one wants datasets from typed artifacts to end up in the same location as files placed under src/main/resources. This is accomplished by configuring maven-dependency-plugin to copy the artifacts to the output directory ${project.build.outputDirectory}. Setting stripVersion=true produces a file whose name is independent of the dependency version, which makes it easy to reference from source code. A complete example is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<project
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd"
  xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.github.myaccount</groupId>
  <artifactId>ml-project</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <version>3.1.1</version>
        <executions>
          <execution>
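            <!-- Bound to 'process-resources' so that the file is in target/classes
              before compilation, tests and packaging; binding to 'install'
              would copy it only after the jar has already been built -->
            <phase>process-resources</phase>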
            <goals>
              <goal>copy</goal>
            </goals>
            <!-- For a reference of all configuration options for the 'copy' goal 
              refer to: https://maven.apache.org/plugins/maven-dependency-plugin/copy-mojo.html -->
            <configuration>
              <!--
              Setting 'stripVersion=true' results in the file
                target/classes/resilience.bz2 
              whereas 'stripVersion=false' results in the file
                target/classes/resilience-2022-05-02.1-SNAPSHOT.bz2
              The former file name is easier to reference from code
              -->
              <stripVersion>true</stripVersion>
              <artifactItems>
                <artifactItem>
                  <groupId>org.example.ml.models</groupId>
                  <artifactId>resilience</artifactId>
                  <version>2022-05-02.1-SNAPSHOT</version>
                  <type>bz2</type>
                  <!-- The output directory resolves to 'target/classes',
                    the same place where files under src/main/resources end up -->
                  <outputDirectory>${project.build.outputDirectory}</outputDirectory>
                </artifactItem>
              </artifactItems>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

</project>