Extension to the databus-maven-plugin,
used to download datasets already released on the Databus and derive new datasets from them.
mvn databus-derive:clone downloads all data into the build directory target/databus/derive/downloads, and mvn clean deletes the downloaded data again.
Later the plugin will feature better data housekeeping options, such as persisting some downloads in ${user.home}/.m2/databus
and caching frequently used conversions.
working beta
- we are using it to derive the monthly DBpedia releases from the MARVIN pre-releases (parse and remove syntax errors)
- Code overlaps with the Databus Client. Merging and refactoring is required.
Current caveats:
- putting a version that does not exist into <version> will result in a java.util.NoSuchElementException: QueryIterPlainWrapper
- using is much less flexible than just giving the SPARQL query
- can only parse RDF-NTriples as .bz2; other data is simply copied
- By default, parsing data and cloning (e.g. creating new versions for the databus-maven-plugin) are deactivated. You can turn them on with -DskipParsing=false -DskipCloning=false, but beware that the plugin then needs three times the space of the data (this makes debugging easier).
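The space caveat can be checked before enabling parsing and cloning. The following is a minimal sketch, assuming that "three times the space" means three times the size of the data already downloaded into target/databus/derive/downloads by a previous run with the default settings. The helper names required_kb and dir_size_kb are hypothetical and not part of the plugin.

```shell
#!/bin/sh
# Sketch only: estimates disk usage from the download directory named in this README.
# Assumption: total need = 3 x size of the downloaded data.

# required_kb SIZE_KB: prints three times SIZE_KB
required_kb() {
  echo $(( $1 * 3 ))
}

# dir_size_kb DIR: prints the size of DIR in KiB, or 0 if DIR does not exist
dir_size_kb() {
  if [ -d "$1" ]; then
    du -sk "$1" | cut -f1
  else
    echo 0
  fi
}

downloads="target/databus/derive/downloads"
size_kb=$(dir_size_kb "$downloads")
echo "downloaded: ${size_kb} KiB, estimated total need: $(required_kb "$size_kb") KiB"
```

Compare the estimate with the free space reported by df -k . before running with -DskipParsing=false -DskipCloning=false.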
Add the snapshot plugin repository. This is not required if you already use the databus-maven-plugin.
<pluginRepositories>
<pluginRepository>
<id>archiva.internal</id>
<name>Internal Release Repository</name>
<url>http://databus.dbpedia.org:8081/repository/internal</url>
</pluginRepository>
<pluginRepository>
<id>archiva.snapshots</id>
<name>Internal Snapshot Repository</name>
<url>http://databus.dbpedia.org:8081/repository/snapshots</url>
<snapshots>
<updatePolicy>always</updatePolicy>
</snapshots>
</pluginRepository>
</pluginRepositories>
Add the following plugin entry to the BaseBuild (<build>) element in your Maven POM.
<build>
...
<plugins>
<plugin>
<groupId>org.dbpedia.databus</groupId>
<artifactId>databus-derive-maven-plugin</artifactId>
<version>1.0-SNAPSHOT</version>
<executions>
<execution>
<goals>
<goal>clone</goal>
</goals>
</execution>
</executions>
<configuration>
<versions>
<version>https://databus.dbpedia.org/dbpedia/enrichment/mappingbased-literals/2019.03.01</version>
</versions>
<skipParsing>true</skipParsing>
<skipCloning>true</skipCloning>
</configuration>
</plugin>
</plugins>
</build>
Note that you can also use a Maven property as a variable in the configuration. This configuration downloads the whole DBpedia/mappings group.
<properties>
<!-- used for derive plugin, can be anything really -->
<databus.deriveversion>2019.08.01</databus.deriveversion>
</properties>
<configuration>
<versions>
<version>https://databus.dbpedia.org/marvin/mappings/geo-coordinates-mappingbased/${databus.deriveversion}</version>
<version>https://databus.dbpedia.org/marvin/mappings/instance-types/${databus.deriveversion}</version>
<version>https://databus.dbpedia.org/marvin/mappings/mappingbased-literals/${databus.deriveversion}</version>
<version>https://databus.dbpedia.org/marvin/mappings/mappingbased-objects/${databus.deriveversion}</version>
<version>https://databus.dbpedia.org/marvin/mappings/mappingbased-objects-uncleaned/${databus.deriveversion}</version>
<version>https://databus.dbpedia.org/marvin/mappings/specific-mappingbased-properties/${databus.deriveversion}</version>
</versions>
<skipParsing>true</skipParsing>
<skipCloning>true</skipCloning>
</configuration>
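Since a version that does not exist leads to an exception, it can help to print the URLs that the property expands to and open them on the Databus before running the plugin. The sketch below only mirrors the URL pattern used in the configuration above. The function name expand_versions is hypothetical, and the script does not read the POM or contact the Databus.

```shell
#!/bin/sh
# Sketch only: prints the version URLs that result from substituting one
# version string into the marvin/mappings URL pattern shown in this README.

# expand_versions VERSION ARTIFACT...: prints one URL per artifact
expand_versions() (
  version="$1"
  shift
  base="https://databus.dbpedia.org/marvin/mappings"
  for artifact in "$@"; do
    echo "${base}/${artifact}/${version}"
  done
)

expand_versions 2019.08.01 \
  geo-coordinates-mappingbased \
  instance-types \
  mappingbased-literals
```

Each printed line corresponds to one <version> element after Maven has replaced the property.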
Maven manual install
git clone https://github.com/dbpedia/databus-derive.git
cd databus-derive
mvn clean install
Execute the plugin.
# normal
export MAVEN_OPTS="-Xmx24000m"
# with 64 cores, roughly 64 GB to parse a 5 GB bz2 file
export MAVEN_OPTS="-Xmx64000m"
mvn databus-derive:clone
It is also possible to just create cleaned triples and parse logs.
mvn scala:run -Dlauncher=flatRdfParser -DaddArgs="$flat-rdf-in|-o|$flat-rdf-out|-r|$report-out"
or
./flatRdfParser <flat-rdf-in> <flat-rdf-out> <report-out>
For help, add --help.
git clone https://github.com/dbpedia/databus-derive.git
cd databus-derive/ && mvn clean install
wget http://dbpedia-mappings.tib.eu/release/mappings/mappingbased-literals/2019.10.01/mappingbased-literals_lang\=de.ttl.bz2
./flatRdfParser mappingbased-literals_lang=de.ttl.bz2 --discard-warnings > mappingbased-literals_lang=de_cleaned.ttl
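To get a rough idea of how much the parser removed in the example above, the line counts of the input and the cleaned output can be compared. This is a minimal sketch, assuming both files contain one triple per line and that bzcat is installed. The helper names count_lines and dropped are hypothetical.

```shell
#!/bin/sh
# Sketch only: compares line counts of the downloaded file and the cleaned output.
# Assumption: one triple per line in both files.

# count_lines FILE: prints the number of lines in an uncompressed file
count_lines() {
  wc -l < "$1" | tr -d ' '
}

# dropped TOTAL KEPT: prints how many lines were not kept
dropped() {
  echo $(( $1 - $2 ))
}

in_bz2="mappingbased-literals_lang=de.ttl.bz2"
out_ttl="mappingbased-literals_lang=de_cleaned.ttl"

# Only runs when both files from the example exist in the current directory.
if [ -f "$in_bz2" ] && [ -f "$out_ttl" ]; then
  total=$(bzcat "$in_bz2" | wc -l | tr -d ' ')
  kept=$(count_lines "$out_ttl")
  echo "input lines: $total, kept: $kept, dropped: $(dropped "$total" "$kept")"
fi
```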