
Use custom CollectionReader? #20

Open
carschno opened this issue Jul 27, 2016 · 3 comments

@carschno
Member

According to the documentation, DKPro BigData can process 'any file we have a UIMA collection reader for'.
However, the method DkproHadoopDriver.run(String[] args) strictly requires input and output paths to be specified. I have a use case in which the reader should read from a MySQL database (using the JdbcReader), i.e. not from the file system, with the following class:

public class CountFreqsHadoop extends DkproHadoopDriver {

    @Override
    public CollectionReader buildCollectionReader() throws ResourceInitializationException
    {
        return createReader(JdbcReader.class,
                JdbcReader.PARAM_DATABASE, Util.DB_NAME,
                JdbcReader.PARAM_USER, Util.DB_USER,
                JdbcReader.PARAM_PASSWORD, Util.DB_PASS,
                JdbcReader.PARAM_QUERY, Util.QUERY,
                JdbcReader.PARAM_CONNECTION, Util.CONNECTION,
                JdbcReader.PARAM_CONNECTION_PARAMS, Util.CONNECTION_PARAMS,
                JdbcReader.PARAM_DRIVER, Util.DRIVER,
                JdbcReader.PARAM_LANGUAGE, "de");
    }
}

When I submit the job to the Hadoop (or YARN) cluster without any parameters, the process is aborted with the following message:

$ ~/hadoop-2.7.1/bin/hadoop jar zeit-1.9.0-SNAPSHOT.jar CountFreqsHadoop 
Usage: CountFreqsHadoop [hadoop-params] input output [job-params]

When I add dummy parameters (test test), an InvalidInputException is thrown:

$ ~/hadoop-2.7.1/bin/hadoop jar zeit-1.9.0-SNAPSHOT.jar CountFreqsHadoop test test
[...]
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/XXX/test
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:328)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)

[...]

Should this behaviour be fixed or is it currently just not possible to use an input reader like this?

@reckart
Member

reckart commented Jul 28, 2016

Hm, I would say in principle using a non-filesystem-based reader should work. If the implementation currently expects a filesystem-based reader, it may be possible to change that.

Did you try customizing the DkproHadoopDriver so that you could, e.g., specify "-none-" as input and/or output, and have the driver handle that by simply not setting the job input/output?

@carschno
Member Author

I have not tried that yet. I have had some doubts about the role of the CASWritableSequenceFileWriter, which is initialized with CASWritableSequenceFileWriter.PARAM_PATH, inputPath.toString() ("The folder to write the generated XMI files to."). If the input path is set to -none-, PARAM_PATH might default to a random temporary directory in the user's home, I guess.

@reckart
Member

reckart commented Jul 28, 2016

I was rather thinking about adding a condition like

if (!"-none-".equals(args[0])) {
    if (args[0].contains(",")) {
        // multiple comma-separated input paths
        for (String path : args[0].split(",")) {
            FileInputFormat.addInputPath(job, new Path(path));
        }
    }
    else {
        // single input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
    }
}

Hm... it seems rather odd to me that the CASWritableSequenceFileWriter should try writing to inputPath. Maybe changing that to write to a (temporary) HDFS working directory wouldn't hurt anyway.
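As an aside, the argument handling sketched above can be factored into a small, Hadoop-free helper, which makes the "-none-" convention easy to unit-test before wiring it into the driver. The class and method names below are hypothetical, not part of DKPro BigData:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class InputArgs {

    // Hypothetical helper: parses the first CLI argument into a list of
    // input paths. Returns an empty list for the sentinel "-none-", meaning
    // the reader does not consume the filesystem (e.g. a JDBC-backed
    // CollectionReader); comma-separated values yield multiple paths.
    public static List<String> parseInputPaths(String arg) {
        if ("-none-".equals(arg)) {
            return Collections.emptyList();
        }
        return Arrays.asList(arg.split(","));
    }

    public static void main(String[] args) {
        System.out.println(parseInputPaths("-none-"));  // prints []
        System.out.println(parseInputPaths("in1,in2")); // prints [in1, in2]
    }
}
```

The driver would then skip FileInputFormat configuration entirely when the returned list is empty, instead of feeding a fake path to the job.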
