
Use custom CollectionReader? #20

Open
carschno opened this issue Jul 27, 2016 · 3 comments

@carschno
Member

According to the documentation, DKPro BigData can process 'any file we have a UIMA collection reader for'.
However, the method DkproHadoopDriver.run(String[] args) strictly requires input and output paths to be specified. I have a use case in which the reader should read from a MySQL database (using the JdbcReader), i.e. not from the file system, with the following class:

public class CountFreqsHadoop extends DkproHadoopDriver {

    @Override
    public CollectionReader buildCollectionReader() throws ResourceInitializationException
    {
        return createReader(JdbcReader.class,
                JdbcReader.PARAM_DATABASE, Util.DB_NAME,
                JdbcReader.PARAM_USER, Util.DB_USER,
                JdbcReader.PARAM_PASSWORD, Util.DB_PASS,
                JdbcReader.PARAM_QUERY, Util.QUERY,
                JdbcReader.PARAM_CONNECTION, Util.CONNECTION,
                JdbcReader.PARAM_CONNECTION_PARAMS, Util.CONNECTION_PARAMS,
                JdbcReader.PARAM_DRIVER, Util.DRIVER,
                JdbcReader.PARAM_LANGUAGE, "de");
    }
}

When I submit the job to the Hadoop (or YARN) cluster without any parameters, the process is aborted with the following message:

$ ~/hadoop-2.7.1/bin/hadoop jar zeit-1.9.0-SNAPSHOT.jar CountFreqsHadoop 
Usage: CountFreqsHadoop [hadoop-params] input output [job-params]

When I add dummy parameters (test test), an InvalidInputException is thrown:

$ ~/hadoop-2.7.1/bin/hadoop jar zeit-1.9.0-SNAPSHOT.jar CountFreqsHadoop test test
[...]
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/XXX/test
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:328)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)

[...]

Should this behaviour be fixed or is it currently just not possible to use an input reader like this?

@reckart
Member

reckart commented Jul 28, 2016

Hm, I would say in principle using a non-filesystem-based reader should work. If the implementation currently expects a filesystem-based reader, it may be possible to change that.

Did you try customizing the DkproHadoopDriver so that you could, e.g., specify "-none-" as input and/or output, and have the driver handle that by simply not setting the job input/output?

@carschno
Member Author

I have not tried that yet. I have had some doubts about the role of the CASWritableSequenceFileWriter, which is initialized with CASWritableSequenceFileWriter.PARAM_PATH, inputPath.toString() ("The folder to write the generated XMI files to."). If the input path is set to -none-, PARAM_PATH might default to a random temporary directory in the user's home, I guess.

@reckart
Member

reckart commented Jul 28, 2016

I was rather thinking about adding a condition like

if (!"-none-".equals(args[0])) {
    if (args[0].contains(",")) {
        // multiple comma-separated input paths
        for (String path : args[0].split(",")) {
            FileInputFormat.addInputPath(job, new Path(path));
        }
    }
    else {
        // single input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
    }
}

Hm... it seems rather odd to me that the CASWritableSequenceFileWriter should try writing to inputPath. Maybe changing that to write to a (temporary) HDFS working directory wouldn't hurt anyway.
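As an aside, the argument handling sketched above can be factored into a small, Hadoop-free helper, which makes the "-none-" convention easy to unit-test before wiring it into the driver. The class and method names below are hypothetical, not part of DKPro BigData:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class InputArgs {

    // Hypothetical helper: parses the first CLI argument into a list of
    // input paths. Returns an empty list for the sentinel "-none-", meaning
    // the reader does not consume the filesystem (e.g. a JDBC-backed
    // CollectionReader); comma-separated values yield multiple paths.
    public static List<String> parseInputPaths(String arg) {
        if ("-none-".equals(arg)) {
            return Collections.emptyList();
        }
        return Arrays.asList(arg.split(","));
    }

    public static void main(String[] args) {
        System.out.println(parseInputPaths("-none-"));  // prints []
        System.out.println(parseInputPaths("in1,in2")); // prints [in1, in2]
    }
}
```

The driver would then skip FileInputFormat configuration entirely when the returned list is empty, instead of feeding a fake path to the job.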
