This repository has been archived by the owner on Oct 29, 2023. It is now read-only.

Update code to use Dataflow's new support for custom sources #61

Open
deflaux opened this issue May 4, 2015 · 1 comment

@deflaux
Contributor

deflaux commented May 4, 2015

We manually create data shards right now via --references (or --allReferences) and --basesPerShard from ShardOptions.

Updating to custom sources will allow the data shards to be not only created but also re-sharded dynamically.

https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/Source

https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/Source.java
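
For context, the manual sharding described above amounts to cutting each requested reference range into fixed windows of `--basesPerShard` bases. The following is only a minimal sketch of that idea, not the repository's actual `ShardOptions` code: the `Shard` type, the `shardRange` method, and the example coordinates are all hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ShardSketch {
    /** Hypothetical half-open interval [start, end) on a named reference. */
    static final class Shard {
        final String reference;
        final long start;
        final long end;

        Shard(String reference, long start, long end) {
            this.reference = reference;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return reference + ":" + start + "-" + end;
        }
    }

    /**
     * Splits [start, end) into consecutive shards of at most basesPerShard bases.
     * The final shard is shorter when the range is not an exact multiple.
     */
    static List<Shard> shardRange(String reference, long start, long end, long basesPerShard) {
        if (basesPerShard <= 0) {
            throw new IllegalArgumentException("basesPerShard must be positive");
        }
        List<Shard> shards = new ArrayList<>();
        for (long s = start; s < end; s += basesPerShard) {
            shards.add(new Shard(reference, s, Math.min(s + basesPerShard, end)));
        }
        return shards;
    }

    public static void main(String[] args) {
        // Illustrative values only; not taken from the issue.
        for (Shard shard : shardRange("chr17", 41196311L, 41277499L, 30000L)) {
            System.out.println(shard);
        }
    }
}
```

Because the shard boundaries here are fixed before the pipeline starts, a slow shard cannot be split further at run time, which is the limitation that dynamic re-sharding through a custom source is meant to remove.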

@pgrosu

pgrosu commented May 4, 2015

That would be awesome! I mentioned something along these lines a while ago here, and I'm very excited to see it:

googlegenomics/spark-examples#49 (comment)

Maybe a function could take the size of the requested region and dynamically compute the value for --basesPerShard.

Thanks and very excited to see the results!
~p
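
One possible reading of the suggestion above, sketched under assumptions: derive the shard size from the region size by aiming for a target number of shards and clamping the result to sensible bounds. The method name `basesPerShard` and the parameters `targetShards`, `minBases`, and `maxBases` are hypothetical and do not come from the issue or from the Dataflow SDK.

```java
class ShardSizeSketch {
    /**
     * Picks a shard size so that the region splits into roughly targetShards
     * pieces, clamped to the range [minBases, maxBases].
     */
    static long basesPerShard(long regionSize, int targetShards, long minBases, long maxBases) {
        if (regionSize <= 0 || targetShards <= 0 || minBases <= 0 || maxBases < minBases) {
            throw new IllegalArgumentException("arguments must be positive and minBases <= maxBases");
        }
        // Ceiling division so the shards cover the whole region.
        long even = (regionSize + targetShards - 1) / targetShards;
        return Math.max(minBases, Math.min(maxBases, even));
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(basesPerShard(1_000_000L, 10, 10_000L, 1_000_000L));
    }
}
```

A heuristic like this still fixes the shard size up front; it only removes the need to choose `--basesPerShard` by hand for each region.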