Collections API #1606
Conversation
  K: AvroRecordCodec: Boundable: JsonFormat: ClassTag,
  V: AvroRecordCodec: ClassTag,
  M: JsonFormat: GetComponent[?, Bounds[K]]
](id: ID, rasterQuery: LayerQuery[K, M], numPartitions: Int, indexFilterOnly: Boolean): Seq[(K, V)] with Metadata[M]
This read should not have a numPartitions argument. All the reads should be using threads to distribute the work as much as makes sense, but by definition you only ever get one partition: the collection.

It looks like a lot of code is shared between the collection readers and the RDD readers in S3, File, Hadoop, and Cassandra; HBase and Accumulo are exceptions because they use their own range/batch readers. Reading a collection seems exactly like reading a partition. Maybe we can abstract all of that code into something that reads just the ranges, and the splitting/filtering of those ranges can be up to whichever reader. I'm mostly concerned because this async code is very fiddly and error-prone; I can see us having to refactor it again when we learn something new.
}
}

nondeterminism.njoin(maxOpen = threads, maxQueued = threads) { range map read }.runFoldMap(identity).unsafePerformSync
The readers need to be closed. I'm not sure what the consequences of leaving them open are, but probably nothing good.
Aren't they closed on cache expiration?
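If that is the intent, one way it could look, assuming a Guava cache and a hypothetical RangeReader handle (neither is necessarily what the PR actually uses):

```scala
import java.util.concurrent.TimeUnit
import com.google.common.cache.{Cache, CacheBuilder, RemovalListener, RemovalNotification}

// Hypothetical closable handle around a backend connection.
trait RangeReader extends AutoCloseable {
  def read(index: Long): Option[Array[Byte]]
}

object ReaderCache {
  // Close a reader when it expires out of the cache, so concurrent callers
  // can share it without leaking connections.
  val cache: Cache[String, RangeReader] =
    CacheBuilder.newBuilder()
      .expireAfterAccess(10, TimeUnit.MINUTES)
      .removalListener[String, RangeReader](new RemovalListener[String, RangeReader] {
        def onRemoval(n: RemovalNotification[String, RangeReader]): Unit =
          Option(n.getValue).foreach(_.close())
      })
      .build[String, RangeReader]()
}
```

Note that Guava evicts lazily, so an expired reader is only closed on subsequent cache activity, which may still leave connections open for a while.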
I can see how that would mess with the caching though, if you had multiple async calls to a collection reader. If that's the case, it almost makes it seem like CollectionLayerReader needs to be closable, as it will invariably represent some resources.

The main use case for these is to be called from endpoints that work on "small" areas, so async read calls are squarely within the scope of the design.
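A rough sketch of the closable-reader idea; everything here is illustrative and not the PR's actual API:

```scala
// Illustrative only: a collection reader that owns backend resources
// (connections, thread pools) and therefore has to be closed explicitly
// rather than relying on cache expiration.
trait ClosableCollectionReader[K, V] extends AutoCloseable {
  /** Read all records in the given SFC index ranges whose keys pass the filter. */
  def read(ranges: Seq[(Long, Long)], includeKey: K => Boolean): Vector[(K, V)]
}

object ClosableCollectionReader {
  // Endpoint-style usage: acquire, read, and always release, even if the read throws.
  def withReader[K, V, T](open: () => ClosableCollectionReader[K, V])
                         (f: ClosableCollectionReader[K, V] => T): T = {
    val reader = open()
    try f(reader) finally reader.close()
  }
}
```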
The same story goes for the HBase and Accumulo collection reads; they're based on some version of the range/batch readers.
…by already running jobs on a running cluster
}

object CollectionLayerReader {
  def njoin[K: AvroRecordCodec: Boundable, V: AvroRecordCodec](
We ask for AvroRecordCodec instances but we do not use them. If the readFunc is Long => Option[Array[Byte]] we can spin the iterator here and do the decoding. You'll need an extra refinementFilter: K => Boolean to capture the filterIndexOnly param.
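A sketch of the shape being suggested, assuming the backend passes in a Long => Option[Array[Byte]] readFunc and a refinementFilter; the decode parameter and the surrounding object are stand-ins for the Avro decoding and are not the merged implementation:

```scala
import scalaz.concurrent.{Strategy, Task}
import scalaz.std.vector._
import scalaz.stream.{Process, nondeterminism}

object CollectionNjoinSketch {
  // Decoding and key refinement happen inside njoin, so each backend only
  // has to supply a way to fetch the raw bytes for one SFC index.
  def njoin[K, V](
    indices: Seq[Long],
    readFunc: Long => Option[Array[Byte]],
    decode: Array[Byte] => Vector[(K, V)],  // stands in for the Avro decoding
    refinementFilter: K => Boolean,         // captures the filterIndexOnly param
    threads: Int
  ): Vector[(K, V)] = {
    // A dedicated pool sized to `threads` would be more realistic than the default.
    implicit val strategy: Strategy = Strategy.DefaultStrategy

    val reads: Process[Task, Process[Task, Vector[(K, V)]]] =
      Process.emitAll(indices).map { index =>
        Process.eval(Task.delay {
          readFunc(index)
            .map(bytes => decode(bytes).filter { case (k, _) => refinementFilter(k) })
            .getOrElse(Vector.empty)
        })
      }

    nondeterminism
      .njoin(maxOpen = threads, maxQueued = threads)(reads)
      .runFoldMap(identity)
      .unsafePerformSync
  }
}
```

With the decoding pulled in here, the per-backend collection readers shrink to supplying readFunc and a thread count.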
This is the list of places where it looks like we should use it:
They cannot be used in:
…because one or another of those backends reads ranges rather than (K, V). I think that's fine; we don't need to abstract over that.
It cannot be used in HadoopRDDReader, as we use an MR job there.
…n/geotrellis into feature/collections-api
# Conflicts:
#	accumulo/src/main/scala/geotrellis/spark/io/accumulo/AccumuloCollectionReader.scala
+1, submitted a meta-PR that tidies up the njoin signature a little.
Feature/collections api njoin
#1605
Backends support: