Better HDFS Support #1556
Conversation
…ly, reducing memory pressure
@@ -43,7 +43,7 @@ object HadoopRDDWriter extends LazyLogging {
      MapFile.Writer.keyClass(classOf[LongWritable]),
      MapFile.Writer.valueClass(classOf[BytesWritable]),
      MapFile.Writer.compression(SequenceFile.CompressionType.NONE))
-    writer.setIndexInterval(1)
+    writer.setIndexInterval(32)
How would it affect query time? (just curious)
Layer query time would not be affected at all by this. When we're reading off ranges we're already seeking through the file, so not having as many index points would have minimal impact, if any. This is going to have more of an impact on random access through the value reader. Spinning up a cluster to figure that part out. For reference, the default interval value is 128.
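For context, here is a rough sketch (not the code in this PR) of what a single random-access read against one of these map files looks like; the layer path and key value are made up for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile}

object MapFileLookupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical map file directory inside a layer; the PR opens one reader per map file.
    val reader = new MapFile.Reader(new Path("hdfs:///catalog/layer/part-r-00000"), conf)
    try {
      // A MapFile keeps every Nth key (N = the index interval set at write time) in a
      // small index file that the reader loads into memory. get() binary-searches that
      // index, seeks to the nearest indexed record, then scans forward at most N records,
      // so a larger interval means a smaller in-memory index at the cost of a slightly
      // longer scan per random lookup.
      val key   = new LongWritable(42L)
      val value = new BytesWritable()
      if (reader.get(key, value) != null)  // get() returns null when the key is absent
        println(s"record size: ${value.getLength} bytes")
    } finally reader.close()
  }
}
```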
…s on attributeStore type
…ks in cache lookups
/**
 * When record being written would exceed the block size of the current MapFile
 * opens a new file to continue writing. This allows to split partition into block-sized
 * chunks without foreknowledge of how bit it is.
`bit` => `big`
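To make the roll-over behavior described in the doc comment above concrete, here is a minimal sketch of the idea, not the PR's actual implementation; the `RollingMapFileWriter` name, the 64 MB threshold, and the part-file naming are assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile, SequenceFile}

class RollingMapFileWriter(layerPath: Path, conf: Configuration,
                           blockSize: Long = 64L * 1024 * 1024) {  // assumed 64 MB target
  private var writer: MapFile.Writer = null
  private var bytesWritten = 0L
  private var part = 0

  // Close the current MapFile (if any) and start a fresh one for the next chunk.
  private def openNext(): Unit = {
    if (writer != null) writer.close()
    writer = new MapFile.Writer(conf, new Path(layerPath, f"part-$part%05d"),
      MapFile.Writer.keyClass(classOf[LongWritable]),
      MapFile.Writer.valueClass(classOf[BytesWritable]),
      MapFile.Writer.compression(SequenceFile.CompressionType.NONE))
    writer.setIndexInterval(32)
    bytesWritten = 0L
    part += 1
  }

  def write(key: LongWritable, value: BytesWritable): Unit = {
    // Roll over when appending this record would push the current file past the target
    // size, so a partition splits into roughly block-sized map files without knowing
    // its total size up front.
    if (writer == null || bytesWritten + value.getLength > blockSize) openNext()
    writer.append(key, value)
    bytesWritten += value.getLength
  }

  def close(): Unit = if (writer != null) writer.close()
}
```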
+1 after comments addressed
This PR improves HDFS layer writing and random value reading support.
`HadoopValueReader` now opens up `MapFile.Reader`s for each map file in the layer. These readers cache the index of available keys, so they are able to provide quick lookups. This replaces and improves on the previous method of using `FileInputFormat` to query for a single record.
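As a rough illustration of that reader-per-map-file caching idea (a sketch only, not the PR's `HadoopValueReader`; the `CachedLayerReader` name, the layer layout, and the key type are assumptions):

```scala
import scala.collection.concurrent.TrieMap
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile}

class CachedLayerReader(layerPath: Path, conf: Configuration) {
  // One reader per map file directory, opened lazily and kept for the life of the process.
  private val readers = TrieMap.empty[Path, MapFile.Reader]

  // Directories such as part-00000, part-00001, ... under the layer path.
  private def mapFileDirs: Seq[Path] = {
    val fs = layerPath.getFileSystem(conf)
    fs.listStatus(layerPath).filter(_.isDirectory).map(_.getPath).toSeq
  }

  // Opening a MapFile.Reader loads its key index into memory once, so repeated
  // lookups only pay for a seek into the data file rather than a full scan.
  private def readerFor(dir: Path): MapFile.Reader =
    readers.getOrElseUpdate(dir, new MapFile.Reader(dir, conf))

  def read(index: Long): Option[Array[Byte]] = {
    val key   = new LongWritable(index)
    val value = new BytesWritable()
    // Naively probe each map file until one contains the key; the PR narrows the
    // candidate files down further than this sketch does.
    mapFileDirs.iterator
      .map(dir => readerFor(dir).get(key, value))
      .collectFirst { case found if found != null => value.copyBytes() }
  }
}
```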
`HadoopRDDWriter` has multiple improvements.

Current testing shows:
jobs using the `GroupedCumulativeIterator` approach are able to complete in settings where `.groupBy`/sort/write jobs are killed for memory violations.

Future improvements that are on the mind:
- … `HadoopValueReader` … and to produce quicker lookups in most cases.
- In `HadoopValueReader`, when fetching a record, cache all the tiles that share that index before filtering down to a single tile. These records are very likely to be asked for next and should be stored in an LRU cache (a rough sketch follows below).
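A minimal sketch of the LRU-cache idea from the last bullet, built on `java.util.LinkedHashMap`; the `LruCache` helper, the cache size, and the index-to-tiles layout are illustrative assumptions, not the PR's code:

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import java.util.Map.Entry

// An access-ordered LinkedHashMap gives least-recently-used eviction almost for free.
class LruCache[K, V](maxEntries: Int) {
  private val underlying =
    new JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
      override def removeEldestEntry(eldest: Entry[K, V]): Boolean = size() > maxEntries
    }

  def getOrElseUpdate(key: K)(load: => V): V = underlying.synchronized {
    Option(underlying.get(key)).getOrElse {
      val loaded = load
      underlying.put(key, loaded)
      loaded
    }
  }
}

object TileGroupCacheSketch {
  // Hypothetical: one MapFile read pulls back every tile stored under a spatial index;
  // keeping the whole group means the next nearby request is served from memory.
  type TileBytes = Array[Byte]
  val tileGroups = new LruCache[Long, Vector[(Long, TileBytes)]](maxEntries = 32)

  // `readTileGroup` stands in for the actual MapFile read in the value reader.
  def readTile(index: Long, tileId: Long)
              (readTileGroup: Long => Vector[(Long, TileBytes)]): Option[TileBytes] =
    tileGroups.getOrElseUpdate(index)(readTileGroup(index)).collectFirst {
      case (id, bytes) if id == tileId => bytes
    }
}
```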