
Better HDFS Support #1556

Merged

Conversation

echeipesh
Contributor

@echeipesh echeipesh commented Jun 21, 2016

This PR improves HDFS layer writing and random-value reading support.

HadoopValueReader now opens a MapFile.Reader for each map file in the layer. These readers cache the index of available keys, so they can serve quick lookups. This replaces the previous method of using FileInputFormat to query for a single record.
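A minimal sketch of this lookup pattern, assuming Hadoop's `MapFile` API; the names `MapFileLookup` and `layerPath` are illustrative, not the actual `HadoopValueReader` internals:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile}

// Hypothetical sketch: open one MapFile.Reader per map file in the layer.
// Each reader keeps the MapFile's key index in memory, so repeated lookups
// seek directly into the data file instead of re-running an input format.
class MapFileLookup(layerPath: Path, conf: Configuration) {
  private val fs = layerPath.getFileSystem(conf)

  private val readers: Vector[MapFile.Reader] =
    fs.listStatus(layerPath)
      .filter(_.isDirectory)
      .map(status => new MapFile.Reader(status.getPath, conf))
      .toVector

  def lookup(index: Long): Option[Array[Byte]] = {
    val key = new LongWritable(index)
    val value = new BytesWritable()
    // MapFile.Reader.get returns null when the key is absent from that file.
    readers.iterator
      .map(reader => reader.get(key, value))
      .collectFirst { case found if found != null => value.copyBytes() }
  }
}
```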

HadoopRDDWriter has multiple improvements (see the sketch after this list):

  • Accomplishes its task with a single shuffle step.
    • This is possible because, instead of estimating the number of blocks required by counting, we simply roll over to a new file when the record being written is about to surpass the block boundary.
    • Instead of using a groupBy, we sort the records on shuffle, using the IO index to partition them.
  • Writes happen by mapping over the partition iterator. This is possible because the partition arrives pre-sorted, and it greatly reduces memory pressure because records can be garbage-collected as soon as they are written.
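A hypothetical sketch of the roll-over strategy; `writePartition`, the part-file naming, and the default block size are illustrative, not the PR's actual code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile, SequenceFile}

// Stream a pre-sorted partition of (index, bytes) records, starting a new
// MapFile whenever the next record would push the current file past the
// HDFS block boundary. No up-front counting is needed.
def writePartition(
  records: Iterator[(Long, Array[Byte])],
  layerPath: Path,
  conf: Configuration,
  blockSize: Long = 64L * 1024 * 1024
): Unit = {
  var writer: MapFile.Writer = null
  var bytesWritten = 0L
  var fileCount = 0

  def rollOver(): Unit = {
    if (writer != null) writer.close()
    fileCount += 1
    bytesWritten = 0L
    writer = new MapFile.Writer(
      conf,
      new Path(layerPath, f"part-r-$fileCount%05d"),
      MapFile.Writer.keyClass(classOf[LongWritable]),
      MapFile.Writer.valueClass(classOf[BytesWritable]),
      MapFile.Writer.compression(SequenceFile.CompressionType.NONE))
  }

  for ((index, bytes) <- records) {
    // Roll over before the current file would surpass the block boundary.
    if (writer == null || bytesWritten + bytes.length > blockSize) rollOver()
    writer.append(new LongWritable(index), new BytesWritable(bytes))
    bytesWritten += bytes.length
  }
  if (writer != null) writer.close()
}
```

On the Spark side this pairs with a single sort-on-shuffle (e.g. `repartitionAndSortWithinPartitions` with an index-derived partitioner), after which each pre-sorted partition iterator can be consumed by a writer like the one above.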

Current testing shows:

  • HDFS ingest speeds exceed those seen with Accumulo.
  • Tile-fetching latency is visually acceptable.
  • Jobs using the GroupedCumulativeIterator approach complete in settings where .groupBy/sort/write jobs are killed for memory violations.

Future improvements on our minds:

  • Save a bloom filter index for layer MapFiles to reduce memory requirements for HadoopValueReader and to provide quicker lookups in most cases.
  • In HadoopValueReader, when fetching a record, cache all the tiles that share that index before filtering down to a single tile. Those records are very likely to be asked for next and should be stored in an LRU cache (see the sketch after this list).
  • Devise a method to compare KeyIndex instances so that, if the RDD to be saved is already partitioned by a compatible index (one where all key boundaries are shared), the extra shuffle can be avoided.
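For the LRU-cache idea, a minimal sketch built on `java.util.LinkedHashMap`'s access-order mode; `TileLruCache` and its capacity handling are hypothetical, not existing GeoTrellis code:

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import java.util.Map.{Entry => JEntry}

// Hypothetical LRU cache: when one tile at an index is fetched, all tiles
// sharing that index could be cached, since neighbors are likely requested next.
class TileLruCache[K, V](capacity: Int) {
  // accessOrder = true makes iteration order least-recently-accessed first,
  // so removeEldestEntry evicts the least recently used entry.
  private val underlying =
    new JLinkedHashMap[K, V](capacity, 0.75f, true) {
      override def removeEldestEntry(eldest: JEntry[K, V]): Boolean =
        size() > capacity
    }

  def getOrElseUpdate(key: K)(fetch: => V): V = synchronized {
    Option(underlying.get(key)).getOrElse {
      val value = fetch
      underlying.put(key, value)
      value
    }
  }
}
```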

```diff
@@ -43,7 +43,7 @@ object HadoopRDDWriter extends LazyLogging {
   MapFile.Writer.keyClass(classOf[LongWritable]),
   MapFile.Writer.valueClass(classOf[BytesWritable]),
   MapFile.Writer.compression(SequenceFile.CompressionType.NONE))
-  writer.setIndexInterval(1)
+  writer.setIndexInterval(32)
```
Member

How would it affect query time? (just curious)

Contributor Author

Layer query time would not be affected at all by this. When we're reading off ranges we're already seeking through the file, so having fewer index points would have minimal impact, if any. This will have more of an impact on random access through the value reader; spinning up a cluster to figure that part out. For reference, the default interval value is 128.
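To make that trade-off concrete, a rough, hypothetical back-of-the-envelope (entry sizes assumed, not measured):

```scala
// Each in-memory MapFile index entry is roughly a LongWritable key plus a
// long file offset (~16 bytes, assumed). For a file of N records, interval 1
// keeps N entries resident, interval 32 keeps N / 32, the default 128 keeps N / 128.
val n = 100000L               // records per map file (illustrative)
val entryBytes = 16L          // approximate bytes per index entry (assumed)
val atInterval1   = n * entryBytes         // ~1.6 MB per file
val atInterval32  = n / 32 * entryBytes    // ~50 KB per file
val atInterval128 = n / 128 * entryBytes   // ~12.5 KB per file
```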

@echeipesh echeipesh changed the title from "[WIP] Better HDFS Support" to "Better HDFS Support" on Jul 5, 2016
```scala
/**
 * When record being written would exceed the block size of the current MapFile
 * opens a new file to continue writing. This allows to split partition into block-sized
 * chunks without foreknowledge of how bit it is.
 */
```
Member

bit => big

@lossyrob
Member

lossyrob commented Jul 5, 2016

+1 after comments addressed

@echeipesh echeipesh merged commit 41ba42d into locationtech:master Jul 5, 2016
@lossyrob lossyrob added this to the 1.0 milestone Oct 18, 2016