ETL Refactor #1553

Merged: 22 commits into locationtech:master on Aug 19, 2016
Conversation

@pomadchin (Member) commented Jun 17, 2016

Move ETL to use JSON configuration

  • classes to read / write JSON
  • refactor ETL input / output plugins to work with the new inputs
  • update documentation
  • deprecate Scallop
  • test on a real ingest (chatta-demo ingest / EMR)
  • refactor the landsat demo to work with this PR
  • refactor the chatta demo to work with this PR
  • output into a separate file
  • improve JSON validation (better exceptions and schema validation(?))
  • address @echeipesh's comments

@pomadchin (Member, Author) commented Jun 17, 2016

Complete JSON example:

datasets.json:

[  
   {  
      "name":"test",
      "ingestType":{  
         "format":"geotiff",
         "inputCredentials":"inputCredentials name",
         "output":"hadoop",
         "outputCredentials":"outputCredentials name",
         "input":"hadoop"
      },
      "path":{  
         "input":"input",
         "output":"output"
      },
      "cache":"NONE",
      "ingestOptions":{  
         "breaks":"0:ffffe5ff;0.1:f7fcb9ff;0.2:d9f0a3ff;0.3:addd8eff;0.4:78c679ff;0.5:41ab5dff;0.6:238443ff;0.7:006837ff;1:004529ff",
         "reprojectMethod":"buffered",
         "cellSize":{  
            "width":256.0,
            "height":256.0
         },
         "encoding":"geotiff",
         "tileSize":256,
         "layoutExtent":{  
            "xmin":1.0,
            "ymin":2.0,
            "xmax":3.0,
            "ymax":4.0
         },
         "resolutionThreshold":0.1,
         "pyramid":true,
         "resampleMethod":"nearest-neighbor",
         "keyIndexMethod":{  
            "type":"zorder"
         },
         "layoutScheme":"tms",
         "cellType":"int8",
         "crs":"+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs "
      }
   }
]

credentials.json:

{
  "accumulo": [{
    "name": "name",
    "zookeepers": "zookeepers",
    "instance": "instance",
    "user": "user",
    "password": "password"
  }],
  "cassandra": [{
    "name": "name",
    "allowRemoteDCsForLocalConsistencyLevel": false,
    "localDc": "datacenter1",
    "usedHostsPerRemoteDc": 0,
    "hosts": "hosts",
    "replicationStrategy": "SimpleStrategy",
    "replicationFactor": 1,
    "user": "user",
    "password": "password"
  }],
  "s3": [],
  "hadoop": []
}
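
A minimal sketch (not code from this PR) of how datasets.json above might be modelled with spray-json, which the ETL module already uses; all class and field names here are illustrative guesses from the example, and ingestOptions is deliberately left as raw JSON:

import spray.json._
import spray.json.DefaultJsonProtocol._

object DatasetJsonSketch {
  // Case classes mirroring the datasets.json example above (names are illustrative only).
  case class IngestType(format: String, input: String, output: String,
                        inputCredentials: Option[String], outputCredentials: Option[String])
  case class IngestPath(input: String, output: String)
  case class Dataset(name: String, ingestType: IngestType, path: IngestPath,
                     cache: String, ingestOptions: JsValue) // ingestOptions kept untyped here

  implicit val ingestTypeFormat: RootJsonFormat[IngestType] = jsonFormat5(IngestType)
  implicit val ingestPathFormat: RootJsonFormat[IngestPath] = jsonFormat2(IngestPath)
  implicit val datasetFormat: RootJsonFormat[Dataset]       = jsonFormat5(Dataset)

  // Parse the whole datasets.json (a JSON array) into a List[Dataset].
  def load(jsonText: String): List[Dataset] = jsonText.parseJson.convertTo[List[Dataset]]
}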

@lossyrob (Member):
Does this allow for additional backends to be supplied via code in client projects?

@pomadchin changed the title from [WIP] ETL Refactor to ETL Refactor on Jun 20, 2016
@pomadchin (Member, Author):
should include #1494

@pomadchin changed the title from ETL Refactor to [WIP] ETL Refactor on Jun 22, 2016
@RickMohr commented Jun 22, 2016

Will there be validation for the input JSON? If I mis-spell resampleMethod as resampleMetod, it would be fantastic to get an "Unknown property - resampleMetod" error rather than a silent, baffling result.

(There are JSON schema validators for most languages, listed at http://json-schema.org/implementations.html. I don't see one for Scala, but for Java https://github.com/everit-org/json-schema looks active.)

@pomadchin (Member, Author):
@RickMohr a good point; moreover, our JSON validation should be improved too. Thanks for pointing that out.

Inline review comment on the documentation diff:

--layer nlcd-tms --crs EPSG:3857 --pyramid --layoutScheme {tms | floating}
### Ingest tiles from local fs or hdfs into s3 storage command

`datasets.json`:
@echeipesh (Contributor):

Would it make sense to have input dataset and output config to be in separate files? This way you can have multiple sources and only one destination like --output azavea-datahub.json

@pomadchin (Member, Author):

@echeipesh a very good point

@pomadchin (Member, Author):
@RickMohr do you have any comments? (I decided to use https://github.com/daveclayton/json-schema-validator; it looks like it has a larger community and is actively maintained right now.)
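
For context, a rough sketch of using that validator from Scala, assuming its fge-era Java API (JsonLoader, JsonSchemaFactory); the schema below is a made-up fragment, not one of this PR's schemas, and it also shows the "additionalProperties": false idea suggested later in this thread:

import scala.collection.JavaConverters._
import com.github.fge.jackson.JsonLoader
import com.github.fge.jsonschema.main.JsonSchemaFactory

object SchemaValidationSketch extends App {
  // Made-up schema fragment: only resampleMethod is allowed, anything else is rejected.
  val schemaNode = JsonLoader.fromString(
    """{
      |  "type": "object",
      |  "properties": { "resampleMethod": { "type": "string" } },
      |  "additionalProperties": false
      |}""".stripMargin)

  val schema = JsonSchemaFactory.byDefault().getJsonSchema(schemaNode)

  // A config with the misspelling from the comment above.
  val instance = JsonLoader.fromString("""{ "resampleMetod": "nearest-neighbor" }""")

  val report = schema.validate(instance)
  if (!report.isSuccess) report.asScala.foreach(println) // reports the unexpected property
}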

@pomadchin changed the title from [WIP] ETL Refactor to ETL Refactor on Jul 8, 2016
Inline review comment on the diff:

import spray.json.DefaultJsonProtocol._
import spray.json._

class EtlConf(val credentials: Credentials, val datasets: List[Input], val output: Output) {
Review comment (Contributor):

I think it's a cool idea to be able to have multiple inputs into an ETL job, but EtlJob only takes one input, so it doesn't seem like having a list of them here is used?

@echeipesh (Contributor):
OK, my biggest comment is on the handling of credentials.json. In reality it serves two related purposes: to provide credentials and to provide some backend-specific configuration, and that's great. We just need to make it a lookup table that we can call out to from the input JSON. Currently it calls out the type of the backend as part of the JSON structure, which is kind of strange. Ideally this is a name -> config table like:

{
  "backend-profiles": [
    {
      "name": "accumulo-gis",
      "type": "Accumulo",
      "zookeepers": "zookeepers",
      "instance": "instance",
      "user": "user",
      "password": "password"
    },
    {
      "name": "cassandra-gis",
      "type": "Cassandra"
      "allowRemoteDCsForLocalConsistencyLevel": false,
      "localDc": "datacenter1",
      "usedHostsPerRemoteDc": 0,
      "hosts": "hosts",
      "replicationStrategy": "SimpleStrategy",
      "replicationFactor": 1,
      "user": "user",
      "password": "password"
    }
  ]
}

Now a related problem is that an input needs to be able to refine those configurations by adding some kind of path/table/schema selection and configure an InputPlugin that will actually fetch some bytes. It could look like this:
input.json:

   {
      "name":"nlcd-tms",
      "format":"{geotiff | temporal-geotiff}",
      "backend":{
        "type":"file",
        "profile": "datahub-local",
        "path": "file:///Data/nlcd/tiles"
      },
      "NoData": 0,
      "cache":"NONE"
   }

Here we're going to ask for everything that affects reading the tiles: what format they're in, whether we should cache them, whether we should force a NoData value while reading them (because nobody cares to set it correctly when writing the files), and most importantly the backend to use. The backend JSON object will select the credentials/config profile and add some information that identifies a dataset in there.

I think you could get a little fancy by basically merging the profile with the backend block, which would allow you to shift the configuration from the profile into the input section, but let's say that's not justified here.

Then the final comment is that the output is actually what determines the index and layout used, so those sections need to be moved there:

{
  "backend": {
      "type": "S3",
      "profile": "datahub-s3"
      "prefix": "/catalog"
  },
  "reprojectMethod":"buffered",
  "pyramid":true,
  "resampleMethod":"nearest-neighbor",
  "keyIndexMethod":{
    "type":"zorder"
  },
  "layoutScheme": { // just map the fields directly to zoomed and floating layout scheme objects
     "type": "zoomed",
     "tileSize": 256,
     "crs":"EPSG:3857"
  }
}

Hopefully by dealing with backend profiles this way you can even generate a case class that you can pass to the input module like AccumuloBackend that will have all the fields set. You could parse them by partially reading the JSON object to read the "type" and then using the correct format to read it fully and shove them into a Map[String, Backend] or something like that.

I like the idea of the CustomBackend you have and that fits pretty nicely into this flow as well.
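
A minimal sketch, not this PR's code, of the "peek at the type, then parse fully" approach described above, using spray-json; the profile classes and their field sets are trimmed-down assumptions:

import spray.json._
import spray.json.DefaultJsonProtocol._

object BackendProfileSketch {
  sealed trait BackendProfile { def name: String }
  case class AccumuloProfile(name: String, zookeepers: String, instance: String,
                             user: String, password: String) extends BackendProfile
  case class CassandraProfile(name: String, hosts: String, user: String,
                              password: String) extends BackendProfile

  implicit val accumuloFormat: RootJsonFormat[AccumuloProfile]   = jsonFormat5(AccumuloProfile)
  implicit val cassandraFormat: RootJsonFormat[CassandraProfile] = jsonFormat4(CassandraProfile)

  // Read "backend-profiles", dispatch on the "type" field, and build a name -> profile lookup.
  def readProfiles(json: JsValue): Map[String, BackendProfile] = {
    val profiles = json.asJsObject.fields("backend-profiles") match {
      case JsArray(elements) =>
        elements.map { js =>
          val obj = js.asJsObject
          obj.fields("type") match {
            case JsString("Accumulo")  => obj.convertTo[AccumuloProfile]
            case JsString("Cassandra") => obj.convertTo[CassandraProfile]
            case other                 => deserializationError(s"Unknown backend type: $other")
          }
        }
      case _ => deserializationError("backend-profiles must be an array")
    }
    profiles.map(p => (p.name, p)).toMap
  }
}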

"instance",
"user",
"password"
]
@RickMohr:

Consider adding

"additionalProperties": false

(Explained e.g. at https://spacetelescope.github.io/understanding-json-schema/reference/object.html)

@pomadchin (Member, Author):

@RickMohr probably a description field would be good to add as well, thanks!

@RickMohr commented Jul 8, 2016

Thanks for adding validation!

@lossyrob (Member):

@pomadchin can you update? I think we're ready for this to be merged, yeah?

@pomadchin (Member, Author):

@lossyrob yes, sure

@pomadchin (Member, Author):

Merged master, though I still need to test on one of the demos to see how / whether the HBase modules work.


Inline review comment on the diff:

sealed trait BackendProfile {
val name: String
def `type`: String
Review comment (Contributor):

This can actually be BackendType instead of a string

@echeipesh (Contributor):

Most of the comments in round two are around making a Backend type hierarchy to clean up all those match statements and upgrading Strings to instances of objects they name.

A really cool thing this allows is getting rid of the params altogether. Plugins could get just instances of Backend and BackendProfile and cast them to the type they must be. This would get rid of all the string parsing. The whole params: Map[String, String] is an artifact of parsing command-line arguments, which don't really have the deep structure that JSON does. This is a lot of code changes, so it can probably be done as a round-2 PR into ETL.
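
A rough sketch of that suggested direction, with assumed names (BackendType, AccumuloBackend, and so on), not the PR's actual API:

object BackendSketch {
  // Typed backends instead of params: Map[String, String]; a plugin can narrow the type
  // with a pattern match instead of re-parsing strings.
  sealed trait BackendType { def name: String }
  case object AccumuloType extends BackendType { val name = "accumulo" }
  case object S3Type       extends BackendType { val name = "s3" }
  case object HadoopType   extends BackendType { val name = "hadoop" }

  sealed trait Backend { def `type`: BackendType }
  case class AccumuloBackend(profileName: String, table: String) extends Backend {
    def `type`: BackendType = AccumuloType
  }
  case class S3Backend(bucket: String, prefix: String) extends Backend {
    def `type`: BackendType = S3Type
  }

  // Example: an output plugin that only accepts Accumulo backends.
  def accumuloTable(backend: Backend): String = backend match {
    case AccumuloBackend(_, table) => table
    case other => sys.error(s"Expected an Accumulo backend, got ${other.`type`.name}")
  }
}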


Inline review comment on the diff:

import geotrellis.spark.etl.config._

case class EtlJob(conf: EtlConf) extends Serializable {
Review comment (Contributor):

Once backends generate their own params, what is the purpose of this class? We could just pass EtlConf at that point, but does that lose meaning?

@pomadchin (Member, Author):

Includes a rather important HBase deps fix ("org.codehaus.jackson" % "jackson-core-asl" % "1.8.3" intransitive()).
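
For reference, that intransitive dependency would typically be declared in an sbt build like this (module coordinates as quoted in the comment above):

// build.sbt: bring in jackson-core-asl without its transitive dependencies
libraryDependencies += "org.codehaus.jackson" % "jackson-core-asl" % "1.8.3" intransitive()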

@pomadchin (Member, Author) commented Aug 12, 2016

Without the last dirty fix:

Exception in thread "main" java.lang.ClassCastException: geotrellis.spark.io.index.ZCurveKeyIndexMethod$ cannot be cast to geotrellis.spark.io.index.KeyIndexMethod
    at geotrellis.spark.etl.config.Output.getKeyIndexMethod(Output.scala:45)
    at geotrellis.spark.etl.accumulo.SpatialAccumuloOutput.writer(SpatialAccumuloOutput.scala:13)
    at geotrellis.spark.etl.OutputPlugin$class.apply(OutputPlugin.scala:19)
    at geotrellis.spark.etl.accumulo.SpatialAccumuloOutput.apply(SpatialAccumuloOutput.scala:11)
    at geotrellis.spark.etl.Etl.savePyramid$1(Etl.scala:178)
    at geotrellis.spark.etl.Etl.save(Etl.scala:192)
    at geotrellis.spark.etl.Etl$$anonfun$ingest$1.apply(Etl.scala:47)
    at geotrellis.spark.etl.Etl$$anonfun$ingest$1.apply(Etl.scala:39)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at geotrellis.spark.etl.Etl$.ingest(Etl.scala:39)
    at geotrellis.chatta.ChattaIngest$delayedInit$body.apply(ChattaIngest.scala:15)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
    at scala.App$class.main(App.scala:71)
    at geotrellis.chatta.ChattaIngest$.main(ChattaIngest.scala:12)
    at geotrellis.chatta.ChattaIngest.main(ChattaIngest.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@echeipesh (Contributor):

💯 Thank you for pushing this for so long, looks really good.

@echeipesh echeipesh merged commit 2077839 into locationtech:master Aug 19, 2016
@lossyrob lossyrob added this to the 1.0 milestone Oct 18, 2016