ETL Refactor #1553
Conversation
A complete JSON example:
[
{
"name":"test",
"ingestType":{
"format":"geotiff",
"inputCredentials":"inputCredentials name",
"output":"hadoop",
"outputCredentials":"outputCredentials name",
"input":"hadoop"
},
"path":{
"input":"input",
"output":"output"
},
"cache":"NONE",
"ingestOptions":{
"breaks":"0:ffffe5ff;0.1:f7fcb9ff;0.2:d9f0a3ff;0.3:addd8eff;0.4:78c679ff;0.5:41ab5dff;0.6:238443ff;0.7:006837ff;1:004529ff",
"reprojectMethod":"buffered",
"cellSize":{
"width":256.0,
"height":256.0
},
"encoding":"geotiff",
"tileSize":256,
"layoutExtent":{
"xmin":1.0,
"ymin":2.0,
"xmax":3.0,
"ymax":4.0
},
"resolutionThreshold":0.1,
"pyramid":true,
"resampleMethod":"nearest-neighbor",
"keyIndexMethod":{
"type":"zorder"
},
"layoutScheme":"tms",
"cellType":"int8",
"crs":"+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs "
}
}
]
And the backend credentials:
{
"accumulo": [{
"name": "name",
"zookeepers": "zookeepers",
"instance": "instance",
"user": "user",
"password": "password"
}],
"cassandra": [{
"name": "name",
"allowRemoteDCsForLocalConsistencyLevel": false,
"localDc": "datacenter1",
"usedHostsPerRemoteDc": 0,
"hosts": "hosts",
"replicationStrategy": "SimpleStrategy",
"replicationFactor": 1,
"user": "user",
"password": "password"
}],
"s3": [],
"hadoop": []
}
Does this allow for additional backends to be supplied via code in client projects?
…t tags to read, updated docs
Should include #1494.
Will there be validation for the input JSON? If I mis-spell a field name, will that be caught? (There are JSON schema validators for most languages, listed at http://json-schema.org/implementations.html. I don't see one for Scala, but for Java https://github.com/everit-org/json-schema looks active.)
@RickMohr A good point; moreover, our JSON validation should be improved too. Thanks for the notice.
--layer nlcd-tms --crs EPSG:3857 --pyramid --layoutScheme {tms | floating}

### Ingest tiles from local fs or hdfs into s3 storage command

`datasets.json`:
Would it make sense to have the input dataset and output config in separate files? This way you can have multiple sources and only one destination, like --output azavea-datahub.json.
@echeipesh a very good point
@RickMohr have you got any comments? (I decided to use https://github.com/daveclayton/json-schema-validator; it looks like it has a larger community and is actively supported right now.)
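For reference, a minimal sketch of how that library could be wired in, assuming hypothetical file names and an illustrative wrapper object (this is not the PR's actual validation code):

```scala
import com.github.fge.jackson.JsonLoader
import com.github.fge.jsonschema.main.JsonSchemaFactory
import scala.collection.JavaConverters._

// Hypothetical entry point; "datasets-schema.json" and "datasets.json" are placeholder paths.
object ValidateDatasets extends App {
  val schema   = JsonLoader.fromPath("datasets-schema.json")
  val instance = JsonLoader.fromPath("datasets.json")

  // Run the instance through the schema and collect a report of all violations.
  val report = JsonSchemaFactory.byDefault().getValidator.validate(schema, instance)

  if (!report.isSuccess) {
    // Print every validation message, e.g. for a mis-spelled or missing field.
    report.iterator.asScala.foreach(msg => println(msg.getMessage))
    sys.exit(1)
  }
}
```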
import spray.json.DefaultJsonProtocol._
import spray.json._

class EtlConf(val credentials: Credentials, val datasets: List[Input], val output: Output) {
I think it's a cool idea to be able to have multiple inputs into an ETL job, but EtlJob only takes one input, so it doesn't seem like having a list of them here is used?
Ok, my biggest comment is on the handling of backend profiles:

{
"backend-profiles": [
{
"name": "accumulo-gis",
"type": "Accumulo",
"zookeepers": "zookeepers",
"instance": "instance",
"user": "user",
"password": "password"
},
{
"name": "cassandra-gis",
"type": "Cassandra"
"allowRemoteDCsForLocalConsistencyLevel": false,
"localDc": "datacenter1",
"usedHostsPerRemoteDc": 0,
"hosts": "hosts",
"replicationStrategy": "SimpleStrategy",
"replicationFactor": 1,
"user": "user",
"password": "password"
}
]
}

Now a related problem is that an input needs to be able to refine those configurations by adding some kind of path/table/schema selection and configure an InputPlugin that will actually fetch me some bytes. It can look like:

{
"name":"nlcd-tms",
"format":"{geotiff | temporal-geotiff}",
"backend":{
"type":"file"
"profile": "datahub-local",
"path": "file:///Data/nlcd/tiles",
},
"NoData": 0,
"cache":"NONE",
}

Here we're going to ask for everything that affects reading the tiles: what format they're in, if we should cache them, if we should force a NoData value while reading them (because nobody cares to set it correctly when writing the files) and, most importantly, the backend to read from.

I think you could get a little fancy by basically merging the profile with the backend section of the input.

Then the final comment is that the output is actually what determines the index and layout used, so those sections need to be moved there:

{
"backend": {
"type": "S3",
"profile": "datahub-s3"
"prefix": "/catalog"
},
"reprojectMethod":"buffered",
"pyramid":true,
"resampleMethod":"nearest-neighbor",
"keyIndexMethod":{
"type":"zorder"
},
"layoutScheme": { // just map the fields directly to zoomed and floating layout scheme objects
"type": "zoomed",
"tileSize": 256,
"crs":"EPSG:3857"
}
}

Hopefully by dealing with backend profiles this way you can even generate a case class that you can pass to the input module.

I like the idea of the CustomBackend you have, and that fits pretty nicely into this flow as well.
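As a rough illustration of that last point, here is a minimal sketch, with made-up profile shapes and object names (not the PR's actual types), of parsing the `backend-profiles` JSON above into case classes with spray-json:

```scala
import spray.json._
import spray.json.DefaultJsonProtocol._

sealed trait BackendProfile { def name: String }

case class AccumuloProfile(
  name: String, zookeepers: String, instance: String, user: String, password: String
) extends BackendProfile

case class CassandraProfile(
  name: String, hosts: String, user: String, password: String
) extends BackendProfile

object BackendProfileJson {
  implicit val accumuloFormat: RootJsonFormat[AccumuloProfile]   = jsonFormat5(AccumuloProfile)
  implicit val cassandraFormat: RootJsonFormat[CassandraProfile] = jsonFormat4(CassandraProfile)

  // Dispatch on the "type" discriminator to pick the concrete case class.
  implicit object backendProfileFormat extends RootJsonFormat[BackendProfile] {
    def read(js: JsValue): BackendProfile =
      js.asJsObject.fields.get("type") match {
        case Some(JsString("Accumulo"))  => js.convertTo[AccumuloProfile]
        case Some(JsString("Cassandra")) => js.convertTo[CassandraProfile]
        case other                       => deserializationError(s"Unknown backend type: $other")
      }

    def write(p: BackendProfile): JsValue = p match {
      case a: AccumuloProfile  => a.toJson
      case c: CassandraProfile => c.toJson
    }
  }

  // Pull the "backend-profiles" array out of a config document.
  def fromConfig(json: String): List[BackendProfile] =
    json.parseJson.asJsObject.fields("backend-profiles").convertTo[List[BackendProfile]]
}
```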
"instance", | ||
"user", | ||
"password" | ||
] |
Consider adding `"additionalProperties": false` (explained e.g. at https://spacetelescope.github.io/understanding-json-schema/reference/object.html).
@RickMohr probably a description field would be good to add as well, thanks!
Thanks for adding validation!
@pomadchin can you update? I think we're ready for this to be merged, yeah?
@lossyrob yes, sure
Merged master, though I still need to test on a demo how / whether the HBase modules work.
sealed trait BackendProfile {
  val name: String
  def `type`: String
This can actually be `BackendType` instead of a string.
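A minimal sketch of that suggestion; the concrete members and the fromName helper are assumptions for illustration, not the PR's actual API:

```scala
// Hypothetical ADT; the members are assumptions, not the PR's actual BackendType.
sealed trait BackendType { def name: String }
case object AccumuloType  extends BackendType { val name = "accumulo" }
case object CassandraType extends BackendType { val name = "cassandra" }
case object HadoopType    extends BackendType { val name = "hadoop" }
case object S3Type        extends BackendType { val name = "s3" }

object BackendType {
  // Fail fast on an unknown string instead of carrying it around untyped.
  def fromName(s: String): BackendType = s.toLowerCase match {
    case "accumulo"  => AccumuloType
    case "cassandra" => CassandraType
    case "hadoop"    => HadoopType
    case "s3"        => S3Type
    case other       => throw new IllegalArgumentException(s"Unknown backend type: $other")
  }
}

sealed trait BackendProfile {
  val name: String
  def `type`: BackendType // typed instead of String
}
```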
Most of the comments in round two are around making a … A really cool thing this allows is getting rid of the …
import geotrellis.spark.etl.config._

case class EtlJob(conf: EtlConf) extends Serializable {
Once backends generate their own params, what is the purpose of this class? We could just pass EtlConf at that point, but does that lose meaning?
Includes a rather important HBase deps fix.
Without the last dirty fix:

Exception in thread "main" java.lang.ClassCastException: geotrellis.spark.io.index.ZCurveKeyIndexMethod$ cannot be cast to geotrellis.spark.io.index.KeyIndexMethod
at geotrellis.spark.etl.config.Output.getKeyIndexMethod(Output.scala:45)
at geotrellis.spark.etl.accumulo.SpatialAccumuloOutput.writer(SpatialAccumuloOutput.scala:13)
at geotrellis.spark.etl.OutputPlugin$class.apply(OutputPlugin.scala:19)
at geotrellis.spark.etl.accumulo.SpatialAccumuloOutput.apply(SpatialAccumuloOutput.scala:11)
at geotrellis.spark.etl.Etl.savePyramid$1(Etl.scala:178)
at geotrellis.spark.etl.Etl.save(Etl.scala:192)
at geotrellis.spark.etl.Etl$$anonfun$ingest$1.apply(Etl.scala:47)
at geotrellis.spark.etl.Etl$$anonfun$ingest$1.apply(Etl.scala:39)
at scala.collection.immutable.List.foreach(List.scala:318)
at geotrellis.spark.etl.Etl$.ingest(Etl.scala:39)
at geotrellis.chatta.ChattaIngest$delayedInit$body.apply(ChattaIngest.scala:15)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at geotrellis.chatta.ChattaIngest$.main(ChattaIngest.scala:12)
at geotrellis.chatta.ChattaIngest.main(ChattaIngest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
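For context, a standalone sketch (names borrowed only for illustration, not GeoTrellis code) of the kind of cast that produces this exception: a companion object is a factory for instances, not itself an instance of the trait, so casting it fails at runtime.

```scala
trait KeyIndexMethod

class ZCurveKeyIndexMethod extends KeyIndexMethod

// The companion object (the ZCurveKeyIndexMethod$ in the stack trace) is a factory;
// it does not itself extend KeyIndexMethod.
object ZCurveKeyIndexMethod {
  def apply(): ZCurveKeyIndexMethod = new ZCurveKeyIndexMethod
}

object Demo extends App {
  val ok: KeyIndexMethod = ZCurveKeyIndexMethod()              // fine: an instance of the trait
  // Casting the companion object itself fails at runtime with the exception above:
  val bad = ZCurveKeyIndexMethod.asInstanceOf[KeyIndexMethod]
}
```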
💯 Thank you for pushing this for so long, looks really good.
Move ETL to use json configuration