-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve avro support #691
Improve avro support #691
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #691 +/- ##
==========================================
- Coverage 71.34% 71.09% -0.26%
==========================================
Files 41 44 +3
Lines 1752 1816 +64
Branches 246 291 +45
==========================================
+ Hits 1250 1291 +41
- Misses 502 525 +23
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
4955b36
to
2cd2a64
Compare
Delta("repeated_record.nested_repeated_field", Option(jl(10, 20, 30)), None, UnknownDelta), | ||
Delta("repeated_record.string_field", Option("b"), None, UnknownDelta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This result does not make sense to me. As we are keying by field, we should get the same output as map comparison.
I propose an output that puts the key in the field path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. Can you explain/comment on the source of this change so it is more apparent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm roughly OK with this change but I'd want to verify my understanding of how this output looks like with multiple nestings. I think if we are putting it on the field path we probably want to also recursively pass the field keys which I don't think is happening here https://github.com/spotify/ratatool/pull/691/files#diff-c02df8e7364f3ad968c28ac24f66eb4de3e9f15ef79dd7f05f5c05b1ecb98225R112
Though perhaps I'm misreading something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key is altered wen presenting the diff result here: https://github.com/spotify/ratatool/pull/691/files#diff-c02df8e7364f3ad968c28ac24f66eb4de3e9f15ef79dd7f05f5c05b1ecb98225R115
|
||
import org.scalacheck.util.Buildable | ||
|
||
private[scalacheck] object HashMapBuildable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to port that upstream in typelevel/scalacheck#1023
Hi! I'm taking a look! Thanks for the contribution! |
2cd2a64
to
0cb8f51
Compare
Updated to latest scio. Code now also works for avro 1.8.2 |
ratatool-scalacheck/src/main/scala-2.12/com/spotify/ratatool/scalacheck/HashMapBuilder.scala
Outdated
Show resolved
Hide resolved
ratatool-scalacheck/src/main/scala-2.13/com/spotify/ratatool/scalacheck/HashMapBuilder.scala
Outdated
Show resolved
Hide resolved
0cb8f51
to
d0cb104
Compare
ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/AvroDiffy.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! I'd like to get a couple things clarified before approving. For some of the changes here, it's a bit hard to understand whether these could break any user code or not, without having full context on the recent avro/coder updates. Clarifying that to the extent possible would be ideal 👍
On another topic, did you use a different formatter version than what we currently have in prod? If possible, I'd like to reduce formatting changes to a minimum to keep better track of this, since it is a somewhat big PR.
new SchemaValidatorBuilder().canReadStrategy | ||
.validateLatest() | ||
.validate(y.getSchema, List(x.getSchema).asJava) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were there any breaking changes on this implementation, or is there any other reason why we're moving away from avro schema validation? I'm not very familiar with how strict is the avro definition of "compatible schemas", but at first glance it seems like we're loosing some flexibility and/or some level of detail with the new validations. This is not my area of expertise, though, so your recommendations are more than welcome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of diff, avro schemas must be strictly equal. This is checked line 62 in the new version.
Compatible schema are used on read time to adapt stored data to the desired read schema. Once in memory, we should not compare data constructed with different schema, even if compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on what you mean? I'm a bit confused reading this thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avro record with different models must not be compared.
schema compatibility is relevant when reading, making sure the writerSchema
and the readerSchema
are compatible. Once read, the records strictly folow the readerSchema
where field index matters. Strict schema equality must be ensured before comparing content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, trying to confirm my understanding here since I'm still a bit confused.
Dataset A is updated to add new nullable field x
and becomes Dataset A'.
We go to diff these two datasets.
Are you saying that this field comparison will end up in one of the above null cases prior to the schema comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would go one step further and say it should necessarily be the A' schema, even if it's nullable/has default, and cases where a field is missing should fail. Semantically, it's still a difference between the two datasets. IIRC this is the current functionality. It's unclear to me if/where this behaviour is retained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BigDiffy API of this lib is not file aware. It only works in terms of in-memory records and can't make any assumption on writer schema.
It is up to the users creating the SCollection
to read using the correct schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even for the diffAvro API, the reader schema used is the one from the generated class.
It is totally possible that underlying files are using a different schema, ratatool-diffy
will miss those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BigDiffy API of this lib is not file aware.
cc @RustedBones, AFAIK, BigDiffy is file-aware when run through the CLI, which invokes BigDiffy#run:
ratatool/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala
Lines 751 to 760 in 51ec798
val schema = new AvroSampler(rhs, conf = Some(sc.options)) | |
.sample(1, head = true) | |
.head | |
.getSchema | |
implicit val grCoder: Coder[GenericRecord] = avroGenericRecordCoder(schema) | |
val diffy = new AvroDiffy[GenericRecord](ignore, unordered, unorderedKeys) | |
val lhsSCollection = sc.avroFile(lhs, schema) | |
val rhsSCollection = sc.avroFile(rhs, schema) | |
BigDiffy | |
.diff[GenericRecord](lhsSCollection, rhsSCollection, diffy, avroKeyFn(keys), ignoreNan) |
but even then, it looks like it selects the schema associated with the RHS and uses that for both resulting SCollections. So maybe we could add schema validation there (ensure that RHS schema is equal to, or a superset of, the LHS schema)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an extra check that will prefer backward compatible reader schema.
When schemas are different, but both forward and backward compatible, will print a warning
ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/AvroDiffy.scala
Outdated
Show resolved
Hide resolved
case (null, null) => Seq.empty | ||
case (_, null) => Seq(Delta("", Some(x), None, UnknownDelta)) | ||
case (null, _) => Seq(Delta("", None, Some(y), UnknownDelta)) | ||
case _ if x.getSchema != y.getSchema => Seq(Delta("", Some(x), Some(y), UnknownDelta)) | ||
case _ => diff(x, y, x.getSchema, "") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like this refactor! 👍
val a = x.asInstanceOf[IndexedRecord] | ||
val b = y.asInstanceOf[IndexedRecord] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this! Seems more resource-efficient. Is there any case in which this cast could not work, though, so to add in some catch statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if schema type is a record, IndexedRecord
is the least powerful abstraction we need to check equality. We were previously using GenericRecord
that extends IndexedRecord
, but equality can be done on field order.
import java.nio.ByteBuffer | ||
import scala.jdk.CollectionConverters._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you run sbt scalaFmt
? I'm surprised there's so many formatting changes. Are you running a different formatter version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is prob my intelliJ organize import. I don't think we have a scalafmt rule for import order.
Delta("repeated_record.nested_repeated_field", Option(jl(10, 20, 30)), None, UnknownDelta), | ||
Delta("repeated_record.string_field", Option("b"), None, UnknownDelta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. Can you explain/comment on the source of this change so it is more apparent?
) | ||
} | ||
|
||
it should "support schema evolution if ignored" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify how the rest of the tests are covering this test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not covered. records with different schemas are not equal, as explained above.
ratatool-scalacheck/src/test/scala/com/spotify/ratatool/scalacheck/AvroGeneratorTest.scala
Show resolved
Hide resolved
private def getRawType(schema: Schema): Schema = { | ||
schema.getType match { | ||
private def numericValue(value: AnyRef): Double = value match { | ||
case i: java.lang.Integer => i.toDouble |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why .toDouble
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because numericDelta only supports double
Remove the need of data serialization to generate avro specific record. Enable logical-type conversion as described in the avro specification.
dfcc526
to
1dd677b
Compare
5b49e78
to
6bbf8da
Compare
logger.warn("Avro schemas are compatible, but not equal. Using schema from {}", rhs) | ||
} | ||
rhsSchema | ||
case (COMPATIBLE, INCOMPATIBLE) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in underlying functionality, IMO We should also warn in these cases rather than proceed transparently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I still think we enforce RHS backwards compatibility unless otherwise shown to be necessary, but if we are changing functionality/flexibility then we need to do so in a way that is transparent to users.
I'll leave the actual decision here down to current members of the owning team
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted to previous behavior using the SchemaValidatorBuilder
that thows a SchemaValidationException
with detailed error in case of incompatibility.
8e66c03
to
dcd413d
Compare
Compile with avro 1.8 and run test with latest avro (avro 1.8 generates broken code)