
Improve avro support #691

Merged. 13 commits merged into master on Feb 16, 2024.

Conversation

@RustedBones (Contributor) commented Nov 28, 2023

  • Do not rely on Beam serialization to generate Avro records
  • Add proper support for logical types according to the Avro spec
  • Refactor AvroBigDiffy to leverage type information

Compiled with Avro 1.8; tests run with the latest Avro (Avro 1.8 generates broken code).

@RustedBones changed the title from "Cut beam dep" to "Improve avro support" on Nov 28, 2023
codecov bot commented Nov 28, 2023

Codecov Report

Attention: 36 lines in your changes are missing coverage. Please review.

Comparison is base (51ec798) 71.34% compared to head (dcd413d) 71.09%.

Files Patch % Lines
...om/spotify/ratatool/scalacheck/AvroGenerator.scala 80.23% 17 Missing ⚠️
...n/scala/com/spotify/ratatool/diffy/AvroDiffy.scala 89.02% 9 Missing ⚠️
...in/scala/com/spotify/ratatool/diffy/BigDiffy.scala 50.00% 7 Missing ⚠️
...m/spotify/ratatool/scalacheck/HashMapBuilder.scala 75.00% 1 Missing ⚠️
...spotify/ratatool/scalacheck/HashMapBuildable.scala 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #691      +/-   ##
==========================================
- Coverage   71.34%   71.09%   -0.26%     
==========================================
  Files          41       44       +3     
  Lines        1752     1816      +64     
  Branches      246      291      +45     
==========================================
+ Hits         1250     1291      +41     
- Misses        502      525      +23     
Flag Coverage Δ
ratatoolCli 2.92% <0.00%> (-0.07%) ⬇️
ratatoolCommon 0.00% <ø> (ø)
ratatoolDiffy 32.86% <80.31%> (+1.23%) ⬆️
ratatoolExamples 17.40% <50.50%> (+1.39%) ⬆️
ratatoolSampling 62.36% <79.38%> (+<0.01%) ⬆️
ratatoolScalacheck 78.14% <77.31%> (-3.34%) ⬇️
ratatoolShapeless 4.20% <0.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown.


Comment on lines -204 to -214
Delta("repeated_record.nested_repeated_field", Option(jl(10, 20, 30)), None, UnknownDelta),
Delta("repeated_record.string_field", Option("b"), None, UnknownDelta)
@RustedBones (Contributor, Author) commented Nov 28, 2023

This result does not make sense to me. Since we are keying by field, we should get the same output as the map comparison.
I propose an output that puts the key in the field path.
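
A hypothetical illustration of the proposed output (the bracket syntax for the key is an assumption, not taken from the PR):

// hypothetical: the record key becomes part of the field path, so keyed
// repeated records produce the same shape of output as the map comparison
Delta("repeated_record[a].string_field", Option("b"), None, UnknownDelta)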

Member commented:

I see your point. Can you explain/comment on the source of this change so it is more apparent?

@idreeskhan (Contributor) commented Feb 12, 2024

I'm roughly OK with this change, but I'd want to verify my understanding of what this output looks like with multiple nestings. I think if we are putting it on the field path, we probably want to also recursively pass the field keys, which I don't think is happening here: https://github.com/spotify/ratatool/pull/691/files#diff-c02df8e7364f3ad968c28ac24f66eb4de3e9f15ef79dd7f05f5c05b1ecb98225R112

Though perhaps I'm misreading something here.

import org.scalacheck.util.Buildable

private[scalacheck] object HashMapBuildable {
@RustedBones (Contributor, Author) commented:

Trying to port that upstream in typelevel/scalacheck#1023
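
For context, a minimal sketch of what such a Buildable instance might look like (a sketch under assumed names, targeting Scala 2.13's Builder API; not the PR's exact code):

import org.scalacheck.util.Buildable
import scala.collection.mutable

private[scalacheck] object HashMapBuildable {
  // lets ScalaCheck container generators accumulate key/value pairs
  // directly into a java.util.HashMap
  implicit def buildableHashMap[K, V]: Buildable[(K, V), java.util.HashMap[K, V]] =
    new Buildable[(K, V), java.util.HashMap[K, V]] {
      def builder: mutable.Builder[(K, V), java.util.HashMap[K, V]] =
        new mutable.Builder[(K, V), java.util.HashMap[K, V]] {
          private val map = new java.util.HashMap[K, V]()
          def addOne(elem: (K, V)): this.type = { map.put(elem._1, elem._2); this }
          def clear(): Unit = map.clear()
          def result(): java.util.HashMap[K, V] = map
        }
    }
}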

@benkonz (Contributor) commented Dec 1, 2023

Hi! I'm taking a look! Thanks for the contribution!

@RustedBones RustedBones changed the base branch from master to scio-0.14.0 January 26, 2024 13:38
@RustedBones RustedBones marked this pull request as draft January 26, 2024 13:38
@RustedBones (Contributor, Author) commented:

Updated to the latest Scio. The code now also works with Avro 1.8.2.

Base automatically changed from scio-0.14.0 to master February 5, 2024 08:08
@RustedBones RustedBones marked this pull request as ready for review February 5, 2024 08:15
@monzalo14 (Member) left a comment:

Thanks for the contribution! I'd like to get a couple of things clarified before approving. For some of the changes here, it's a bit hard to tell whether they could break any user code without full context on the recent avro/coder updates. Clarifying that to the extent possible would be ideal 👍

On another topic, did you use a different formatter version than what we currently have in prod? If possible, I'd like to reduce formatting changes to a minimum to keep better track of this, since it is a somewhat big PR.

Comment on lines -35 to -37
new SchemaValidatorBuilder().canReadStrategy
.validateLatest()
.validate(y.getSchema, List(x.getSchema).asJava)
Member commented:

Were there any breaking changes in this implementation, or is there another reason why we're moving away from Avro schema validation? I'm not very familiar with how strict the Avro definition of "compatible schemas" is, but at first glance it seems like we're losing some flexibility and/or some level of detail with the new validations. This is not my area of expertise, though, so your recommendations are more than welcome!

@RustedBones (Contributor, Author) commented:

In terms of diff, Avro schemas must be strictly equal. This is checked at line 62 in the new version.

Compatible schemas are used at read time to adapt stored data to the desired read schema. Once in memory, we should not compare data constructed with different schemas, even if they are compatible.
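
An illustration of that distinction (a sketch, not code from the PR):

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import scala.jdk.CollectionConverters._

// Compatibility is a read-time concern: the datum reader adapts data written
// with the file's writer schema to the requested reader schema. Once in
// memory, every record follows the reader schema, so the diff can insist on
// strict schema equality.
def readWith(file: File, readerSchema: Schema): Iterator[GenericRecord] = {
  val datumReader = new GenericDatumReader[GenericRecord](readerSchema)
  DataFileReader.openReader(file, datumReader).iterator().asScala
}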

Contributor commented:

Can you expand on what you mean? I'm a bit confused reading this thread

@RustedBones (Contributor, Author) commented:

Avro records with different models must not be compared.

Schema compatibility is relevant when reading, making sure the writerSchema and the readerSchema are compatible. Once read, the records strictly follow the readerSchema, where field index matters. Strict schema equality must be ensured before comparing content.

Contributor commented:

OK, trying to confirm my understanding here since I'm still a bit confused.
Dataset A is updated to add a new nullable field x and becomes Dataset A'.
We go to diff these two datasets.
Are you saying that this field comparison will end up in one of the above null cases prior to the schema comparison?

@idreeskhan (Contributor) commented Feb 13, 2024

I would go one step further and say it should necessarily be the A' schema, even if it's nullable/has a default, and cases where a field is missing should fail. Semantically, it's still a difference between the two datasets. IIRC this is the current functionality; it's unclear to me if/where this behaviour is retained.

@RustedBones (Contributor, Author) commented:

The BigDiffy API of this lib is not file-aware. It only works in terms of in-memory records and can't make any assumptions about the writer schema.

It is up to the users creating the SCollection to read using the correct schema.

@RustedBones (Contributor, Author) commented:

Even for the diffAvro API, the reader schema used is the one from the generated class.

It is entirely possible that the underlying files use a different schema; ratatool-diffy will miss those.

@clairemcginty (Contributor) commented Feb 14, 2024

"The BigDiffy API of this lib is not file aware."

cc @RustedBones: AFAIK, BigDiffy is file-aware when run through the CLI, which invokes BigDiffy#run:

val schema = new AvroSampler(rhs, conf = Some(sc.options))
.sample(1, head = true)
.head
.getSchema
implicit val grCoder: Coder[GenericRecord] = avroGenericRecordCoder(schema)
val diffy = new AvroDiffy[GenericRecord](ignore, unordered, unorderedKeys)
val lhsSCollection = sc.avroFile(lhs, schema)
val rhsSCollection = sc.avroFile(rhs, schema)
BigDiffy
.diff[GenericRecord](lhsSCollection, rhsSCollection, diffy, avroKeyFn(keys), ignoreNan)

but even then, it looks like it selects the schema associated with the RHS and uses that for both resulting SCollections. So maybe we could add schema validation there (ensure that RHS schema is equal to, or a superset of, the LHS schema)?

@RustedBones (Contributor, Author) commented:

Added an extra check that prefers the backward-compatible reader schema.
When the schemas are different but both forward and backward compatible, it will print a warning.
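
A sketch of that selection logic (structure and names are assumptions; it mirrors the COMPATIBLE/INCOMPATIBLE match visible in the excerpt further down):

import org.apache.avro.{Schema, SchemaCompatibility}
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.{COMPATIBLE, INCOMPATIBLE}
import org.slf4j.LoggerFactory

object SchemaSelection {
  private val logger = LoggerFactory.getLogger(getClass)

  def selectReaderSchema(lhsSchema: Schema, rhsSchema: Schema, rhs: String): Schema = {
    // can the RHS schema read data written with the LHS schema, and vice versa?
    val rhsReadsLhs = SchemaCompatibility.checkReaderWriterCompatibility(rhsSchema, lhsSchema).getType
    val lhsReadsRhs = SchemaCompatibility.checkReaderWriterCompatibility(lhsSchema, rhsSchema).getType
    (rhsReadsLhs, lhsReadsRhs) match {
      case (COMPATIBLE, COMPATIBLE) =>
        if (lhsSchema != rhsSchema) {
          logger.warn("Avro schemas are compatible, but not equal. Using schema from {}", rhs)
        }
        rhsSchema
      case (COMPATIBLE, INCOMPATIBLE) => rhsSchema // only the RHS schema can read both sides
      case (INCOMPATIBLE, COMPATIBLE) => lhsSchema // only the LHS schema can read both sides
      case _ => throw new IllegalArgumentException("Avro schemas are incompatible")
    }
  }
}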

Comment on lines 59 to 50
case (null, null) => Seq.empty
case (_, null) => Seq(Delta("", Some(x), None, UnknownDelta))
case (null, _) => Seq(Delta("", None, Some(y), UnknownDelta))
case _ if x.getSchema != y.getSchema => Seq(Delta("", Some(x), Some(y), UnknownDelta))
case _ => diff(x, y, x.getSchema, "")
}
Member commented:

Like this refactor! 👍

Comment on lines +100 to +90
val a = x.asInstanceOf[IndexedRecord]
val b = y.asInstanceOf[IndexedRecord]
Member commented:

I like this! Seems more resource-efficient. Is there any case in which this cast could fail, though, so that we should add a catch statement?

@RustedBones (Contributor, Author) commented:

If the schema type is a record, IndexedRecord is the least powerful abstraction we need to check equality. We were previously using GenericRecord, which extends IndexedRecord, but equality can be done on field order alone.
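
A sketch of the positional comparison that IndexedRecord enables (an assumption about shape, not the PR's exact code):

import org.apache.avro.generic.IndexedRecord

// once the two schemas are known to be equal, fields can be compared by
// position through IndexedRecord#get, the weakest interface that supports it
def positionallyEqual(a: IndexedRecord, b: IndexedRecord): Boolean = {
  val numFields = a.getSchema.getFields.size
  (0 until numFields).forall(i => a.get(i) == b.get(i))
}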

Comment on lines +31 to +32
import java.nio.ByteBuffer
import scala.jdk.CollectionConverters._
Member commented:

Have you run sbt scalafmt? I'm surprised there are so many formatting changes. Are you running a different formatter version?

@RustedBones (Contributor, Author) commented:

It is probably my IntelliJ import organizer. I don't think we have a scalafmt rule for import order.
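
For reference, a hypothetical .scalafmt.conf addition that would make import order deterministic (assumes scalafmt 3.x; not part of this repo's current config):

# sort import statements so IDE reorders stop showing up in diffs
rewrite.rules = [Imports]
rewrite.imports.sort = ascii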

)
}

it should "support schema evolution if ignored" in {
Member commented:

Could you clarify how the rest of the tests are covering this test case?

@RustedBones (Contributor, Author) commented:

It is not covered. Records with different schemas are not equal, as explained above.

private def getRawType(schema: Schema): Schema = {
schema.getType match {
private def numericValue(value: AnyRef): Double = value match {
case i: java.lang.Integer => i.toDouble
Contributor commented:

Why .toDouble here?

@RustedBones (Contributor, Author) commented:

Because numericDelta only supports Double.
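
A hypothetical completion of the numericValue helper shown above: every Avro numeric representation is widened to Double, because the numeric delta computation works on Double only.

private def numericValue(value: AnyRef): Double = value match {
  case i: java.lang.Integer => i.toDouble
  case l: java.lang.Long    => l.toDouble
  case f: java.lang.Float   => f.toDouble
  case d: java.lang.Double  => d
  case other =>
    throw new IllegalArgumentException(s"Unsupported numeric type: ${other.getClass}")
}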

logger.warn("Avro schemas are compatible, but not equal. Using schema from {}", rhs)
}
rhsSchema
case (COMPATIBLE, INCOMPATIBLE) =>
Contributor commented:

This is a change in underlying functionality; IMO we should also warn in these cases rather than proceed transparently.

Contributor commented:

Honestly, I still think we should enforce RHS backwards compatibility unless otherwise shown to be necessary, but if we are changing functionality/flexibility then we need to do so in a way that is transparent to users.

I'll leave the actual decision here to the current members of the owning team.

@RustedBones (Contributor, Author) commented:

Reverted to the previous behavior using the SchemaValidatorBuilder, which throws a SchemaValidationException with a detailed error in case of incompatibility.

@benkonz merged commit 3eeebfd into master on Feb 16, 2024. 1 check passed.