Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile each schema only once #198

Merged
merged 10 commits into from
Jan 1, 2025

Conversation

mkljakubowski
Copy link
Contributor

@mkljakubowski mkljakubowski commented Dec 19, 2024

#199 I have a big avro library that I'm compiling with avrohugger. It has been great, but the performance is a bit lacking. On the CI it can take a few hours and I have done some experiments and I was able to get it down to 20 mins (around 5% of the original compilation time). I'd like to contribute my improvements, but to make sure that I don't break anything I need a stable environment. I'm running windows and unfortunately none of the tests pass.

}

implicit class LineEndingAmbiguousMatcherString(s: String) extends ExpectationsCreation{
def ===/(other: String): MatchResult[String] = createExpectable(s).applyMatcher(new LineEndingAmbiguousMatcher(other))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test resource files contain LF endings and they are generated using system line endings. In case of windows CRLF, which breaks all the tests. This should compare files regardless of the line endings.

val contents: String = scala.io.Source.fromFile(fileName).mkString
val source = scala.io.Source.fromFile(fileName)
val contents: String = source.mkString
source.close()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stream wasn't closed and in this a rec function, which breaks on the 2nd execution

@julianpeeters
Copy link
Owner

I'd like to contribute my improvements in a few PRs

Ok, sounds great. Thanks very much for the contributions.

main/trunk is kept up-to-date with what's published, so I can offer the following options:

  1. I can publish milestones for each round/PR, especially if you've got multiple ready to go. E.g., 2.10.0-M1
  2. Or I can simply publish 2.10.0, if you prefer a real release sooner, and more PRs delayed

@mkljakubowski
Copy link
Contributor Author

It is up to you. I don't have anything ready, it will take me a bit of time to make all of it review worthy. I don't plan to break the API, at most extend it a little. Would you rather I push everything here and create one giant PR or a set of smaller ones?

restrictedFields: Boolean,
targetScalaPartialVersion: String): Unit = {
inFiles.flatMap(fileParser.getSchemaOrProtocols(_, format, classStore, classLoader))
.distinct
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the biggest change. This compiles each avro schema only once for a set a given input source files.
Doing this speeds up compilation by 65% in my case, but requires also a small change in build plugins (sbt-avrohugger etc)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.distinct

In the scripted tests in sbt-avrohugger, distinct is throwing an exception when the schema has no namespace (java classes, wreckless with null)

[info] [info] Compiling Avro IDL to /private/var/folders/1m/4vqhbccd6l7dkz3t8q766fdr0000gn/T/sbt_dc4d7335/target/scala-2.12/src_managed/main/compiled_avro
[info] [error] java.lang.NullPointerException: Cannot invoke "String.hashCode()" because "this.namespace" is null
[info] [error]  at org.apache.avro.Protocol.hashCode(Protocol.java:400)
[info] [error]  at scala.runtime.Statics.anyHash(Statics.java:122)
[info] [error]  at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:76)
[info] [error]  at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:246)
[info] [error]  at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:167)
[info] [error]  at scala.util.Right.hashCode(Either.scala:438)
[info] [error]  at scala.collection.mutable.FlatHashTable.addEntry(FlatHashTable.scala:153)
[info] [error]  at scala.collection.mutable.FlatHashTable.addEntry$(FlatHashTable.scala:152)
[info] [error]  at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
[info] [error]  at scala.collection.mutable.FlatHashTable.addElem(FlatHashTable.scala:144)
[info] [error]  at scala.collection.mutable.FlatHashTable.addElem$(FlatHashTable.scala:143)
[info] [error]  at scala.collection.mutable.HashSet.addElem(HashSet.scala:41)
[info] [error]  at scala.collection.mutable.HashSet.add(HashSet.scala:66)
[info] [error]  at scala.collection.SeqLike.distinct(SeqLike.scala:519)
[info] [error]  at scala.collection.SeqLike.distinct$(SeqLike.scala:509)
[info] [error]  at scala.collection.AbstractSeq.distinct(Seq.scala:45)
[info] [error]  at avrohugger.generators.FileGenerator.filesToFiles(FileGenerator.scala:116)
[info] [error]  at avrohugger.Generator.filesToFiles(Generator.scala:98)
[info] [error]  at sbtavrohugger.FileWriter$.generateCaseClasses(FileWriter.scala:40)
[info] [error]  at sbtavrohugger.SbtAvrohugger$.$anonfun$avroSettings$8(SbtAvrohugger.scala:82)
[info] [error]  at sbt.util.FileFunction$.$anonfun$cached$1(FileFunction.scala:81)
[info] [error]  at sbt.util.FileFunction$.$anonfun$cached$4(FileFunction.scala:154)
[info] [error]  at sbt.util.Difference.apply(Tracked.scala:415)
[info] [error]  at sbt.util.Difference.apply(Tracked.scala:395)
[info] [error]  at sbt.util.FileFunction$.$anonfun$cached$3(FileFunction.scala:150)
[info] [error]  at sbt.util.Difference.apply(Tracked.scala:415)
[info] [error]  at sbt.util.Difference.apply(Tracked.scala:390)
[info] [error]  at sbt.util.FileFunction$.$anonfun$cached$2(FileFunction.scala:149)
[info] [error]  at sbtavrohugger.SbtAvrohugger$.$anonfun$avroSettings$7(SbtAvrohugger.scala:84)
[info] [error]  at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[info] [error]  at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[info] [error]  at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[info] [error]  at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[info] [error]  at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[info] [error]  at sbt.Execute.work(Execute.scala:292)
[info] [error]  at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[info] [error]  at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[info] [error]  at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[info] [error]  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
[info] [error]  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
[info] [error]  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
[info] [error]  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
[info] [error]  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
[info] [error]  at java.base/java.lang.Thread.run(Thread.java:1583)
[info] [error] (Compile / avroScalaGenerate) java.lang.NullPointerException: Cannot invoke "String.hashCode()" because "this.namespace" is null
[info] [error] Total time: 1 s, completed Dec 21, 2024, 1:18:10 PM
[error] x avrohugger/GenericSerializationTests 

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using distinctBy, but that appeared in scala 2.13, while avrohugger needs to publish against 2.12 (in order to support the sbt plugin). the following is basically copy-paste from stack overflow, you may prefer a different solution.

      def distinctByFullName(list: List[Either[Schema, Protocol]]): List[Either[Schema, Protocol]] =
        list.filterNot{
          var set = Set[String]()
          e =>
            val property = e match {
              case Left(value) => value.getFullName()
              case Right(value) => Option(value.getNamespace).fold(value.getName())(ns => s"${ns}.${value.getName()}")
            } 
            val b = set(property);
            set += property;
            b
        }

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the sbt-plugin, I think the needed changes will be fine. E.g., Filewriter can be like, etc.

...
    // for (inFile <- AvscFileSorter.sortSchemaFiles(getSrcFiles(srcDirs, "avsc"))) {
    //   log.info("Compiling AVSC %s to %s".format(inFile, target.getPath))
    //   generator.fileToFile(inFile, target.getPath)
    // }

    val avscFiles: Seq[File] = getSrcFiles(srcDirs, "avsc")
    if (avscFiles.isEmpty) ()
    else
      log.info("Compiling AVSC to %s".format(target.getPath))
      generator.filesToFiles(AvscFileSorter.sortSchemaFiles(avscFiles).toList, target.getPath)
 ...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if these snippets aren't setbacks to your performance improvements, then I can publish this round as version 2.10.0, and we can build from there at our convenience.

Copy link
Contributor Author

@mkljakubowski mkljakubowski Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check improvements in new commits.
We have to preserve order of compilation, so we can't turn the list into a set. We can only use it to lookup if a schema was already processed.
I retested with my avro libs and it is still as fast and works as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

julianpeeters/sbt-avrohugger#105 and here is matching sbt-avrohugger PR in case you decide to publish 2.10.0

@mkljakubowski
Copy link
Contributor Author

This is as far I would like to go in first round of improvements. It already makes a huge difference in my build pipeline. Please review and let me know how you would like to proceed.

@mkljakubowski mkljakubowski changed the title Fix tests on windows Compile each schema only once Dec 20, 2024
@julianpeeters
Copy link
Owner

Looking good. A couple issues, but, once resolved, this can be 2.10.0.

@julianpeeters julianpeeters merged commit 8af3cc4 into julianpeeters:main Jan 1, 2025
6 checks passed
@julianpeeters
Copy link
Owner

Good stuff, thanks for the speed-ups. avrohugger 2.10.0 and sbt-avrohugger 2.10.0 are available now on maven central.

@mkljakubowski mkljakubowski deleted the speedup-round1 branch January 15, 2025 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants