Improve performance of Analysis deserialization. #984
Comments
This test evaluates how much of the analysis file's on-disk size is consumed by the used names, and how much of the loading time is attributable to that data.
package sbt.inc.binary
import sbt.internal.inc.FileAnalysisStore
import xsbti.compile.AnalysisContents

import java.io.File
import java.nio.file.{ Files, Path }
import java.util.concurrent.TimeUnit

object UsedNameTest {
  def main(args: Array[String]): Unit = {
    // Load the existing analysis produced by compiling scala/scala.
    val inFile = new File("/Users/jz/code/scala/target/library/zinc/inc_compile.zip")
    val store = FileAnalysisStore.binary(inFile)
    val contents = store.get().get()
    val analysis = contents.getAnalysis.asInstanceOf[sbt.internal.inc.Analysis]

    // Write a copy of the analysis with the used names stripped out.
    val outFile = new File("/tmp/inc_compile.zip")
    val outStore = FileAnalysisStore.binary(outFile)
    outStore.set(
      AnalysisContents.create(
        analysis.copy(relations = analysis.relations.withoutUsedNames),
        contents.getMiniSetup
      )
    )

    // Compare on-disk sizes with and without used names.
    printSize(inFile.toPath)
    printSize(outFile.toPath)

    def timedLoad(path: Path): Unit = {
      val now = System.nanoTime()
      FileAnalysisStore.binary(path.toFile).get.get().getAnalysis
      val end = System.nanoTime()
      println(path + " loaded in " + TimeUnit.NANOSECONDS.toMillis(end - now) + "ms")
    }

    // Load each file repeatedly so that later iterations reflect warmed-up timings.
    (1 to 32).foreach(_ => timedLoad(inFile.toPath))
    (1 to 32).foreach(_ => timedLoad(outFile.toPath))
  }

  private def printSize(path: Path): Unit = {
    println(path + " size = " + Files.size(path))
  }
}
This shows that the used names account for about 15-20% of the on-disk size (a bit less than I guesstimated) and 30-35% of the load time.
We have been looking at ways to improve the performance of analysis deserialization. In large builds this can be a major factor in the responsiveness of the build when Zinc determines that little or no actual compilation is needed.
A quick way to profile this is to check out and compile scala/scala, then run the following:
Attach async-profiler:
Opportunities
Inefficient creation of Array
javaList.asScala.iterator.map(f).toArray forgoes the chance to allocate the resulting array with the correct size and needs an intermediate buffer. Try something like:
Dead code creates wasted Vector
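Going back to the Array point above, here is a minimal sketch of what a presized conversion could look like. The helper name mapToArray is hypothetical and not part of Zinc; it only illustrates allocating the result once at its final size instead of going through an intermediate buffer.

```scala
import java.util.{ List => JList }
import scala.reflect.ClassTag

object PresizedArray {
  // Hypothetical helper (not Zinc API): maps a Java list into an array that
  // is allocated once at its final size, without an intermediate buffer.
  def mapToArray[A, B: ClassTag](javaList: JList[A])(f: A => B): Array[B] = {
    val result = new Array[B](javaList.size)
    val it = javaList.iterator()
    var i = 0
    while (it.hasNext) {
      result(i) = f(it.next())
      i += 1
    }
    result
  }
}
```

A call such as PresizedArray.mapToArray(javaList)(f) would then stand in for javaList.asScala.iterator.map(f).toArray.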
String interning isn't free
Interning likely-duplicated strings as we read them from the protobuf trades higher CPU usage for a lower memory footprint. Maybe it is worth thinking about introducing a name table at the start of the protobuf and replacing fields of type string with indices into this table?
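To make the name table idea concrete, here is a rough sketch in plain Scala, independent of protobuf. The names NameTable, Encoded, encode and decode are all hypothetical. Each distinct string is stored once, and every occurrence becomes an index into the table.

```scala
object NameTable {
  // Hypothetical encoded form: each distinct name is stored once in `table`,
  // and every occurrence is replaced by its index into that table.
  final case class Encoded(table: Array[String], indices: Array[Int])

  def encode(names: Seq[String]): Encoded = {
    // LinkedHashMap keeps insertion order, so key order matches the assigned indices.
    val indexOf = scala.collection.mutable.LinkedHashMap.empty[String, Int]
    val indices = new Array[Int](names.length)
    var i = 0
    names.foreach { name =>
      indices(i) = indexOf.getOrElseUpdate(name, indexOf.size)
      i += 1
    }
    Encoded(indexOf.keysIterator.toArray, indices)
  }

  // Decoding is an array lookup; repeated names share one String instance,
  // so no call to String.intern is needed.
  def decode(encoded: Encoded): Array[String] =
    encoded.indices.map(i => encoded.table(i))
}
```

With this shape, each distinct name would be UTF-8 decoded once per file rather than once per occurrence.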
Eager Relation Building is costly
All maps that are deserialized are converted to a Relation[A,B], a pair of maps, one with the forward relations and another with the inverse. The inverse map is of type Map[B, Set[A]].
The usedNames map appears to be the largest. Is the inverse relation actually used? If not, refactor Zinc to avoid building this up!
The other cost here is building immutable Map/Sets, rather than just upcasting mutable ones to collection.{Map,Set}. This may well be worth it if clients of MRelationsNameHashing use its operations to construct new versions of it; these would structurally share with the base version. But it's worth analysing how these are used.
It looks like the value in the forward relation in UsedName is not actually accessed as a Set[UsedName]; a sequential collection would suffice.
Furthermore, the list of used names for a file is only consulted if there is a memberRef relation between the file with a changed API and the candidate file for invalidation. This is an opportunity for laziness.
Lazy parsing?
If the used names data for many compilation units is actually not needed, can we avoid reading it from disk and parsing it in the first place? This would require an on-disk format with random access into the map of used names for a given unit.
Protobuf's documentation notes:
I could imagine an incremental change to the current format that included an entry in inc_compile.zip for each compilation unit.
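As a rough illustration of the per-unit entry idea, the sketch below uses only java.util.zip. The used-names/ entry prefix and the newline-separated encoding are assumptions made up for this example, and do not reflect the actual inc_compile.zip layout. Opening a ZipFile reads the central directory, but only the requested entry's data is decompressed and parsed.

```scala
import java.io.{ File, FileOutputStream }
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ ZipEntry, ZipFile, ZipOutputStream }

object PerUnitEntries {
  // Hypothetical entry naming scheme; the real inc_compile.zip layout differs.
  private def entryName(unit: String): String = "used-names/" + unit

  // Writes one zip entry per compilation unit, with names separated by newlines.
  def write(file: File, usedNames: Map[String, Seq[String]]): Unit = {
    val out = new ZipOutputStream(new FileOutputStream(file))
    try {
      usedNames.foreach { case (unit, names) =>
        out.putNextEntry(new ZipEntry(entryName(unit)))
        out.write(names.mkString("\n").getBytes(UTF_8))
        out.closeEntry()
      }
    } finally out.close()
  }

  // Reads the used names of a single unit; the data of other entries is not touched.
  def read(file: File, unit: String): Option[Seq[String]] = {
    val zip = new ZipFile(file)
    try {
      Option(zip.getEntry(entryName(unit))).map { entry =>
        val in = zip.getInputStream(entry)
        try scala.io.Source.fromInputStream(in, "UTF-8").getLines().toList
        finally in.close()
      }
    } finally zip.close()
  }
}
```

Combined with the memberRef observation above, a reader along these lines would only be invoked for the candidate files that actually need their used names consulted.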
Moar Hashing?
Another #brainstorm: what if we compressed the used-names section by storing hashes of the names? i.e. rather than recording that A.scala -> ["println", "Predef", ...], we instead record that it references ["println".hashCode, "Predef".hashCode, ...]? Saving ints rather than Strings would reduce the size of the persisted form and avoid UTF-8 decoding and String interning on deserialization.
In case of hash collisions we would get over-compilation (although it would not propagate far.)
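A small sketch of how the hashed form might be queried, assuming plain String.hashCode as the hash and hypothetical names HashedNames, encode and mightUse. A collision can only produce a false positive, which matches the over-compilation trade-off described above.

```scala
object HashedNames {
  // Stores each used name as its 32-bit String.hashCode, deduplicated and
  // sorted so that lookups can use binary search.
  def encode(names: Iterable[String]): Array[Int] =
    names.iterator.map(_.hashCode).toArray.distinct.sorted

  // May return true for a name that was never used (hash collision), which
  // leads to over-compilation but never to a missed invalidation.
  def mightUse(hashes: Array[Int], changedName: String): Boolean =
    java.util.Arrays.binarySearch(hashes, changedName.hashCode) >= 0
}
```

For example, "Aa" and "BB" have the same hashCode, so a unit that uses only "Aa" would also be reported as possibly using "BB".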