Inch toward the finish line: As Seen From & Scala std lib processing #6
From past experience, I can tell you that one of the most difficult parts of compiling the collections in the standard library is proper support for F-bounded polymorphism. There are potential cycles everywhere that need to be avoided like minefields; some examples here: https://github.com/lampepfl/dotty/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Amerged%20cyclic%20reference%20
Interesting! Thanks a lot for the reference. I suspect I might get away with not stepping into the F-bounded polymorphism minefield. F-bounded polymorphism is where parametric polymorphism and subtyping meet; however, outline types deliberately do not support subtyping checks, so in Kentucky Mule F-bounded polymorphism collapses into regular parametric polymorphism. I'll keep an eye on this while working on the implementation, though.
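For illustration, here's a standard F-bounded pattern (my own toy example, not code from the standard library):

```scala
// F-bounded polymorphism: the type parameter T is bounded by a type that
// mentions T itself.
trait Ord[T <: Ord[T]] {
  def compareTo(other: T): Int
}

// Checking that Meters satisfies the bound (Meters <: Ord[Meters]) is a
// subtyping question; an outline type only records the parent Ord[Meters]
// without verifying the bound.
final case class Meters(value: Int) extends Ord[Meters] {
  def compareTo(other: Meters): Int = value.compareTo(other.value)
}
```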
I just pushed the implementation of package objects in Kentucky Mule! Implementing it gave me an interesting insight into the cost of this language feature. I'm copying my notes on it below.

Package objects

Initially, Kentucky Mule didn't support package objects; they felt like a feature that could be deferred. The catch is that the members of a package object become members of the enclosing package itself.
This makes it pretty tricky to compute all members of a package: the package's own declarations are not enough, you also need the members contributed by the package object. My current plan is to have two different paths for completing a package type:

1. keep the package object's members separate and consult the package object during member lookups in the package, or
2. copy the package object's members into the package type when the package's completer runs.
The end result will be that lookups through a package type will be as performant as lookups through any other type.

Nov 5th, 2017 update

I implemented the package object support by adding package types as described above.
The tradeoff between 1. and 2. is the performance cost of the PackageCompleter versus the cost of lookups. Option 1 optimizes for a faster package completer: if there's no package object, completing the package stays as cheap as before. Option 2 optimizes for faster package lookups: all members are collapsed into a single place, so a lookup never has to consult the package object separately. Initially, I thought 2. is too expensive: copying all members sounded really wasteful.
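To make the two options concrete, here's a tiny example (the file split is just illustrative) of what a package object contributes to its package:

```scala
// foo/A.scala
package foo

class A

// foo/package.scala
package object foo {
  class B          // foo.B exists only because of the package object
  val answer = 42  // and so does the term member foo.answer
}
```

Under option 1, a lookup of `foo.B` consults the package object at lookup time; under option 2, `B` and `answer` are copied into the member table of package `foo` when its completer runs.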
Package object performance cost

Even with the very careful analysis of all options I could think of, I didn't anticipate the actual performance cost of the feature. `Enter` performance (entering symbols into the symbol table) is about the same as before; the slow-down shows up in completing signatures. I was curious what the source of the slow-down was. One way to look at it is the number of completer jobs executed, which I dig into in a later note.
master now contains support for the empty package. Surprisingly, the empty package handling in Scala is a very good example of how not to design software. The empty package is described in the Scala specification, section 9.2.
Reading this description, the empty package (packaging) seems to be a straightforward and small technical detail of Scala. I thought I'd implement it quickly and move on to more interesting problems. I ended up spending a fair amount of time debugging what turned out to be an odd behavior of both scalac and dottyc. I added a 70-line-long comment explaining my original wrong assumptions, my attempts at fixing my implementation, and the eventual surrender to an ugly hack. Let's look at an example:

```scala
// A.scala
class A
package foo {
class B
}
```

Reading the spec, one would expect this to be interpreted as:

```scala
// A.scala
package <empty> {
class A
}
package foo {
class B
}
```

However, both scalac and dottyc parse the original example as:

```scala
package <empty> {
class A
package foo {
class B
}
}
```

The rule the parser follows is: if there is at least one top-level definition outside of a package, everything is wrapped into an empty package declaration. This AST structure is slightly wrong: no package is allowed to be declared inside the empty package. I wondered why one would wrap all declarations in the empty package declaration. I believe it's an example of a misguided sense of economy: cutting down on the kinds of ast nodes, unnecessarily. The parser is expected to return a single ast node, so multiple top-level declarations need to be wrapped into something. One solution is to introduce a dedicated ast node that represents a whole compilation unit and can hold multiple top-level declarations.
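As a sketch of what such a node could look like (the names and shape here are hypothetical, not what scalac or dottyc actually use):

```scala
object AstSketch {
  sealed trait Tree
  final case class PackageDef(name: String, stats: List[Tree]) extends Tree
  final case class ClassDef(name: String) extends Tree
  // a compilation unit can hold several top-level statements, so nothing needs
  // to be wrapped into an artificial <empty> package
  final case class CompilationUnit(stats: List[Tree]) extends Tree

  // A.scala from the example: class A sits directly in the compilation unit,
  // and package foo stays a real package instead of being nested in <empty>
  val unit: CompilationUnit = CompilationUnit(
    List(
      ClassDef("A"),
      PackageDef("foo", List(ClassDef("B")))
    )
  )
}
```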
I just pushed the implementation of import renames. Here are my notes. Let's consider an example:

```scala
object A {
object B
}
import A.{B => B1, _}
```

In addition to fixing the lookup of the renamed name `B1`, implementing this required getting the wildcard semantics right: a name mentioned in a selector (here `B`) is not made available again by the trailing wildcard.
To implement those wildcard semantics, I have to remember whether the name I'm looking for in a given import clause has appeared in any of its selectors. This is done while scanning the selectors for a match. If I don't find a match, and if the name didn't occur in any of the selectors and the import clause has a wildcard selector, I perform a member lookup from the stable identifier of the import clause.
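A minimal sketch of that selector scan (the `Selector`/`ImportClause` classes here are simplified stand-ins I made up, not Kentucky Mule's actual types in Enter.scala):

```scala
// Matching a looked-up name against an import clause: selectors are checked
// first; the wildcard applies only if the name did not appear in any selector,
// so a renamed name "hides" the original name from the wildcard.
final case class Selector(original: String, renamed: String)

final case class ImportClause(selectors: List[Selector], hasFinalWildcard: Boolean) {
  /** Returns the member found through this clause, if it matches. */
  def matchName(name: String, lookupInPrefix: String => Option[String]): Option[String] = {
    var seenNameInSelectors = false
    var result: Option[String] = None
    for (sel <- selectors if result.isEmpty) {
      if (sel.original == name) seenNameInSelectors = true
      if (sel.renamed == name) {
        seenNameInSelectors = true
        result = lookupInPrefix(sel.original) // a renamed selector matched
      }
    }
    if (result.isEmpty && !seenNameInSelectors && hasFinalWildcard)
      lookupInPrefix(name) // fall back to the wildcard
    else result
  }
}
```

For `ImportClause(List(Selector("B", "B1")), hasFinalWildcard = true)`, a lookup of "B1" resolves to `A.B`, a lookup of "B" fails (hidden by the rename), and any other name falls through to the wildcard.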
Interestingly, I found a bug in scalac that allows a rename of the wildcard to an arbitrary name:

```
scala> object Foo { class A }
defined object Foo

scala> import Foo.{_ => Bla}
import Foo._

// A is available because the import is a wildcard one despite the rename!
scala> new A
res1: Foo.A = Foo$A@687a0e40

// Bla is not available
scala> new Bla
<console>:16: error: not found: type Bla
new Bla
```

Now, sadly, this commit introduces a 5% performance regression for very unclear reasons. We went from around 2050 ops/s to 1950 ops/s. I tried to narrow it down. This simple patch restores most of the lost performance (brings us back to 2040 ops/s):

```diff
diff --git a/kentuckymule/src/main/scala/kentuckymule/core/Enter.scala b/kentuckymule/src/main/scala/kentuckymule/core/Enter.scala
index ed9e1fb5e1..ee75d3dc45 100644
--- a/kentuckymule/src/main/scala/kentuckymule/core/Enter.scala
+++ b/kentuckymule/src/main/scala/kentuckymule/core/Enter.scala
@@ -596,14 +596,14 @@ object Enter {
// all comparisons below are pointer equality comparisons; the || operator
// has short-circuit optimization so this check, while verbose, is actually
// really efficient
- if ((typeSym != null && (typeSym.name eq name)) || (termSym.name eq name)) {
+ if ((typeSym != null && (typeSym.name eq name))/* || (termSym.name eq name)*/) {
seenNameInSelectors = true
}
if (name.isTermName && termSym != null && (selector.termNameRenamed == name)) {
- seenNameInSelectors = true
+// seenNameInSelectors = true
termSym
} else if (typeSym != null && (selector.typeNameRenamed == name)) {
- seenNameInSelectors = true
+// seenNameInSelectors = true
typeSym
} else NoSymbol
}
@@ -616,7 +616,7 @@ object Enter {
// If a final wildcard is present, all importable members z
// of p other than x1,…,xn,y1,…,yn
// are also made available under their own unqualified names.
- if (!seenNameInSelectors && hasFinalWildcard) {
+ if (/*!seenNameInSelectors &&*/ hasFinalWildcard) {
exprSym0.info.lookup(name)
} else NoSymbol
}
```

If you look at it closely, you'll see that the change is disabling a couple of pointer equality checks and assignments to a local boolean variable. This shouldn't drop the performance of the entire signature completion by 4%! I haven't been able to understand what's behind this change. I tried using the JITWatch tool to understand the possible difference in JIT's behavior and maybe look at the assembly of the method: it's pretty simple, after all.
This note is about how I try to stay true to my original development principles, in particular being absurdly paranoid about performance, while still getting anything done. While working on the package object support and the import rename support, I encountered two different performance regressions that led me to develop some simple tactics and tools (the right metrics!) that help me deal with performance losses efficiently. I'll tell the backstory of how I came to develop them.

Package objects

When I originally implemented package object support, I saw almost a 50% drop in the performance of completing type signatures. I expected some performance hit; I explained the source of the slowdown in an earlier comment. However, I was completely lost as to why this feature would cost 50% of my performance. I opened the YourKit profiler and tried comparing profiles before and after my change. They looked completely different, which didn't make any sense: after all, I was benchmarking on the same input. "Was I hitting some pathological case in the JIT?", "Was I lucky before because the code was particularly JIT friendly, and suddenly it became too complex to optimize well?", "Is Kentucky Mule a house of cards that's falling apart in front of me?" were the thoughts I was having at the time. Then I thought that I might be doing something expensive without realizing it. Out of curiosity, I checked the stats on how many completer tasks are executed while processing scalap's sources, both before my change and for the package object commit.
The comparison made the problem obvious: the package object commit executed far more completer jobs and reported many more missed dependencies than before.
Aha, so we're suddenly missing a lot of dependencies. Packages now always have completers: a package completer checks for the existence of a package object to determine the final list of members of the package. So even though most packages don't have a package object, every package now has to be completed before anything can be looked up in it. And every other completer, directly or indirectly, depends on looking up something in a package. Here's what was happening. Let's pick a simplified example:

```scala
package foo
class A extends B
class B extends C
class C
```

The queue at the beginning holds the completers for `A`, `B` and `C`, and the completer for package `foo` at the very end. The first completer tries to look up `B` in package `foo`, but the package isn't completed yet, so the completer reports a missing dependency and goes back to the queue. The same goes for the second completer, and so on. We can see what's happening: package `foo`, which is scheduled to be completed last, is a dependency for everybody else. It forces the entire queue to run once without completing anything. We do twice as much work as a result! The actual details were more complicated, but the gist is right there: packages should be manually scheduled to be completed first. I made a simple change to enforce that, and both the reported number of completer jobs executed and the actual performance numbers were back on track.

It's also clear why YourKit was useless in this case: the profile was indeed completely different. For half of the execution of the run, the results returned by completers were different, which caused the entire tree of method calls to have different timings and simply look incomparable. Profilers like YourKit are good at spotting local changes to the runtime profile but are really bad at helping with global changes like the one I had.

For some time I've been thinking of Kentucky Mule as an exploration of applying systems programming thinking to compiler design. This is the best story so far supporting that approach. Systems programmers are very good at coming up with units of computational abstraction that have predictable performance characteristics, which lets you think about performance in those units instead of raw nanoseconds. Whether it's a row in a file for MapReduce or a row in a database, both are abstractions whose performance can be reasoned about in terms of their quantity. For Kentucky Mule, that unit of abstraction is a completer together with the completer queue. These days, the first check I perform is looking at the size of my completers queue and the dependency misses to spot-check whether I made a mistake. Having these stats is a delightful superpower in reasoning about the runtime characteristics of my little system.
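Here's a tiny, self-contained simulation of that scheduling effect; the job model and the re-enqueueing policy are simplified stand-ins, not Kentucky Mule's actual queue:

```scala
// Simulates the queue behavior from the example: completers for A, B and C are
// scheduled before the completer for package foo, so each of them misses its
// dependency once and has to be re-run.
object QueueTrace {
  def run(initial: List[String]): Unit = {
    val completed = scala.collection.mutable.Set.empty[String]
    var queue = initial
    var executed = 0
    while (queue.nonEmpty) {
      val job = queue.head
      queue = queue.tail
      executed += 1
      // every class completer needs package foo to be completed first
      val blocked = job != "foo" && !completed("foo")
      if (blocked) queue = queue :+ job // dependency miss: try again later
      else completed += job
    }
    println(s"jobs executed: $executed for ${completed.size} completions")
  }

  def main(args: Array[String]): Unit =
    run(List("A", "B", "C", "foo")) // prints: jobs executed: 7 for 4 completions
}
```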
Import renames

Import renames introduced a mysterious 5% performance regression and I couldn't figure out why. It was frustrating. I set out to be paranoid about every last bit of performance, and I didn't want to give up on investigating this problem. It looked like some JIT issue that might be interesting to investigate and learn from. But JIT analysis tools are immature (they crashed for me). And what about the 12 other tasks still waiting to be finished? When is it time to call it quits because the ROI isn't there? I did some simple math. Let's assume that I have not 12 other tasks waiting to be finished but 30, because I haven't discovered all the remaining work; there's always more work than you think. Let's assume that I lose 5% of performance every time I implement one of the remaining tasks. The total performance would drop down to 0.95^30 ≈ 0.2 of the current performance. That is: the cumulative slowdown would be about 5x, so not even an order of magnitude. Kentucky Mule processes scalap's sources at a rate equivalent to ~4 million lines of code per second, so even an order of magnitude performance drop would still be an amazing result if I had all Scala language features implemented. If I can't figure out the source of that 5% drop, I don't lose sleep over it. This points to some larger truth when it comes to performance in software engineering: if you have a great start, you have a lot of surplus performance you can spend on non-essential aspects of the project. This is the opposite of "premature optimization is the root of all evil". And it's ok.

Kentucky Mule is all about pushing the limits of performance while still getting something done.
On January 17th, 2018, I finished a redesign of the completers queue; I'm writing up my notes on it now. The concern of the completers queue is the scheduling and execution of jobs according to a dependency graph that is not known upfront. The dependency graph is computed piecemeal by executing completers and collecting missing dependencies. Only completers that haven't run yet are considered missing dependencies. In my early notes, in the "Completers design" section, I described the origins of the cooperative multitasking design of completers and speculated about the capabilities that design could unlock. The redesigned completers queue takes those earlier speculations and turns them into an implementation; its new capabilities are described below.
To my astonishment, I came up with a design of the completers queue that can safely run in parallel with minimal coordination and therefore is almost lock-free. This is a very surprising side effect of thinking about how to detect cycles with minimal overhead. To explain the new design, let me describe the previous way of scheduling completers. During the tree walk, when I enter all symbols into the symbol table, I also schedule the corresponding completers. Completers for entered symbols are appended to the queue in the order of symbol creation. The order in the queue doesn't matter at this stage, though. Once the queue execution starts, each completer picked from the head of the queue either finishes its job or reports a missing dependency.
The first case is straightforward: the completer did its job and is removed from the queue. In the second case, we append two jobs to the queue: first the completer of the missing dependency, and then the blocked completer itself so it can be retried later.
If we picked a completer from the queue and it turned out to be blocked, both the dependency and the blocked completer end up at the back of the queue. This very simple scheduling strategy worked ok to get Kentucky Mule off the ground. However, it can't deal with cyclic dependencies. E.g.:

```scala
class A extends B
class B extends C
class C extends A
```

would result in the job queue growing indefinitely.
Even if we didn't put the same job in the queue multiple times, we would get into an infinite loop anyway. In Kentucky Mule, cycles appear only when the input Scala program has an illegal cycle, so the result of running the queue should be an error message about the detected cycle. The new design of the queue keeps track of pending jobs off the main queue: a job that is blocked on another job is removed from the main queue and parked with the job that blocks it.
In the example above with the `A`, `B`, `C` cycle, all three completers eventually end up parked as pending and the main queue drains without completing any of them, which is exactly the condition used to detect and report the cycle.
If some blocking job does complete, its pending jobs need to be released back to the main queue. I implemented this by maintaining an auxiliary queue associated with each job. I also have a global count of all pending jobs that is trivial to implement and makes checking for a cycle simple. This is not merely a performance optimization: I don't have a registry of all auxiliary queues, so without the counter it would be hard to find out whether any pending jobs are left.

I drew a picture that illustrates the execution of jobs. The execution is in 8 steps. In each step, we take a completer from the head of the queue and run it. If it returns a result signaling that it's blocked on another completer, we place it in that completer's auxiliary queue; e.g. in the first step, the blocked completer lands in the auxiliary queue of the completer it depends on.

The exciting thing about this design is that we check for cycles only once during the whole execution of the job queue: when the main job queue becomes empty. Moreover, there's no dependency on the execution order of tasks in the main queue. This design permits the jobs to run in parallel at ~zero cost precisely because the cycle detection criterion is so simple. In Kentucky Mule all heavy work is done through the completers queue, which now can be executed with a high degree of parallelism. Now here's the kick. I came up with Kentucky Mule with the idea that I can have a very efficient sequential phase in the compiler that unlocks parallelism in other phases. I drew a picture of Kentucky Mule as one box in my earlier blog post. The new design of the completers queue breaks Kentucky Mule down into many parallel boxes and pushes the parallelism of the overall design to a place I didn't expect to reach. Once I realized that, I had to go for a run to channel the thrill of the discovery. It was that good.

My redesign abstracted completers away from the queue by introducing:

```scala
trait QueueJob {
def complete(): JobResult
def isCompleted: Boolean
val queueStore: QueueJobStore
}
```

The job returns a result that is an ADT:

```scala
sealed abstract class JobResult
case class CompleteResult(spawnedJobs: Seq[QueueJob]) extends JobResult
case class IncompleteResult(blockingJob: QueueJob) extends JobResult
```

The `CompleteResult` can carry jobs spawned during completion, which are appended to the main queue, and the `IncompleteResult` carries the job that blocked the current one. Each job has the `QueueJobStore` associated with it, which is where its pending (blocked) jobs are parked.
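Here's a compact sketch of the queue loop described above, with made-up names (`RunResult`, `Job`, `JobQueue`) standing in for the real `QueueJob`/`JobResult` machinery:

```scala
import scala.collection.mutable

// Simplified stand-ins: enough to show the scheduling, the per-job auxiliary
// queue, and the single cycle check once the main queue drains.
sealed trait RunResult
case object Done extends RunResult
final case class Blocked(on: Job) extends RunResult

trait Job {
  def run(): RunResult
  var completed: Boolean = false
  // auxiliary queue: jobs parked here are waiting for this job to complete
  val pendingOnThis: mutable.Queue[Job] = mutable.Queue.empty
}

object JobQueue {
  /** Returns true if everything completed, false if a cycle was detected. */
  def execute(initial: Seq[Job]): Boolean = {
    val queue = mutable.Queue(initial: _*)
    var pendingCount = 0 // global count of parked jobs
    while (queue.nonEmpty) {
      val job = queue.dequeue()
      job.run() match {
        case Done =>
          job.completed = true
          // release jobs that were blocked on this one back to the main queue
          while (job.pendingOnThis.nonEmpty) {
            queue.enqueue(job.pendingOnThis.dequeue())
            pendingCount -= 1
          }
        case Blocked(other) if other.completed =>
          // the dependency finished in the meantime; just retry the job
          queue.enqueue(job)
        case Blocked(other) =>
          // park the job with the job that blocks it, off the main queue
          other.pendingOnThis.enqueue(job)
          pendingCount += 1
      }
    }
    // the only cycle check: the main queue drained but some jobs are still parked
    pendingCount == 0
  }
}
```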
Let's now look at the performance cost of the additional bookkeeping necessary to implement this queue algorithm, comparing the benchmark numbers from before the job queue work with the numbers with all the job queue changes in place.
I lost ~110 ops/s, which is about 5.5% of performance. This is actually a really good result for such a substantial piece of work: a major queue refactor plus a cycle detection algorithm. And let's not forget that 1820 ops/s corresponds to over 3.6 million lines of scalap code per second. Thanks to Filip Wolski for discussions and inspiration on how to approach the problem of cycle detection in a graph computed online.
Today I'm writing a note on why package objects with inheritance are ill-defined in Scala and why this is a headache. I describe a workaround that I implemented in late January 2018. Over two months later, I still have a feeling of having surrendered to a feature Scala shouldn't have. To understand the problem with package objects with inheritance, let's consider this example:

```scala
package foo {
package bar {
class A {
class B
}
}
}
package foo {
package object bar extends A {
class C
}
}
```

The problem here is that the members of the package object `bar` (including the ones inherited from `A`) become members of package `bar` itself, so computing the members of package `bar` requires completing the package object, while resolving the package object's parent `A` requires looking something up in package `bar`. My original implementation of package object support ruled out this scenario with a cyclic error: completing package `bar` and completing its package object each waited on the other. I resorted to implementing a "stripped down" version of member lookup in a package for when the request comes from a package object. The stripped-down version only looks at declarations in the package and bypasses the need for completing the package's type. Specifically, only members declared syntactically within the package are visible to such lookups; members contributed by the package object itself are not. I wasn't sure how exactly to implement this special case without sacrificing code structure or performance. I ended up passing an instance of a special package lookup class specifically to the completer of a package object. This achieves two goals at the same time: the special case stays confined to package object completion, and the regular package lookup path doesn't pay any extra cost.
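A rough sketch of the distinction between the two lookups (the class and method names are illustrative, not Kentucky Mule's actual API):

```scala
import scala.collection.mutable

// Simplified model: members are just names.
final class PackageSym(val name: String) {
  private val declared = mutable.Set.empty[String]   // syntactic declarations
  private var fromPackageObject = Set.empty[String]  // merged in on completion
  private var completed = false

  def enterDeclared(member: String): Unit = declared += member

  def completeWith(packageObjectMembers: Set[String]): Unit = {
    fromPackageObject = packageObjectMembers
    completed = true
  }

  // regular lookup: legal only once the package type is completed
  def lookup(member: String): Boolean = {
    require(completed, s"package $name is not completed yet")
    declared.contains(member) || fromPackageObject.contains(member)
  }

  // stripped-down lookup used while completing the package object itself:
  // it sees only syntactic declarations, so no package completion is needed
  // and the package / package-object cycle is avoided
  def lookupDeclared(member: String): Boolean = declared.contains(member)
}
```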
The details of the implementation are in my commit ad1999d. I'm happy I managed to retreat from the minefield of package objects with parent types. I'm convinced that this trap shouldn't exist in Scala in the first place, though.
Your change looks pretty nice and localized in how it deals with a tricky problem. :-)
Agree on that. I think the weird syntax for defining a package object is largely to blame. Do you remember any details of how the current syntax came to be?
Me too: scala/scala-dev#441
object package { ... } is valid and works as desired if you wrap the name in backticks, i.e. object `package` { ... }. @gkossakowski's example can be rewritten as:

```scala
package foo {
package bar {
class A {
class B
}
object `package` extends A {
class C
}
}
}
```
@andyscott yes, syntactically you can bring more clarity, but the scoping remains icky.
(i've rewritten my note to be a little bit more scoped and changed the wording)
@soc i don't remember the history of the package object introduction. i think they were introduced in Scala 2.8 together with the redesigned scala collections, and I think it was clear already at the time that their semantics needed more work to be nailed down. I'm glad to hear the consensus is to deprecate inheritance in package objects.
On April 8, 2018, I pushed changes that make the handling of imports more lazy, and I'm writing a note on why. Let's consider this example:

```scala
import A.X1
import A.X2
import A.X3
class B extends X2
object A {
class X1
class X2
class X3
}
```

The previous strategy of handling imports would resolve all of them as soon as the first lookup touched the enclosing scope. However, I forgot about a scenario like this that's allowed in Scala:

```scala
package a {
package b {
// this wildcard import is ok: it forces completion of `C` that is not dependent on `A` below
import C._
// import A._ causes a cyclic error also in scalac; only importing a specific name
// avoids the cycle
import A.X
object A extends B {
class X
}
object C
}
class B
}
```

That is: we have forward imports in Scala.
The patch I pushed today introduces a lazier completion scheme for imports. The new logic won't consider all imports in scope but only the imports that could possibly match the name being looked up. With this change, the lookup for the type `X2` in the first example no longer forces resolution of the unrelated `X1` and `X3` imports. This concludes my initial musings on whether imports need to be handled eagerly or lazily.
The change also shifted the completer stats: fewer completer jobs get executed after it than before.
I don't have a good intuition for why. Maybe some imports were not necessary to be resolved at all (dead code) in such a simple code base as scalap?
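As a rough sketch of the lazy scheme, under simplified assumptions (names are plain strings, a single flat scope, no wildcard selectors):

```scala
// Imports are recorded but not resolved when entered; a lookup walks the
// import clauses and resolves only those that could provide the name.
final case class LazyImport(qualifier: String, importedName: String) {
  private var resolved: Option[String] = None
  def matches(name: String): Boolean = importedName == name
  // resolving may itself trigger other completers in the real system
  def resolve(lookupInQualifier: (String, String) => Option[String]): Option[String] = {
    if (resolved.isEmpty) resolved = lookupInQualifier(qualifier, importedName)
    resolved
  }
}

final class LazyImportScope(imports: List[LazyImport]) {
  def lookup(name: String, lookupInQualifier: (String, String) => Option[String]): Option[String] =
    imports.collectFirst {
      case imp if imp.matches(name) => imp.resolve(lookupInQualifier)
    }.flatten
}
```

With the three imports from the first example, a lookup of "X2" only ever resolves `import A.X2`; `import A.X1` and `import A.X3` stay untouched.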
Today I'm sharing my older note from March 20th, 2018, on collapsing inherited members (or not). I keep this note also in my notes.md file. Kentucky Mule copies members from parent classes; one of the reasons is to keep member lookup simple and fast.
I'm going to write a short note on that strategy. The idea was to collapse member lookup into a single hashtable per class, so that finding a member never requires traversing the parent types. I want to highlight that this strategy leads to a quadratic blow-up in stored members for deep inheritance chains:

```scala
class A1 { def a1: Int }
class A2 extends A1 { def a2: Int }
class A3 extends A2 { def a3: Int }
...
class A10 extends A9 { def a10: Int }
```

The class `A10` ends up with ten members, `a1` through `a10`, and across the whole chain the copies add up. The strategy of collapsing all members into a single hashtable was motivated by lookup speed: a lookup is a single hashtable access no matter how deep the inheritance chain is. Switching between strategies was easy, and I benchmarked both variants: collapsed members versus traversing the parents on each lookup.
With no noticeable difference in performance between them, I'm going to change member lookup to traversing the parents.
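Here's a sketch of the two strategies side by side, in a simplified model where members are just names:

```scala
import scala.collection.mutable

final class ClassInfo(parent: Option[ClassInfo], declared: Set[String]) {

  // Strategy 1: collapse. Copy all inherited members into one hashtable.
  // For the chain A1 <: ... <: A10 this stores 1 + 2 + ... + 10 = 55 entries in total.
  lazy val collapsedMembers: mutable.HashSet[String] = {
    val all = mutable.HashSet.empty[String]
    all ++= declared
    parent.foreach(p => all ++= p.collapsedMembers)
    all
  }
  def lookupCollapsed(member: String): Boolean = collapsedMembers.contains(member)

  // Strategy 2: traverse. Check own declarations, then walk up the parent chain.
  def lookupTraversing(member: String): Boolean =
    declared.contains(member) || parent.exists(_.lookupTraversing(member))
}
```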
Background read: Can Scala have a highly parallel typechecker?
This is a meta issue with a quick overview of all the tasks I came up with when thinking through adding support for the As Seen From (ASF) algorithm and for processing the Scala std lib in Kentucky Mule.
I'm following the format from my work on zinc, which turned out to be successful in managing the complexity of a difficult project I worked on in the past: sbt/sbt#1104
It shipped as part of the sbt 1.0 release recently.
This ticket describes the second stage of Kentucky Mule's development. The first stage was about validating whether the ideas for rethinking a compiler's design with a focus on performance are in touch with how fast computers actually compute. The second stage is about confronting Kentucky Mule's ideas with what I consider the trickiest aspects of Scala's design to implement in a performant way.
I'm planning to update this issue with interesting findings and roadblocks I hit while working towards finishing the tasks listed below. My intent for this issue is twofold: to track the remaining work and to record the findings in one place.
Context
In my most recent blog post on Kentucky Mule (linked above as the background read), I laid out the context for this work.
I'm breaking up the work required for adding the As Seen From and some other missing Scala language features to Kentucky Mule into a list of smaller tasks. The list is not meant to be set in stone and will get refined as more issues and tasks come to light.
Once I surfaced the list, I realized the scope is a bit larger than I originally speculated. I'm revising my previous prediction of 15 days of deeply focused effort to complete this project and bumping it to 20 days.
Tasks
Missing language features
- Type aliases: `type T = String`
- Parameterized type aliases: `type Foo[T] = List[T]`
- `this` in paths
- Adding `Predef` to every compilation unit
- Adding the `scala` and `java` packages to every compilation unit
- `super` in paths

Implementation features
Features that are not strictly language features but need to be implemented for other reasons.
As Seen From
Surprisingly, Kentucky Mule implements some aspects of the ASF already. E.g. Rule 3 for the type parameters is partially supported. I don't have a good intuition what's missing even when I have ASF's precise definition in front of my eyes. For performance reasons, Kentucky Mule implements ASF's features differently from how they're specified so simple diffing between the implementation and the spec doesn't cut it.
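For illustration (a toy example of mine, not Kentucky Mule code), this is what rule 3 does for type parameters:

```scala
// The declared type of `head` is the type parameter T, but as a member of the
// prefix Cell[Int] it is seen as Int: rule 3 substitutes the prefix's type
// argument for the class's type parameter.
class Cell[T](val head: T)

object AsSeenFromExample {
  val c: Cell[Int] = new Cell(1)
  val i: Int = c.head
}
```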
I'll come back to this section as the implementation of the other language features mentioned above progresses.
Status
I haven't touched Kentucky Mule for almost a year and I'm picking it up now to work on the tasks in this ticket continuously until I check all the boxes. Kentucky Mule remains my side project, so I aim for steady but slow progress made over the course of some of my evenings.
I implemented the original, first stage of Kentucky Mule in a short span of time when I was working on it full time. I'm curious whether my 20-day prediction for completing this project will hold when those days are broken into a larger number of evenings.