Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit more efficient code for simple for-comprehensions #8880

Open
danlehmann opened this issue May 4, 2020 · 30 comments
Open

Emit more efficient code for simple for-comprehensions #8880

danlehmann opened this issue May 4, 2020 · 30 comments

Comments

@danlehmann
Copy link

danlehmann commented May 4, 2020

I've been using Scala for years to write performance-critical code. One the pain points has always been the sub-par performance or seemingly benign for-comprehensions.

While it is clear that complex for-comprehensions emit inefficient code, a very common use-case is using them as a replacement for simple for-loops.

As an example, the two loops below do the same thing. The for-comprehension is cleaner (i is scoped more narrowly, no need for var). But one look at the decompiled code shows you just how much more inefficient the for-comprehension is.

Minimized code

class LoopTest {
  final val count = 1000

  def testWhile(): Unit = {
    var i = 0
    while (i < count) {
      Console.println("Number: " + i)
      i += 1
    }
  }

  def testFor(): Unit = {
    for (i <- 0 until count) {
      Console.println("Number: " + i)
    }
  }
}

Output

I compiled using dotty-0.24.0-RC1. This is the decompiled output:

public void testWhile() {
  for (int i = 0; i < 1000; ++i) {
    Console$.MODULE$.println((Object)("Number: " + i));
  }
}
    
public void testFor() {
  RichInt$.MODULE$.until$extension(
    Predef$.MODULE$.intWrapper(0), 1000).foreach(
      (Function1)(i -> Console$.MODULE$.println((Object)("Number: " + i))));
}

Expectation

Is there some way to allow using for-comprehensions with a guarantee that performant code is being produced? Maybe it would be possible to introduce something like @scala.annotation.for (analogous to @scala.annotation.switch)?

For me, such an improvement would have a big impact: In my project, I spent hours translating for-comprehensions into simple while loops to reap (measurable) performance benefits. But it is unfortunate that in the process I had to make my code less readable and more error prone. Scala 2.13's inliner helped somewhat, but the results weren't reliable, so I mostly ended up avoiding for-comprehensions.

@nicolasstucki
Copy link
Contributor

A simple way to remove the foreach is using inline

case class MyRange(start: Int, end: Int) {
  inline def foreach(inline f: Int => Unit) =
    var i = start
    val e = end
    while (i < e) {
      f(i)
      i += 1
    }
}

object App {
  val count: Int = 4
  for (i <- MyRange(0, count)) {
    Console.println("Number: " + i)
  }
}

Which generates the code

val count: Int = 4
    {
      val MyRange_this: MyRange = MyRange.apply(0, App.count)
      {
        var i: Int = MyRange_this.start
        val e: Int = MyRange_this.end
        while(i < e) {
          Console.println("Number: ".+(i))
          i = i.+(1)
        }
      }:Unit
    }

We could also try to use opaque types to avoid the creation of the object in some cases.

@smarter
Copy link
Member

smarter commented May 5, 2020

inline def foreach(inline f: Int => Unit) =

We cannot do this in practice currently because Range#foreach overrides a non-inline foreach so its parameter cannot be marked inline, so we would need to allow inline overloads which only differ in whether their parameter is inline or not (also the version that takes an inline parameter should not be called when the argument we pass to it has side effects like foreach({println("a"); x => ()), that would break semantics). Alternatively, maybe the inline typer could automatically inline pure arguments like lambdas ?

@LPTK
Copy link
Contributor

LPTK commented May 6, 2020

also the version that takes an inline parameter should not be called when the argument we pass to it has side effects like foreach({println("a"); x => ()), that would break semantics

Ouch, that sounds like a pretty big semantic landmine... almost at the level of surprise factor as the old (deprecated) Map#mapValues. It's not uncommon to see code like xs.foreach(foo(bar)) where foo is a curried function. One wouldn't expect foo(bar) to be called again for each element of the collection (that could be terrible for performance, even if foo was pure).

Are there legitimate uses of inline parameters which do not create these semantic landmines?

@danlehmann
Copy link
Author

We could also try to use opaque types

I gave that a shot but pretty quickly ran into compiler issues. It seems that inline and opaque types are mutually exclusive (see #6802 )

@neko-kai
Copy link
Contributor

neko-kai commented May 8, 2020

@dhoepelman What if you worked around it by defining extension methods outside of the companion object in a trait, then implement the trait as given in companion object?

object X {
  given ExtensionOps
}
trait ExtensionOps {
  inline final def map(...)...
}

@danlehmann
Copy link
Author

With extension methods, I got pretty close.

object Ranges {
  opaque type ZeroBasedRange = Int

  object ZeroBasedRange {
    given ExtensionOps

    def apply(end: Int): ZeroBasedRange = end
  }

  trait ExtensionOps {
    inline final def (range: ZeroBasedRange) foreach(inline f: Int => Unit) = {
      var i = 0
      val e = range.asInstanceOf[Int]
      while (i < e) {
        f(i)
        i += 1
      }
    }
  }
}

object App {
  final val count: Int = 4
  for (i <- Ranges.ZeroBasedRange(count)) {
    Console.println("Number: " + i)
  }
}

This compiles into this:

    private App$() {
        MODULE$ = this;
        this.count = 4;
        final Ranges$ module$ = Ranges$.MODULE$;
        final int range = Ranges.ZeroBasedRange$.MODULE$.apply(this.count());
        for (int i = 0, e = range; i < e; ++i) {
            Console$.MODULE$.println((Object)("Number: " + i));
        }
        final BoxedUnit unit = BoxedUnit.UNIT;
    }

Apart from that one apply function call (which I'd assume hotspot would optimize out), this seems to be optimal code. Of course, it only works for ranges starting with 0.

@danlehmann
Copy link
Author

I wanted to see if there's a way to generalize this to Ranges that don't start at 0. The only way I could think of is by packing two Ints into a Long like this:

object Ranges {
  opaque type AnyRange = Long

  object AnyRange {
    given AnyRangeExtensionOps

    def apply(start: Int, end: Int): AnyRange = start.toLong << 32 | end.toLong
  }

  trait AnyRangeExtensionOps {
    inline final def (range: AnyRange) foreach(inline f: Int => Unit) = {
      var i = (range.asInstanceOf[Long] >> 32).toInt
      var e = range.asInstanceOf[Long].toInt
      while (i < e) {
        f(i)
        i += 1
      }
    }
  }
}

object App {
  final val count: Int = 4
  for (i <- Ranges.AnyRange(0, count)) {
    Console.println("Number: " + i)
  }
}

This produces relatively efficient code:

    private App$() {
        MODULE$ = this;
        this.count = 4;
        final Ranges$ module$ = Ranges$.MODULE$;
        final long range = Ranges.AnyRange$.MODULE$.apply(0, this.count());
        for (int i = (int)(range >> 32), e = (int)range; i < e; ++i) {
            Console$.MODULE$.println((Object)("Number: " + i));
        }
        final BoxedUnit unit = BoxedUnit.UNIT;
    }

    // AnyRange$.apply:
    public long apply(final int start, final int end) {
        return (long)start << 32 | (long)end;
    }
    

While there's some bit-shifting, we at least don't pay any cost for object creation.

@smarter
Copy link
Member

smarter commented May 10, 2020

Note that using bitwise tricks like that can potentially confuse the JIT and prevent it from eliding bounds checks.

@neko-kai
Copy link
Contributor

You can use inline matches to create and then pattern-match on a phantom compile-time only case class - this should allow you to omit the structure completely when it's known at compile-time and fallback to accessing the structure fields otherwise.

@danlehmann
Copy link
Author

Note that using bitwise tricks like that can potentially confuse the JIT and prevent it from eliding bounds checks.

Any idea how to accomplish this with opaque types? Also, I'd assume that while it isn't ideal, the current Scala/Dotty solution of creating a Range instance would have the same issue?

@danlehmann
Copy link
Author

You can use inline matches to create and then pattern-match on a phantom compile-time only case class - this should allow you to omit the structure completely when it's known at compile-time and fallback to accessing the structure fields otherwise.

Would you mind giving me a pointer to how this could be done? I looked up phantom classes, but it seems like they were replaced by unused, which was then renamed to ghost. And I can't find any current document about that now :-)

@smarter
Copy link
Member

smarter commented May 10, 2020

Any idea how to accomplish this with opaque types?

No, but I don't think that's a good solution anyway compared to getting inlining working right.

Also, I'd assume that while it isn't ideal, the current Scala/Dotty solution of creating a Range instance would have the same issue?

Depends what code the JIT manages to inline.

@neko-kai
Copy link
Contributor

neko-kai commented May 10, 2020

@danlehmann
You should be able to unpack a normal case class with inline match, e.g.:

final case class MyRange(from: Int, to: Int)

inline def (from: Int) to (to: Int): MyRange = MyRange(from, to)

inline def (inline myRange: MyRange).foreach(inline f: Int => Unit): Unit = {
  inline myRange match {
    case MyRange(from, to) => while(...) { ... }
  }
}

def test = for (i <- 0 to 5) { println(i) }

see http://dotty.epfl.ch/docs/reference/metaprogramming/inline.html#inline-matches

@danlehmann
Copy link
Author

@neko-kai

This is perfect, thanks a lot. This creates close to optimal code without any objects or even function calls.

This could be quite useful in case the inliner isn't able to resolve the general case by the time Scala 3.0 ships.

@neko-kai
Copy link
Contributor

It's possible to add an inline method overriding Range#foreach in Predef while keeping compatibility with Range type:

package example

import scala.quoted.unsafe.UnsafeExpr

object MyPredef {
  type MyRange <: Range.Inclusive & InlineForeach

  inline def (inline from: Int) to (inline to: Int): MyRange = new Range.Inclusive(from, to, 1).asInstanceOf[MyRange]

  sealed trait InlineForeach { this: MyRange =>
    inline final def foreach(inline f: Int => Unit): Unit = ${ macroImpl('this)('f) }
  }

  import quoted._
  def macroImpl(self: Expr[MyRange])(f: Expr[Int => Unit])(using qctx: QuoteContext): Expr[Unit] = {
    UnsafeExpr.underlyingArgument(self) match {
      case '{new Range.Inclusive($from, $to, 1).asInstanceOf[MyRange]} =>
        '{ inlineForeach($from, $to + 1, $f) }
    }
  }

  inline def inlineForeach(inline from: Int, inline end: Int, inline f: Int => Unit): Unit = {
    var i = from
    while (i < end) {
      f(i)
      i += 1
    }
  }

}


import MyPredef._

object App {
  def main(args: Array[String]): Unit = {
    for (i <- 1 to 5) do println(i)
  }
}

Although this version with tasty-reflect generates less optimal code than an inline match does:

    public void main(String[] args) {
        new Inclusive(1, 5, 1);

        for(int i = 1; i < 6; ++i) {
            ((ix) -> {
                .MODULE$.println(BoxesRunTime.boxToInteger(ix));
            }).apply(BoxesRunTime.boxToInteger(i));
        }

        BoxedUnit var10000 = BoxedUnit.UNIT;
    }

No idea how to eliminate the lambda and the redundant new Inclusive(1, 5, 1); at the start, it seems that unlike Scala 2 macros, dotty inline instance methods do not eliminate this when applied and I don't know how to work around that.

@nicolasstucki
Copy link
Contributor

The only sound way to demand the elimination of the prefix is using an extension method where you define the prefix as inline (or maybe by name). Otherwise, we can only rely on a backend optimizer (including JIT).

The use of UnsafeExpr.underlyingArgument may also introduce some unsoundness.

@neko-kai
Copy link
Contributor

Hmm, is there a way to unpack new Range.Inclusive($from, $to, 1) using inline match facilities without having to resort to a tasty-reflect macro? I tried a custom unapply but it didn't work.

@nicolasstucki
Copy link
Contributor

You would need to define the inline method as

  inline def (inline self: MyRange).foreach(inline f: Int => Unit): Unit = ${ macroImpl('self)('f) }

@neko-kai
Copy link
Contributor

@nicolasstucki That wouldn't work because there's already a member method Range#foreach in MyRange <: Range.Inclusive & InlineForeach that would be chosen instead of an extension method - extension method search only happens after a member method is not found, so it must be a member method - hence the cast to InlineForeach trait with a member inline method.

@nicolasstucki
Copy link
Contributor

Indeed. Maybe, in this case, we should MyRange not extended Range.Inclusive but have an implicit conversion to it.

Why do you need to have it fall back to Range.Inclusive anyway? If I have we have an InlineRange wouldn't we expect all the operations to be inlined (hence reimplemented)?

@neko-kai
Copy link
Contributor

Maybe, in this case, we should MyRange not extended Range.Inclusive but have an implicit conversion to it.
Why do you need to have it fall back to Range.Inclusive anyway? If I have we have an InlineRange wouldn't we expect all the operations to be inlined (hence reimplemented)?

Implicit conversions tend to fail in many cases where variance works well. Because I'm trying to show something that could work as part of DottyPredef, it must be maxiamlly source compatible and can't use implicit conversions. Although I guess if this issue was solved in dotty-library it would probably be easiest to just completely shadow scala.collection.immutable.Range and add inlines there directly.

@TheElectronWill
Copy link
Contributor

We have inline def, why not have inline class? Similar to inline def, an inline class would be inlined at call site. All def in an inline class would also have to be inline.

def efficientLoop =
  for x <- Range(0,10,1) do
    println(x)

inline class Range(val start: Int, val end: Int, val step: Int):
  inline def foreach(f: Int => Any): Unit =
    var i = start
    while i < end do
      f(i)
      i += step

would translate to something like

def efficientLoop =
  val r$start = 0
  val r$end = 10
  val r$step = 1
  var i = r$start
  while i < r$end do
    println(i)
    i += r$step

Of course, if you want to store a Range instance somewhere, then it must be allocated. This might make the keyword inappropriate, since inline in Scala 3 forces inlining, it's more than a hint. In that case, either we

  • accept the special treatment of object allocation (we can even emit a warning)
  • or we forbid it.

If the aforementioned limitation is too big or seems too weird, note that we don't even need inline class to be part of the language! The performance problem would be solved if the compiler could "inline classes" on its own, on a best-effort basis.

I've found a very old SIP about inline classes 😄 However, it is a little different, because it compiles methods to actual methods in the companion object.

Note that this is also different from AnyVal (which, AFAIK, could be improved when the value types of project Valhalla are released).


Alternative: special-case Range.foreach in some compiler phase to emit a while-loop instead. Not so fun nor interesting, but it would work.

@smarter
Copy link
Member

smarter commented Mar 17, 2022

The performance problem would be solved if the compiler could "inline classes" on its own, on a best-effort basis.

The Scala 2 backend optimizer (which hasn't been ported to Scala 3 so far) does this sort of best-effort inlining: https://www.lightbend.com/blog/scala-inliner-optimizer, but in practice it doesn't help a lot on top of what the JIT does, and it clashes with separate compilation so only end-user applications can enable it safely.

@TheElectronWill
Copy link
Contributor

TheElectronWill commented Mar 17, 2022

It seems that this optimizer cannot eliminate allocations.

def main(): Unit = {
  for (x <- 1 until 10) {
    println(x)
  }
}

turns, with scalac -opt:l:inline -opt-inline-from:**, into

  public void main() {
    boolean bool = true;
    int until$extension_end = 10;
    Range.Exclusive exclusive = new Range.Exclusive(bool, until$extension_end, 1);
    if (!exclusive.isEmpty()) {
      int foreach$mVc$sp_i = exclusive.start();
      while (true) {
        Integer integer = Integer.valueOf(foreach$mVc$sp_i);
        Object println_x = integer = null;
        Console$.MODULE$.println(println_x);
        println_x = null;
        if (foreach$mVc$sp_i != ((Range)exclusive).scala$collection$immutable$Range$$lastElement) {
          foreach$mVc$sp_i += exclusive.step();
          continue;
        } 
        break;
      } 
    } 
  }

The foreach is inlined to a while loop, which is much better than what the dotty backend currently produces, but still far from perfect.

In the case of for-loops it helps quite a lot, however. I guess that porting the backend is planned but still TODO?

@strelec
Copy link

strelec commented Jun 24, 2022

First of all, I do not think it is worth bothering with an inlining optimizer in Scala 3. JVM will do a better job at it, given the runtime characteristics.

What you should do instead, is make sure that the compiler generates JVM-friendly code.

In lampepfl/dotty-feature-requests#308 I give exactly this argument. If you use a bare for loop, JVM can do many optimizations. It can unroll your loops, it can unswtich your loop, It can special case first and last iteration, it can split loops in two or more.

I think you are throwing all these great optimizations out of the window by generating a code that uses foreach with lambda.

Desugaring of for compehension in the compiler is already governned by a spec (you desugar for into filter and map). Since the compiler already does something special to desugar the for comprehension, why not do this as well?

@TheElectronWill
Copy link
Contributor

TheElectronWill commented Jun 24, 2022

What you should do instead, is make sure that the compiler generates JVM-friendly code.

If you want the compiler to generate such JIT-compiler-friendly code in way more cases that just for x <- a to b, then what you need is kind of an inliner-optimizer (1) 😃

Of course, the spec could be modified to special-case that simple for-loop and declare that it translates to a while loop. It hasn't be done in Scala. And I think that the reason was to avoid special-casing things in the specification.

(1) Actually, what would be great isn't just method inlining but scalar replacement, i.e. removing the object allocation completely, as I mentioned earlier. The Hotspot's JIT compiler does not do this. GraalVM does this a bit, because it has an escape analysis phase, but not everyone runs on GraalVM.

@smarter
Copy link
Member

smarter commented Jun 24, 2022

Desugaring of for compehension in the compiler is already governned by a spec (you desugar for into filter and map). Since the compiler already does something special to desugar the for comprehension, why not do this as well?

Currently, the desugaring of for comprehension is purely syntactic (it happens before typechecking, on the untyped AST) but to decide whether we can safely optimize a for-comprehension into a for loop, we'd need to typecheck it first (otherwise we can't know if 1 to 2 is scala.Predef.intWrapper(1).to(2) as you'd expect or some other extension method named to currently in scope).

I don't think there's anything fundamental that prevents us from having a type-directed desugaring of for comprehensions, but it will require a fair amount of work. It might be worth it since it should make it easier to solve other long-standing issues with for-comprehensions (in particular, #2573 and re-implementing https://github.com/oleg-py/better-monadic-for#final-map-optimization)

@sjrd
Copy link
Member

sjrd commented Jun 24, 2022

A general purpose optimizer with inlining, scalar replacement and constant folding can optimizing away the range foreach into a while loop, without having to change anything in the spec.

I know this is possible because Scala.js does it.

@He-Pin
Copy link

He-Pin commented Dec 5, 2024

So this is why in https://github.com/bddicken/languages/ , Scala is slower than Kotlin.

//
// Source code recreated by IntelliJ IDEA
// (powered by FernFlower decompiler)
//

//decompiled from code$package.class
package jvm;

public final class code$package {
    public static void main(String var0) {
        code$package$.MODULE$.main(var0);
    }
}

//decompiled from code$package$.class
package jvm;

import java.io.Serializable;
import java.lang.invoke.SerializedLambda;
import java.util.Random;
import scala.collection.StringOps.;
import scala.runtime.BoxesRunTime;
import scala.runtime.ModuleSerializationProxy;
import scala.runtime.java8.JFunction1;

public final class code$package$ implements Serializable {
    public static final code$package$ MODULE$ = new code$package$();

    private code$package$() {
    }

    private Object writeReplace() {
        return new ModuleSerializationProxy(code$package$.class);
    }

    public void main(final String number) {
        int u = .MODULE$.toInt$extension(scala.Predef..MODULE$.augmentString(number));
        int r = (new Random()).nextInt(10000);
        int[] a = new int[10000];
        scala.runtime.RichInt..MODULE$.until$extension(scala.Predef..MODULE$.intWrapper(0), 10000).foreach((JFunction1.mcVI.sp)(i) -> {
            scala.runtime.RichInt..MODULE$.until$extension(scala.Predef..MODULE$.intWrapper(0), 100000).foreach((JFunction1.mcVI.sp)(j) -> a[i] += j % u);
            a[i] += r;
        });
        scala.Predef..MODULE$.println(BoxesRunTime.boxToInteger(a[r]));
    }

    // $FF: synthetic method
    private static Object $deserializeLambda$(SerializedLambda var0) {
        return Class.lambdaDeserialize<invokedynamic>(var0);
    }
}

@OndrejSpanel
Copy link
Member

There was recently a related mention in #22006 hinting about possible Scala 3 solution:

Inlined override of foreach in Range so that we finally get fast C-style for loops

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants