Fit a fiber on a single cache line #2473

Closed
wants to merge 3 commits into from

vasilmkd
Member

This is not production ready, and it is probably not the best combination of fields for achieving a 64-byte IOFiber object footprint.

  1. I was examining bytecode and came across the following:
    private[this] val fields declared inside the class body generate a volatile bitmap$init field. I don't know exactly why. However, it can be circumvented (on all Scala versions) by declaring the same private[this] val fields in the primary constructor of the class. For comparison, the series/3.x bytecode looks like the following:
private volatile int bitmap$init$0;

public cats.effect.IOFiber(scala.collection.immutable.Map<cats.effect.IOLocal<?>, java.lang.Object>, scala.Function1<cats.effect.kernel.Outcome<cats.effect.IO, java.lang.Throwable, A>, scala.runtime.BoxedUnit>, cats.effect.IO<A>, scala.concurrent.ExecutionContext, cats.effect.unsafe.IORuntime);
Code:
    0: aload_0
    1: aload         5
    3: putfield      #255                // Field runtime:Lcats/effect/unsafe/IORuntime;
    6: aload_0
    7: invokespecial #1193               // Method cats/effect/IOFiberPlatform."<init>":()V
    10: aload_0
    11: invokestatic  #1197               // InterfaceMethod cats/effect/kernel/Fiber.$init$:(Lcats/effect/kernel/Fiber;)V
    14: aload_0
    15: new           #401                // class cats/effect/ArrayStack
    18: dup
    19: invokespecial #1198               // Method cats/effect/ArrayStack."<init>":()V
    22: putfield      #399                // Field objectState:Lcats/effect/ArrayStack;
    25: aload_0
    26: aload_0
    27: getfield      #223                // Field bitmap$init$0:I
    30: iconst_1
    31: ior
    32: putfield      #223                // Field bitmap$init$0:I
    35: aload_0
    36: aload         4
    38: putfield      #265                // Field currentCtx:Lscala/concurrent/ExecutionContext;
    41: aload_0
    42: aload_0
    43: getfield      #223                // Field bitmap$init$0:I
    46: iconst_2
    47: ior
    48: putfield      #223                // Field bitmap$init$0:I
    51: aload_0
    52: iconst_0
    53: putfield      #465                // Field canceled:Z
    56: aload_0
    57: aload_0
    58: getfield      #223                // Field bitmap$init$0:I
    61: iconst_4
    62: ior
    63: putfield      #223                // Field bitmap$init$0:I
    66: aload_0
    67: iconst_0
    68: putfield      #480                // Field masks:I
    71: aload_0
    72: aload_0
    73: getfield      #223                // Field bitmap$init$0:I
    76: bipush        8
    78: ior
    79: putfield      #223                // Field bitmap$init$0:I
    82: aload_0
    83: iconst_0
    84: putfield      #502                // Field finalizing:Z
    87: aload_0
    88: aload_0
    89: getfield      #223                // Field bitmap$init$0:I
    92: bipush        16
    94: ior
    95: putfield      #223                // Field bitmap$init$0:I
    98: aload_0
    99: new           #401                // class cats/effect/ArrayStack
    102: dup
    103: invokespecial #1198               // Method cats/effect/ArrayStack."<init>":()V
    106: putfield      #470                // Field finalizers:Lcats/effect/ArrayStack;
    109: aload_0
    110: aload_0
    111: getfield      #223                // Field bitmap$init$0:I
    114: bipush        32
    116: ior
    117: putfield      #223                // Field bitmap$init$0:I
    120: aload_0
    121: new           #753                // class cats/effect/CallbackStack
    124: dup
    125: aload_2
    126: invokespecial #1201               // Method cats/effect/CallbackStack."<init>":(Lscala/Function1;)V
    129: putfield      #751                // Field callbacks:Lcats/effect/CallbackStack;
    132: aload_0
    133: aload_0
    134: getfield      #223                // Field bitmap$init$0:I
    137: bipush        64
    139: ior
    140: putfield      #223                // Field bitmap$init$0:I
    143: aload_0
    144: aload_1
    145: putfield      #610                // Field localState:Lscala/collection/immutable/Map;
    148: aload_0
    149: aload_0
    150: getfield      #223                // Field bitmap$init$0:I
    153: sipush        128
    156: ior
    157: putfield      #223                // Field bitmap$init$0:I
    160: aload_0
    161: iconst_0
    162: putfield      #180                // Field resumeTag:B
    165: aload_0
    166: aload_0
    167: getfield      #223                // Field bitmap$init$0:I
    170: sipush        256
    173: ior
    174: putfield      #223                // Field bitmap$init$0:I
    177: aload_0
    178: aload_3
    179: putfield      #263                // Field resumeIO:Lcats/effect/IO;
    182: aload_0
    183: aload_0
    184: getfield      #223                // Field bitmap$init$0:I
    187: sipush        512
    190: ior
    191: putfield      #223                // Field bitmap$init$0:I
    194: aload_0
    195: getstatic     #652                // Field cats/effect/IOFiber$.MODULE$:Lcats/effect/IOFiber$;
    198: invokevirtual #1204               // Method cats/effect/IOFiber$.RightUnit:()Lscala/util/Right;
    201: putfield      #782                // Field RightUnit:Lscala/util/Either;
    204: aload_0
    205: aload_0
    206: getfield      #223                // Field bitmap$init$0:I
    209: sipush        1024
    212: ior
    213: putfield      #223                // Field bitmap$init$0:I
    216: aload_0
    217: getstatic     #1206               // Field cats/effect/IO$EndFiber$.MODULE$:Lcats/effect/IO$EndFiber$;
    220: putfield      #253                // Field IOEndFiber:Lcats/effect/IO$EndFiber$;
    223: aload_0
    224: aload_0
    225: getfield      #223                // Field bitmap$init$0:I
    228: sipush        2048
    231: ior
    232: putfield      #223                // Field bitmap$init$0:I
    235: aload_0
    236: getstatic     #1211               // Field cats/effect/tracing/RingBuffer$.MODULE$:Lcats/effect/tracing/RingBuffer$;
    239: aload         5
    241: invokevirtual #1214               // Method cats/effect/unsafe/IORuntime.traceBufferLogSize:()I
    244: invokevirtual #1218               // Method cats/effect/tracing/RingBuffer$.empty:(I)Lcats/effect/tracing/RingBuffer;
    247: putfield      #447                // Field tracingEvents:Lcats/effect/tracing/RingBuffer;
    250: aload_0
    251: aload_0
    252: getfield      #223                // Field bitmap$init$0:I
    255: sipush        4096
    258: ior
    259: putfield      #223                // Field bitmap$init$0:I
    262: aload_0
    263: getstatic     #533                // Field cats/effect/IO$.MODULE$:Lcats/effect/IO$;
    266: aload_0
    267: invokedynamic #1224,  0           // InvokeDynamic #19:apply:(Lcats/effect/IOFiber;)Lscala/Function1;
    272: invokevirtual #1227               // Method cats/effect/IO$.uncancelable:(Lscala/Function1;)Lcats/effect/IO;
    275: putfield      #225                // Field cancel:Lcats/effect/IO;
    278: aload_0
    279: aload_0
    280: getfield      #223                // Field bitmap$init$0:I
    283: sipush        8192
    286: ior
    287: putfield      #223                // Field bitmap$init$0:I
    290: aload_0
    291: getstatic     #533                // Field cats/effect/IO$.MODULE$:Lcats/effect/IO$;
    294: aload_0
    295: invokedynamic #1231,  0           // InvokeDynamic #20:apply:(Lcats/effect/IOFiber;)Lscala/Function1;
    300: invokevirtual #629                // Method cats/effect/IO$.async:(Lscala/Function1;)Lcats/effect/IO;
    303: putfield      #239                // Field join:Lcats/effect/IO;
    306: aload_0
    307: aload_0
    308: getfield      #223                // Field bitmap$init$0:I
    311: sipush        16384
    314: ior
    315: putfield      #223                // Field bitmap$init$0:I
    318: return

while the code in this PR looks like the following:

public cats.effect.IOFiber(int[], cats.effect.ArrayStack<java.lang.Object>, scala.concurrent.ExecutionContext, int, boolean, boolean, byte, cats.effect.IO<java.lang.Object>, cats.effect.ArrayStack<cats.effect.IO<scala.runtime.BoxedUnit>>, cats.effect.CallbackStack<A>, scala.collection.immutable.Map<cats.effect.IOLocal<?>, java.lang.Object>, cats.effect.tracing.RingBuffer, cats.effect.unsafe.IORuntime);
  Code:
      0: aload_0
      1: aload_1
      2: putfield      #427                // Field conts:[I
      5: aload_0
      6: aload_2
      7: putfield      #415                // Field objectState:Lcats/effect/ArrayStack;
    10: aload_0
    11: aload_3
    12: putfield      #285                // Field currentCtx:Lscala/concurrent/ExecutionContext;
    15: aload_0
    16: iload         4
    18: putfield      #496                // Field masks:I
    21: aload_0
    22: iload         5
    24: putfield      #481                // Field canceled:Z
    27: aload_0
    28: iload         6
    30: putfield      #518                // Field finalizing:Z
    33: aload_0
    34: iload         7
    36: putfield      #170                // Field resumeTag:B
    39: aload_0
    40: aload         8
    42: putfield      #283                // Field resumeIO:Lcats/effect/IO;
    45: aload_0
    46: aload         9
    48: putfield      #486                // Field finalizers:Lcats/effect/ArrayStack;
    51: aload_0
    52: aload         10
    54: putfield      #737                // Field callbacks:Lcats/effect/CallbackStack;
    57: aload_0
    58: aload         11
    60: putfield      #609                // Field localState:Lscala/collection/immutable/Map;
    63: aload_0
    64: aload         12
    66: putfield      #463                // Field tracingEvents:Lcats/effect/tracing/RingBuffer;
    69: aload_0
    70: aload         13
    72: putfield      #275                // Field runtime:Lcats/effect/unsafe/IORuntime;
    75: aload_0
    76: invokespecial #1176               // Method cats/effect/IOFiberPlatform."<init>":()V
    79: aload_0
    80: invokestatic  #1180               // InterfaceMethod cats/effect/kernel/Fiber.$init$:(Lcats/effect/kernel/Fiber;)V
    83: return

In other words, there is much less code in the constructor and, more importantly, there are no writes to a volatile field. This is immediately noticeable in AsyncBenchmark.start:

series/3.x:

Benchmark             (size)   Mode  Cnt     Score    Error  Units
AsyncBenchmark.start     100  thrpt   20  7401.313 ± 73.121  ops/s

This PR:

Benchmark             (size)   Mode  Cnt     Score     Error  Units
AsyncBenchmark.start     100  thrpt   20  8567.092 ± 193.809  ops/s
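To make the constructor change from item 1 concrete, here is a minimal sketch of the two field placements. `BodyFields` and `CtorFields` are hypothetical stand-ins, not the real IOFiber. The bitmap$init field most likely comes from scalac's `-Xcheckinit` flag (an assumption, the PR does not say so), so whether it appears depends on the compiler flags in use.

```scala
object FieldPlacement {
  // Fields initialised in the class body. With initialisation checks enabled,
  // scalac may emit a bitmap$init$N field and update it after each assignment,
  // as in the series/3.x bytecode above.
  final class BodyFields(initialMasks: Int) {
    private[this] var masks: Int = initialMasks
    private[this] var canceled: Boolean = false

    def mask(): Int = { masks += 1; masks }
    def cancel(): Unit = canceled = true
    def isCanceled: Boolean = canceled
  }

  // The same fields declared as primary-constructor parameters. The
  // constructor then only copies its arguments into the fields.
  final class CtorFields(
      private[this] var masks: Int,
      private[this] var canceled: Boolean
  ) {
    def mask(): Int = { masks += 1; masks }
    def cancel(): Unit = canceled = true
    def isCanceled: Boolean = canceled
  }

  def main(args: Array[String]): Unit = {
    val a = new BodyFields(0)
    val b = new CtorFields(0, false)
    a.cancel()
    b.cancel()
    // Both variants behave identically; only the generated constructor differs.
    println(s"${a.mask()} ${b.mask()} ${a.isCanceled} ${b.isCanceled}")
  }
}
```

Compiling this with the project's flags and running `javap -p -c` on both classes should show whether the bitmap field and its writes are present in each constructor.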
  2. Fibers now fit on a single cache line* (I think it's fair not to count the stacks, as they are necessarily on a different cache line due to how the JVM treats objects and arrays). Proof:
cats.effect.IOFiber object internals:
OFF  SZ                                TYPE DESCRIPTION               VALUE
  0   8                                     (object header: mark)     N/A
  8   4                                     (object header: class)    N/A
 12   4                                 int AtomicBoolean.value       N/A
 16   4                                 int IOFiber.masks             N/A
 20   1                             boolean IOFiber.canceled          N/A
 21   1                             boolean IOFiber.finalizing        N/A
 22   1                                byte IOFiber.resumeTag         N/A
 23   1                                     (alignment/padding gap)   
 24   4                               int[] IOFiber.conts             N/A
 28   4              cats.effect.ArrayStack IOFiber.objectState       N/A
 32   4   scala.concurrent.ExecutionContext IOFiber.currentCtx        N/A
 36   4                      cats.effect.IO IOFiber.resumeIO          N/A
 40   4              cats.effect.ArrayStack IOFiber.finalizers        N/A
 44   4           cats.effect.CallbackStack IOFiber.callbacks         N/A
 48   4      scala.collection.immutable.Map IOFiber.localState        N/A
 52   4      cats.effect.tracing.RingBuffer IOFiber.tracingEvents     N/A
 56   4        cats.effect.unsafe.IORuntime IOFiber.runtime           N/A
 60   4          cats.effect.kernel.Outcome IOFiber.outcome           N/A
Instance size: 64 bytes
Space losses: 1 bytes internal + 0 bytes external = 1 bytes total
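The 64-byte total can be checked by summing the entries of the layout listing. The sketch below only redoes that arithmetic; the object name `FiberLayoutCheck` and the 64-byte cache line size are assumptions (64 bytes is typical for current x86-64 CPUs, but not universal).

```scala
object FiberLayoutCheck {
  // Object header: 8-byte mark word plus 4-byte compressed class pointer.
  val headerBytes: Int = 8 + 4

  // (field, size in bytes), copied from the layout listing above.
  val fields: List[(String, Int)] = List(
    "AtomicBoolean.value" -> 4,
    "IOFiber.masks" -> 4,
    "IOFiber.canceled" -> 1,
    "IOFiber.finalizing" -> 1,
    "IOFiber.resumeTag" -> 1,
    "(alignment/padding gap)" -> 1,
    "IOFiber.conts" -> 4,
    "IOFiber.objectState" -> 4,
    "IOFiber.currentCtx" -> 4,
    "IOFiber.resumeIO" -> 4,
    "IOFiber.finalizers" -> 4,
    "IOFiber.callbacks" -> 4,
    "IOFiber.localState" -> 4,
    "IOFiber.tracingEvents" -> 4,
    "IOFiber.runtime" -> 4,
    "IOFiber.outcome" -> 4
  )

  val instanceBytes: Int = headerBytes + fields.map(_._2).sum

  // Assumed cache line size; check the target hardware.
  val cacheLineBytes: Int = 64

  def main(args: Array[String]): Unit =
    println(s"instance = $instanceBytes bytes, fits = ${instanceBytes <= cacheLineBytes}")
}
```

Note that a 64-byte instance only occupies exactly one cache line if it also starts on a 64-byte boundary, which the JVM's default 8-byte object alignment does not guarantee.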

Here are the comparative benchmarks:

series/3.x:

Benchmark                                  Mode  Cnt      Score    Error    Units
Benchmarks.catsEffect3Alloc               thrpt   20      4.493 ±  0.235  ops/min
Benchmarks.catsEffect3DeepBind            thrpt   20  16892.459 ± 37.020    ops/s
Benchmarks.catsEffect3EnqueueDequeue      thrpt   20    456.915 ±  2.558    ops/s
Benchmarks.catsEffect3RuntimeChainedFork  thrpt   20   3364.922 ± 55.158    ops/s
Benchmarks.catsEffect3RuntimeForkMany     thrpt   20    384.342 ±  2.905    ops/s
Benchmarks.catsEffect3RuntimePingPong     thrpt   20    949.036 ± 14.659    ops/s
Benchmarks.catsEffect3RuntimeYieldMany    thrpt   20    243.165 ± 99.287    ops/s
Benchmarks.catsEffect3Scheduling          thrpt   20     12.524 ±  0.567  ops/min

This PR:

Benchmark                                  Mode  Cnt      Score    Error    Units
Benchmarks.catsEffect3Alloc               thrpt   20      4.179 ±  0.067  ops/min
Benchmarks.catsEffect3DeepBind            thrpt   20  16459.395 ± 64.451    ops/s
Benchmarks.catsEffect3EnqueueDequeue      thrpt   20    440.580 ± 13.961    ops/s
Benchmarks.catsEffect3RuntimeChainedFork  thrpt   20   6853.426 ± 46.470    ops/s
Benchmarks.catsEffect3RuntimeForkMany     thrpt   20    743.996 ±  4.290    ops/s
Benchmarks.catsEffect3RuntimePingPong     thrpt   20    940.144 ± 15.750    ops/s
Benchmarks.catsEffect3RuntimeYieldMany    thrpt   20    284.461 ± 22.554    ops/s
Benchmarks.catsEffect3Scheduling          thrpt   20     12.814 ±  0.402  ops/min
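To read the two tables side by side, the relative change per benchmark can be computed directly from the scores. This is a small sketch over the numbers quoted above; the object name `BenchmarkDelta` is hypothetical, and the error margins are ignored, so small differences (and the noisy RuntimeYieldMany baseline) should not be over-interpreted.

```scala
object BenchmarkDelta {
  // (benchmark, series/3.x score, this PR score), copied from the tables above.
  // Units differ per row (ops/s or ops/min), but the ratio is unitless.
  val scores: List[(String, Double, Double)] = List(
    ("Alloc", 4.493, 4.179),
    ("DeepBind", 16892.459, 16459.395),
    ("EnqueueDequeue", 456.915, 440.580),
    ("RuntimeChainedFork", 3364.922, 6853.426),
    ("RuntimeForkMany", 384.342, 743.996),
    ("RuntimePingPong", 949.036, 940.144),
    ("RuntimeYieldMany", 243.165, 284.461),
    ("Scheduling", 12.524, 12.814)
  )

  // Throughput ratio: values above 1.0 mean this PR is faster.
  def ratio(before: Double, after: Double): Double = after / before

  def main(args: Array[String]): Unit =
    scores.foreach { case (name, before, after) =>
      println(f"$name%-20s ${ratio(before, after)}%.2fx")
    }
}
```

The fork-heavy benchmarks (RuntimeChainedFork, RuntimeForkMany) come out at roughly twice the baseline throughput, while the others stay within a few percent of it.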

The benchmarks were run using the following snapshot:

"io.vasilev" %% "cats-effect" % "3.3-399-71c3d2d"

I think there is something to this approach and we should pursue it more aggressively. I would really love to see the constructor changes merged. I will split that part out in a separate PR in the next few days.

@armanbilge
Member

🤩 really amazing to watch this series of PRs ... also I might just switch to io.vasilev bootleg for all my typelevel needs 😉

Just a question: to achieve that 64 byte fiber size, you are (effectively) assuming the CompressedOops optimization, right?

@vasilmkd
Member Author

Yes, it has been on by default since JDK 6, I believe. It's guaranteed for heaps up to 32 GB in size, and I believe there's support for larger heaps, but don't quote me on that.

@armanbilge
Member

Yes, I think you can fiddle with the object alignment, but YMMV. Maybe it's worth it for up to 64 GB or something, idk.
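To illustrate why the 64-byte figure depends on compressed oops, here is a rough size estimate for the same field set with 4-byte and 8-byte references. It is a simplified model, not the JVM's actual layout algorithm: it assumes primitives are packed right after the header and references follow, and the header sizes (12 bytes with compressed class pointers, 16 without) are typical 64-bit HotSpot values rather than something stated in this PR. The object name `OopsEstimate` is hypothetical.

```scala
object OopsEstimate {
  // Round n up to the next multiple of `to`.
  def align(n: Int, to: Int): Int = (n + to - 1) / to * to

  // Estimated instance size in bytes for the IOFiber field set listed above.
  def estimate(headerBytes: Int, refBytes: Int): Int = {
    // AtomicBoolean.value, masks (ints); canceled, finalizing, resumeTag (1 byte each).
    val primitiveBytes = 4 + 4 + 1 + 1 + 1
    // conts, objectState, currentCtx, resumeIO, finalizers, callbacks,
    // localState, tracingEvents, runtime, outcome.
    val refCount = 10
    val refsStart = align(headerBytes + primitiveBytes, refBytes)
    // Objects are padded to the default 8-byte object alignment.
    align(refsStart + refCount * refBytes, 8)
  }

  def main(args: Array[String]): Unit = {
    println(estimate(headerBytes = 12, refBytes = 4)) // 64, matches the listing above
    println(estimate(headerBytes = 16, refBytes = 8)) // 112 under this model
  }
}
```

Under this model the fiber would span two cache lines once references are 8 bytes wide. On HotSpot, the relevant flags are `-XX:+UseCompressedOops` and `-XX:ObjectAlignmentInBytes`; raising the alignment extends the heap range that compressed oops can address, at the cost of more padding per object.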
