
Feature suggestion (memcpy/memset) #236

Closed
lskyum opened this issue Jun 29, 2015 · 30 comments

@lskyum

lskyum commented Jun 29, 2015

First of all - sorry if I missed it in the spec!

The suggestion is for simple memcpy/memset instructions, so copying/clearing can be done at maximum speed for any given platform/browser. For example, bounds checking doesn't have to be done per read/write, but only once. (This could be an issue on some implementations, I think. Maybe also in the polyfill...)

The memcpy could be provided in two versions, one that checks for overlaps and one that doesn't.
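
A rough C sketch of the intended effect (hypothetical names, not a real wasm API): with a dedicated instruction, the engine can validate the whole range once and then copy at full speed, instead of bounds-checking every access.

    #include <stddef.h>
    #include <string.h>

    extern unsigned char heap[];   /* the module's linear memory */
    extern size_t heap_size;

    /* returns 0 (would trap in wasm) if the range is out of bounds */
    int copy_checked(size_t dst, size_t src, size_t n) {
        if (n > heap_size || dst > heap_size - n || src > heap_size - n)
            return 0;
        memmove(heap + dst, heap + src, n);   /* single check, then full-speed copy */
        return 1;
    }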

@cristiano-belloni

Aren't memcpy / memset / memmove usually implemented at a higher level, i.e. in the C stdlib? Example implementation in XNU: http://www.opensource.apple.com/source/xnu/xnu-2050.18.24/libsyscall/wrappers/memcpy.c

@waywardmonkeys

In emscripten, memset and memcpy can result in pretty bloated code, especially when compared to native code and what a loop expands into. (We suffer a good bit from things related to this in our code base that is compiled with emscripten.) I don't know how that would look in WebAssembly, or whether it would be less bloated.

@MikeHolman
Member

Interesting thought.

You might be able to rely on the wasm engine recognizing this pattern and optimizing the loop/removing bound checks (our engine is starting to do some optimizations when it recognizes the "memcpy" pattern in js). And when we have SIMD support we can implement memcpy pretty efficiently in wasm without as much reliance on optimizations from the wasm engine. To this end, it would be interesting to see how much SIMD helps memcpy in asm.js.

I think this should be less bloated in wasm (even in asm.js I think we could probably just not inline memcpy and clear up the code bloat -- I'd think in most cases inlining them won't really improve perf because call overhead would be dwarfed by the actual copy).

But it might be worth considering because it could help perf for a baseline compiler (which might not have bounds-check elimination), and it could reduce AST size (obviously important).

@waywardmonkeys

@MikeHolman The issue with not inlining in emscripten is that various patterns of code get converted to llvm.memcpy* and llvm.memset* calls ... this happens in constructors and other locations, as well as (sometimes) when LLVM looks at a loop and decides it can be converted to one of these calls.
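
For reference, even a plain element-wise loop like the one below is typically rewritten by LLVM's loop-idiom recognition into an llvm.memcpy call (a generic illustration, not code taken from emscripten itself):

    #include <stddef.h>

    void copy_ints(int *restrict dst, const int *restrict src, size_t n) {
        /* at -O2 this loop is usually turned into llvm.memcpy, which
           emscripten then has to lower again on the JS/asm.js side */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }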

@sunfishcode
Member

WebAssembly does have fewer pain points than asm.js in this area, but there would still be advantages to adding these.

For memcpy, as opposed to memmove, it's not clear that introducing nondeterminism for the overlapping case is worth the minor performance difference, and we can always add it in the future if we wish.

@jfbastien
Member

Agreed on adding memset and memmove intrinsics:

  • Different architectures want different code (even different x86 CPUs want something different), and I'm comfortable letting the developer-side compiler (e.g. LLVM) peephole. Having the WebAssembly compiler detect these patterns isn't something I'd advise for now.
  • Agreed with @sunfishcode that memcpy could be added after memmove if data shows we need it.

The next question is how we should define intrinsics in WebAssembly. Unresolved external functions with a special name? That's getting dangerously close to having some limited ABI, but I'm OK with that in this case.

@lskyum
Author

lskyum commented Jun 29, 2015

An improvement to the idea could be memcpy4/memcpy8/memcpy16 variants that only allow a multiple of 4/8/16 bytes to be copied, so some checks could be skipped, for example when using SIMD instructions for copying.

@sunfishcode
Member

@jfbastien If we add new builtin operations, we should just add them as plain AST node kinds. I don't think we need an instruction/intrinsic distinction as LLVM has.

@lskyum I agree that some provision for copying with known size multiples, and known alignments too for that matter, is worth considering.

@jfbastien
Member

@sunfishcode AST nodes sound good.

On known size / alignment: what's the behavior when the code is incorrect? Force realign+overcopy, trap, implementation defined?

@sunfishcode
Member

For the size multiples, one possibility would be to have a scale operand, possibly required to be a literal, that the size operand would be implicitly shifted by. For alignment, perhaps we could just do what load+store do and say "it's really slow" if the given alignment is wrong.
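
A sketch of how a literal scale operand might look (the signature and names here are illustrative only, written in the int32 notation used elsewhere in this thread; the byte count would be size << scale):

    #include <stdint.h>

    typedef int32_t int32;   /* notation matching the proposal in this thread */

    /* hypothetical: 'scale' would be required to be a literal;
       bytes moved = size << scale */
    int32 memmove32_scaled(int32 dst, int32 src, int32 size, int32 scale);

    /* e.g. memmove32_scaled(d, s, 100, 3) moves 100 << 3 == 800 bytes,
       and the engine statically knows the copy is a multiple of 8 bytes */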

@titzer

titzer commented Jun 30, 2015

I think memmove would be OK, but I would prefer not to have many variants for different sizes and different alignment cases. The primitive of "copy this many bytes from here to here" is plenty expressive, and the engine should do optimizations when it detects, either statically or dynamically, that an optimized implementation would be better.

@lskyum
Author

lskyum commented Jul 1, 2015

Just so it won't be forgotten: a memset (or memclear) would be useful for implementing a C#/Java VM, because all objects and arrays are initialized to 0/null on creation.
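
A minimal sketch (hypothetical allocator, not from any particular VM) of why bulk zeroing matters for managed runtimes: every freshly allocated object or array must read back as all zero bits.

    #include <stddef.h>
    #include <string.h>

    /* 'mem' points at freshly reserved space inside the VM's heap */
    void *vm_alloc_zeroed(void *mem, size_t nbytes) {
        memset(mem, 0, nbytes);   /* the hot path a memzero/memclear op would serve */
        return mem;
    }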

@sunfishcode
Member

I am contemplating a proposal as follows:

  int32 memmove32(int32 dst, int32 src, int32 size);
  int32 memzero32(int32 dst, int32 size);

Semantics are hopefully obvious :-). With >4GiB heaps, 64-bit versions would be added too. I'm leaning away from size/alignment hints for now because I don't think they're the most critical, and there's value in starting simple. The return value is the value of dst, and is there because it allows greater expression tree formation. Comments?
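
A reference sketch in C of one reading of those semantics (the linear-memory model and the trap-on-out-of-bounds behavior are assumptions, not spec text):

    #include <stdint.h>
    #include <string.h>

    extern uint8_t  mem[];        /* linear memory */
    extern uint32_t mem_size;     /* current memory size in bytes */

    uint32_t memmove32(uint32_t dst, uint32_t src, uint32_t size) {
        /* presumably traps if either range runs past the end of memory */
        memmove(mem + dst, mem + src, size);   /* overlap is permitted */
        return dst;                            /* dst returned for expression-tree formation */
    }

    uint32_t memzero32(uint32_t dst, uint32_t size) {
        memset(mem + dst, 0, size);
        return dst;
    }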

@sunfishcode sunfishcode self-assigned this Jul 28, 2015
@lskyum
Author

lskyum commented Jul 28, 2015

I understand the concerns with nondeterminism, but I can't help thinking that memcpy would be useful for copying structs around without having to do overlap checking. I think it could replace many load/store operations and allow them to be optimized as well, especially when the size is a constant.

@titzer

titzer commented Jul 28, 2015

Checks for overlapping memory regions probably aren't that expensive in comparison to the actual copy operation. Since memcpy would be an optimization on top of the more semantically well-defined memmove, it should be motivated by performance data.

@titzer

titzer commented Jul 28, 2015

Regarding the proposed memmove32/memzero32: I assume these operations work on bytes, and that the "32" in "memmove32" refers to the offsets being 32 bits?

@lskyum
Author

lskyum commented Jul 28, 2015

@titzer Alright, I'll take your word for it :-)

@qwertie

qwertie commented Jul 28, 2015

The overlap check wouldn't be expensive relative to the copy? If we're talking about structs, they might be 12, 16, 20, 24 bytes... memmove requires, what, two adds, two comparisons, two conditional branches and an unconditional branch? Sounds expensive. Edit: Oops, I guess overlap only changes the copy direction in one case, so only one add/comparison is needed.
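
For reference, a minimal C memmove sketch (a generic textbook version, not any particular libc): a single pointer comparison chooses the copy direction, which is all the overlap handling amounts to.

    #include <stddef.h>

    void *memmove_sketch(void *dstv, const void *srcv, size_t n) {
        unsigned char *dst = dstv;
        const unsigned char *src = srcv;
        if (dst < src) {
            for (size_t i = 0; i < n; i++)    /* copy forward */
                dst[i] = src[i];
        } else {
            while (n-- > 0)                   /* copy backward */
                dst[n] = src[n];
        }
        return dstv;
    }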

@sunfishcode
Member

@titzer Yes, the 32 is for 32-bit operands. All sizes and addresses are in bytes here.

@qwertie Small known-size known-alignment copies like that can be done with plain loads and stores (and no overlap check) rather than memmove at the wasm level, when speed is a concern. Also, I believe the door is still open to adding size, alignment, and perhaps even aliasing hints in the future. I just don't want to block solving the medium/large cases on design and agreement on the small/partially-known cases.
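
To illustrate the small-copy case (generic C, not wasm output): a fixed-size struct assignment compiles down to a handful of plain loads and stores, with no size or overlap check at all.

    #include <stdint.h>

    struct vec4 { float x, y, z, w; };   /* 16 bytes, 4-byte aligned */

    void copy_vec4(struct vec4 *dst, const struct vec4 *src) {
        *dst = *src;   /* typically two 8-byte or one 16-byte load/store pair */
    }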

@jfbastien
Member

@sunfishcode: I find the 32 confusing.

Agreed with @sunfishcode: small copies should be generated by the compiler as something other than memmove.

@sunfishcode: add memset?

@kg
Contributor

kg commented Jul 28, 2015

I would actually strongly advocate in favor of arbitrary-sized copies being emitted as memmove with constant sizes instead of arbitrary sets of loads/stores. That lets the runtime intelligently figure out the best way to copy N bytes. The alternative has proven to be a hellpit in JavaScript.

I think it will also pretty much always be smaller on the wire and less work to decode.

@sunfishcode
Member

@jfbastien The 32 is just meant to leave namespace room for a 64 version in the future. I'm open to suggestions.

@jfbastien memzero covers the vast majority of memset use cases.

@sunfishcode
Member

@kg Would you be OK if we just do this simple memmove/memclear proposal first, get consensus that we should do something at all, and then extend it with size/alignment/aliasing/locality hints in a follow-up issue?

@kg
Contributor

kg commented Jul 28, 2015

Yes, punting on it is totally fine. I was just responding to "Small known-size known-alignment copies like that can be done with plain loads and stores (and no overlap check) rather than memmove at the wasm level." I don't think you need a size hint; the runtime can just detect that the size argument is a known constant?

@jfbastien
Member

memset covers more use cases than memzero, and is easy to transform into memzero when provided with a constant zero value. It's just as simple, yet more powerful, so the Extensible Web Manifesto answer is yes.
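
A small illustration of that transform (function names are illustrative): memset with a constant zero byte is exactly a memzero, so a producer or the engine can rewrite one into the other, while the general memset still covers nonzero fills.

    #include <stddef.h>
    #include <string.h>

    void clear(void *p, size_t n) {
        memset(p, 0, n);        /* equivalent to a dedicated memzero(p, n) */
    }

    void fill(void *p, int byte, size_t n) {
        memset(p, byte, n);     /* the general case memzero can't express */
    }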

@sunfishcode
Member

@jfbastien And memset_pattern covers even more use cases than memset, because single-byte patterns are limiting! Where do we draw the line? Memzero is the simplest thing that solves the problem we're trying to solve here. Bulk-zeroing memory is something even the most minimal implementations are going to need to be able to do anyway.

@kg Right, if you have a constant size, then you just need the alignment of the source, the alignment of the destination (should we have separate hints, or one for both?), and maybe also non-overlap of source and destination.

@jfbastien
Member

@sunfishcode that's a pretty strawman argument...

@sunfishcode sunfishcode added this to the Future Features milestone Oct 22, 2015
@sunfishcode sunfishcode removed their assignment Jan 13, 2017
@MikeHolman
Member

I've been looking at perf profiles for the wasm Unity benchmark a bit recently, and I see that some of the hottest functions are doing memcpy- or memset-like things.

If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic, so it may be worth prioritizing.

@lukewagner
Member

Yes, we've seen that in several cases as well, so agreed on prioritizing this for the next batch of features.

@jfbastien
Member

Addressed by #1057.
