
Feature suggestion (memcpy/memset) #236

Closed
lskyum opened this issue Jun 29, 2015 · 30 comments

@lskyum

lskyum commented Jun 29, 2015

First of all - sorry if I missed it in the spec!

The suggestion is for simple memcpy/memset instructions, so copying/clearing can be done at maximum speed for any given platform/browser. For example, bounds checking doesn't have to be done per read/write, but only once. (This could be an issue on some implementations, I think. Maybe also in the polyfill...)

The memcpy could be provided in two versions, one that checks for overlaps and one that doesn't.
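
A rough C sketch of the intended effect (hypothetical names, not a real wasm API): with a dedicated instruction, the engine can validate the whole range once and then copy at full speed, instead of bounds-checking every access.

    #include <stddef.h>
    #include <string.h>

    extern unsigned char heap[];   /* the module's linear memory */
    extern size_t heap_size;

    /* returns 0 (would trap in wasm) if the range is out of bounds */
    int copy_checked(size_t dst, size_t src, size_t n) {
        if (n > heap_size || dst > heap_size - n || src > heap_size - n)
            return 0;
        memmove(heap + dst, heap + src, n);   /* single check, then full-speed copy */
        return 1;
    }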

@cristiano-belloni

Aren't memcpy / memset / memmove usually implemented at a higher level, i.e. in the C stdlib? Example implementation in XNU: http://www.opensource.apple.com/source/xnu/xnu-2050.18.24/libsyscall/wrappers/memcpy.c

@waywardmonkeys

In emscripten, memset and memcpy can result in pretty bloated code, especially when compared to native code and what a loop expands into. (We suffer a good bit from things related to this in our code base that is compiled with emscripten.) I don't know how that would look in WebAssembly, or whether it would be less bloated.

@MikeHolman
Member

Interesting thought.

You might be able to rely on the wasm engine recognizing this pattern and optimizing the loop/removing bound checks (our engine is starting to do some optimizations when it recognizes the "memcpy" pattern in js). And when we have SIMD support we can implement memcpy pretty efficiently in wasm without as much reliance on optimizations from the wasm engine. To this end, it would be interesting to see how much SIMD helps memcpy in asm.js.

I think this should be less bloated in wasm (even in asm.js I think we could probably just not inline memcpy and clear up the code bloat -- I'd think in most cases inlining them won't really improve perf because call overhead would be dwarfed by the actual copy).

But it might be worth considering because it could help perf for a baseline compiler (which might not have bounds-check elimination), and it could reduce AST size (obviously important).

@waywardmonkeys

@MikeHolman The issue with not inlining in emscripten is that various patterns of code get converted to llvm.memcpy* and llvm.memset* calls ... this happens in constructors and other locations, as well as (sometimes) when LLVM looks at a loop and decides it can be converted to one of these calls.
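
For reference, even a plain element-wise loop like the one below is typically rewritten by LLVM's loop-idiom recognition into an llvm.memcpy call (a generic illustration, not code taken from emscripten itself):

    #include <stddef.h>

    void copy_ints(int *restrict dst, const int *restrict src, size_t n) {
        /* at -O2 this loop is usually turned into llvm.memcpy, which
           emscripten then has to lower again on the JS/asm.js side */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }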

@sunfishcode
Member

WebAssembly does have fewer pain points than asm.js in this area, but there would still be advantages to adding these.

For memcpy, as opposed to memmove, it's not clear that introducing nondeterminism for the overlapping case is worth the minor performance difference, and we can always add it in the future if we wish.

@jfbastien
Member

Agreed on adding memset and memmove intrinsics:

  • Different architectures want different code (even different x86 CPUs want something different), and I'm comfortable letting the developer-side compiler (e.g. LLVM) peephole. Having the WebAssembly compiler detect these patterns isn't something I'd advise for now.
  • Agreed with @sunfishcode that memcpy could be added after memmove if data shows we need it.

The next question is how we should define intrinsics in WebAssembly. Unresolved external functions with a special name? That's getting dangerously close to having some limited ABI, but I'm OK with that in this case.

@lskyum
Author

lskyum commented Jun 29, 2015

An improvement to the idea could be memcpy4/memcpy8/memcpy16 variants that only allow a multiple of 4/8/16 bytes to be copied, so some checks could be skipped, for example when using SIMD instructions for copying.

@sunfishcode
Member

@jfbastien If we add new builtin operations, we should just add them as plain AST node kinds. I don't think we need an instruction/intrinsic distinction as LLVM has.

@lskyum I agree that some provision for copying with known size multiples, and known alignments too for that matter, is worth considering.

@jfbastien
Member

@sunfishcode AST nodes sound good.

On known size / alignment: what's the behavior when the code is incorrect? Force realign+overcopy, trap, implementation defined?

@sunfishcode
Member

For the size multiples, one possibility would be to have a scale operand, possibly required to be a literal, that the size operand would be implicitly shifted by. For alignment, perhaps we could just do what load+store do and say "it's really slow" if the given alignment is wrong.
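
A sketch of how a literal scale operand might look (the signature and names here are illustrative only, written in the int32 notation used elsewhere in this thread; the byte count would be size << scale):

    #include <stdint.h>

    typedef int32_t int32;   /* notation matching the proposal in this thread */

    /* hypothetical: 'scale' would be required to be a literal;
       bytes moved = size << scale */
    int32 memmove32_scaled(int32 dst, int32 src, int32 size, int32 scale);

    /* e.g. memmove32_scaled(d, s, 100, 3) moves 100 << 3 == 800 bytes,
       and the engine statically knows the copy is a multiple of 8 bytes */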

@titzer

titzer commented Jun 30, 2015

I think memmove would be OK, but I would prefer not to have many variants for different sizes and different alignment cases. The primitive of "copy this many bytes from here to here" is plenty expressive, and the engine should do optimizations when it detects, either statically or dynamically, that an optimized implementation would be better.

@lskyum
Author

lskyum commented Jul 1, 2015

Just so it won't be forgotten: a memset (or memclear) would be useful for implementing a C#/Java VM, because all objects and arrays are initialized to 0/null on creation.
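
A minimal sketch (hypothetical allocator, not from any particular VM) of why bulk zeroing matters for managed runtimes: every freshly allocated object or array must read back as all zero bits.

    #include <stddef.h>
    #include <string.h>

    /* 'mem' points at freshly reserved space inside the VM's heap */
    void *vm_alloc_zeroed(void *mem, size_t nbytes) {
        memset(mem, 0, nbytes);   /* the hot path a memzero/memclear op would serve */
        return mem;
    }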

@sunfishcode
Member

I am contemplating a proposal as follows:

  int32 memmove32(int32 dst, int32 src, int32 size);
  int32 memzero32(int32 dst, int32 size);

Semantics are hopefully obvious :-). With >4GiB heaps, 64-bit versions would be added too. I'm leaning away from size/alignment hints for now because I don't think they're the most critical, and there's value in starting simple. The return value is the value of dst, and is there because it allows greater expression tree formation. Comments?
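
A reference sketch in C of one reading of those semantics (the linear-memory model and the trap-on-out-of-bounds behavior are assumptions, not spec text):

    #include <stdint.h>
    #include <string.h>

    extern uint8_t  mem[];        /* linear memory */
    extern uint32_t mem_size;     /* current memory size in bytes */

    uint32_t memmove32(uint32_t dst, uint32_t src, uint32_t size) {
        /* presumably traps if either range runs past the end of memory */
        memmove(mem + dst, mem + src, size);   /* overlap is permitted */
        return dst;                            /* dst returned for expression-tree formation */
    }

    uint32_t memzero32(uint32_t dst, uint32_t size) {
        memset(mem + dst, 0, size);
        return dst;
    }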

@sunfishcode sunfishcode self-assigned this Jul 28, 2015
@lskyum
Author

lskyum commented Jul 28, 2015

I understand the concerns with nondeterminism, but I can't help thinking that memcpy would be useful for copying structs around without having to do overlap checking. I think it could replace many load/store operations and allow them to be optimized as well, especially when the size is a constant.

@titzer

titzer commented Jul 28, 2015

Checks for overlapping memory regions probably aren't that expensive in comparison to the actual copy operation. Since memcpy would be an optimization on top of the more semantically well-defined memmove, it should be motivated by performance data.

@titzer

titzer commented Jul 28, 2015

Regarding the proposed memmove32/memzero32: I assume these operations work on bytes, and that the "32" in "memmove32" refers to the offsets being 32 bits?

@lskyum
Author

lskyum commented Jul 28, 2015

@titzer Alright, I'll take your word for it :-)

@qwertie

qwertie commented Jul 28, 2015

The overlap check wouldn't be expensive relative to the copy? If we're talking about structs, they might be 12, 16, 20, 24 bytes... memmove requires, what, two adds, two comparisons, two conditional branches and an unconditional branch? Sounds expensive. Edit: Oops, I guess overlap only changes the copy direction in one case, so only one add/comparison is needed.
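
For reference, a minimal C memmove sketch (a generic textbook version, not any particular libc): a single pointer comparison chooses the copy direction, which is all the overlap handling amounts to.

    #include <stddef.h>

    void *memmove_sketch(void *dstv, const void *srcv, size_t n) {
        unsigned char *dst = dstv;
        const unsigned char *src = srcv;
        if (dst < src) {
            for (size_t i = 0; i < n; i++)    /* copy forward */
                dst[i] = src[i];
        } else {
            while (n-- > 0)                   /* copy backward */
                dst[n] = src[n];
        }
        return dstv;
    }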

@sunfishcode
Member

@titzer Yes, the 32 is for 32-bit operands. All sizes and addresses are in bytes here.

@qwertie Small known-size known-alignment copies like that can be done with plain loads and stores (and no overlap check) rather than memmove at the wasm level, when speed is a concern. Also, I believe the door is still open to adding size, alignment, and perhaps even aliasing hints in the future. I just don't want to block solving the medium/large cases on design and agreement on the small/partially-known cases.
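
To illustrate the small-copy case (generic C, not wasm output): a fixed-size struct assignment compiles down to a handful of plain loads and stores, with no size or overlap check at all.

    #include <stdint.h>

    struct vec4 { float x, y, z, w; };   /* 16 bytes, 4-byte aligned */

    void copy_vec4(struct vec4 *dst, const struct vec4 *src) {
        *dst = *src;   /* typically two 8-byte or one 16-byte load/store pair */
    }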

@jfbastien
Member

@sunfishcode: I find the 32 confusing.

Agreed with @sunfishcode: small copies should be generated by the compiler as something other than memmove.

@sunfishcode: add memset?

@kg
Contributor

kg commented Jul 28, 2015

I would actually strongly advocate in favor of arbitrary-sized copies being emitted as memmove with constant sizes instead of arbitrary sets of loads/stores. That lets the runtime intelligently figure out the best way to copy N bytes. The alternative has proven to be a hellpit in JavaScript.

I think it will also pretty much always be smaller on the wire and less work to decode.

@sunfishcode
Member

@jfbastien The 32 is just meant to leave namespace room for a 64 version in the future. I'm open to suggestions.

@jfbastien memzero covers the vast majority of memset use cases.

@sunfishcode
Member

@kg Would you be OK if we just do this simple memmove/memclear proposal first, get consensus that we should do something at all, and then extend it with size/alignment/aliasing/locality hints in a follow-up issue?

@kg
Contributor

kg commented Jul 28, 2015

Yes, punting on it is totally fine. I was just responding to "Small known-size known-alignment copies like that can be done with plain loads and stores (and no overlap check) rather than memmove at the wasm level." I don't think you need a size hint; the runtime can just detect that the size argument is a known constant?

@jfbastien
Member

memset covers more use cases than memzero, and is easy to transform into memzero when provided with a constant zero value. It's just as simple, yet more powerful, so the Extensible Web Manifesto answer is yes.
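
A small illustration of that transform (function names are illustrative): memset with a constant zero byte is exactly a memzero, so a producer or the engine can rewrite one into the other, while the general memset still covers nonzero fills.

    #include <stddef.h>
    #include <string.h>

    void clear(void *p, size_t n) {
        memset(p, 0, n);        /* equivalent to a dedicated memzero(p, n) */
    }

    void fill(void *p, int byte, size_t n) {
        memset(p, byte, n);     /* the general case memzero can't express */
    }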

@sunfishcode
Member

@jfbastien And memset_pattern covers even more use cases than memset, because single-byte patterns are limiting! Where do we draw the line? Memzero is the simplest thing that solves the problem we're trying to solve here. Bulk-zeroing memory is something even the most minimal implementations are going to need to be able to do anyway.

@kg Right, if you have a constant size, then you just need the alignment of the source, the alignment of the destination (should we have separate hints, or one for both?), and maybe also non-overlap of source and destination.

@jfbastien
Member

@sunfishcode that's a pretty strawman argument...

@sunfishcode sunfishcode added this to the Future Features milestone Oct 22, 2015
@sunfishcode sunfishcode removed their assignment Jan 13, 2017
@MikeHolman
Member

I've been looking at perf profiles for the wasm Unity benchmark a bit recently, and I see that some of the hottest functions are doing memcpy- or memset-like things.

If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic, so it may be worth prioritizing.

@lukewagner
Member

Yes, we've seen that in several cases as well, so agreed on prioritizing this for the next batch of features.

@jfbastien
Member

Addressed by #1057.
