WebAssembly and emscripten headers #97

loretoparisi · 2023-03-13T17:27:58Z

Hello I have tried a minimal Emscripten support to Makefile adding

# WASM
EMCXX = em++
EMCC = emcc
EMCXXFLAGS = --bind --std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXPORTED_RUNTIME_METHODS=['addOnPostRun','FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=1" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=0" -s 'EXPORT_NAME="LLAMAModule"' -s "USE_ES6_IMPORT_META=0" -I./
EMCCFLAGS = --bind -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXPORTED_RUNTIME_METHODS=['addOnPostRun','FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=1" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=0" -s 'EXPORT_NAME="LLAMAModule"' -s "USE_ES6_IMPORT_META=0" -I./ 

EMOBJS = utils.bc ggml.bc

wasm: llama_wasm.js quantize_wasm.js
wasmdebug: export EMCC_DEBUG=1
wasmdebug: llama_wasm.js quantize_wasm.js

#
# WASM lib
#

ggml.bc: ggml.c ggml.h
	$(EMCC) -c $(EMCCFLAGS) ggml.c -o ggml.bc
utils.bc: utils.cpp utils.h
	$(EMCXX) -c $(EMCXXFLAGS) utils.cpp -o utils.bc

$(info I EMOBJS:      $(EMOBJS))

#
# WASM executable
#
llama_wasm.js: $(EMOBJS) main.cpp Makefile
	$(EMCXX) $(EMCXXFLAGS) $(EMOBJS) -o llama_wasm.js
quantize_wasm.js: $(EMOBJS) quantize.cpp Makefile
	$(EMCXX) $(EMCXXFLAGS) $(EMOBJS) quantize.cpp -o quantize_wasm.js

It complies ok with both em++ and emcc. At this stage the problem is that main.cpp and quantize.cpp does not expose a proper header file, and I cannot call main as a module, or a function export using Emscripten EMSCRIPTEN_KEEPALIVE to main by example.

In fact a simple C++ headers could be compiled as a node module and then called like

/** file:llama.js */
const llamaModularized = require('./llama_wasm.js');
var llamaModule = null
const _initLLAMAModule = async function () {
    llamaModule = await llamaModularized();
    return true
}
let postRunFunc = null;
const addOnPostRun = function (func) {
    postRunFunc = func;
};
_initLLAMAModule().then((res) => {
    if (postRunFunc) {
        postRunFunc();
    }
});

class LLaMa {
    constructor() {
        this.f = new llamaModule.LLaMa();
    }
    // here modules fun impl
}

module.exports = { LLaMa, addOnPostRun };

and then executed in node scripts like

/** file:run.js */
(async () => {
    const LLaMa = require('./llama.js');
    const loadWASM = function () {
        var self = this;
        return new Promise(function (resolve, reject) {
            LLaMa.addOnPostRun(() => {
                let model = new LLaMa.LLaMa();
                /** use model functions */
            });
        });
    }//loadWASM
    await loadWASM();

}).call(this);

The text was updated successfully, but these errors were encountered:

MarkSchmidty · 2023-03-13T17:44:41Z

Without https://github.com/WebAssembly/memory64 implemented in web assembly you are going to run into show stopping memory issues with the current 4GB limit due to 32bit addressing.

Do you have a plan for getting around that?

Dicklesworthstone · 2023-03-13T18:15:02Z

If you quantized the 7B model to a mixture of 3-bit and 4-bit quantization using https://github.com/qwopqwop200/GPTQ-for-LLaMa then you could stay within that memory envelope.

MarkSchmidty · 2023-03-13T18:30:48Z

I think that's a reasonable proposal @Dicklesworthstone.

A purely 3-bit implementation of llama.cpp using GPTQ could retain acceptable performance and solve the same memory issues. There's an open issue for implementing GPTQ quantization in 3-bit and 4-bit. GPTQ Quantization (3-bit and 4-bit) #9.

Other use cases could benefit from this same enhancement, such as getting 65B under 32GB and 30B under 16GB to further extend access to (perhaps slightly weaker versions of) the larger models.

thypon · 2023-03-20T16:55:14Z

https://twitter.com/nJoyneer/status/1637863946383155220

I was able to run llama.cpp in the browser with a minimal patchset and some *FLAGS

Following the Emscripten version used:

$ emcc --version
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.33-git
Copyright (C) 2014 the Emscripten authors (see AUTHORS.txt)
This is free and open source software under the MIT license.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Following the compile flags:

make CC=emcc CXX=em++ LLAMA_NO_ACCELERATE=1 CFLAGS=" -DNDEBUG -s MEMORY64" CXXFLAGS=" -DNDEBUG -s MEMORY64" LDFLAGS="-s MEMORY64 -s TOTAL_MEMORY=8589934592 -s STACK_SIZE=2097152 --preload-file models " main.html

Following the minimal patch:

diff --git a/Makefile b/Makefile
index 1601079..12a1a80 100644
--- a/Makefile
+++ b/Makefile
@@ -189,12 +189,16 @@ utils.o: utils.cpp utils.h
 	$(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o
 
 clean:
-	rm -f *.o main quantize
+	rm -f *.o main.{html,wasm,js,data,worker.js} main quantize
 
 main: main.cpp ggml.o utils.o
 	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main $(LDFLAGS)
 	./main -h
 
+main.html: main.cpp ggml.o utils.o
+	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main.html $(LDFLAGS)
+	go run server.go
+
 quantize: quantize.cpp ggml.o utils.o
 	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
 
diff --git a/ggml.c b/ggml.c
index 4813f74..3dc2cbc 100644
--- a/ggml.c
+++ b/ggml.c
@@ -6,6 +6,8 @@
 #include <alloca.h>
 #endif
 
+#define _POSIX_C_SOURCE 200809L
+
 #include <assert.h>
 #include <time.h>
 #include <math.h>
@@ -107,7 +109,7 @@ typedef void* thread_ret_t;
     do { \
         if (!(x)) { \
             fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
-            abort(); \
+            /*abort();*/ \
         } \
     } while (0)
 
diff --git a/main.cpp b/main.cpp
index e181056..afb0c53 100644
--- a/main.cpp
+++ b/main.cpp
@@ -785,7 +785,7 @@ int main(int argc, char ** argv) {
     const int64_t t_main_start_us = ggml_time_us();
 
     gpt_params params;
-    params.model = "models/llama-7B/ggml-model.bin";
+    params.model = "models/7B/ggml-model-q4_0.bin";
 
     if (gpt_params_parse(argc, argv, params) == false) {
         return 1;

loretoparisi · 2023-03-20T17:08:29Z

I was able to run llama.cpp in the browser with a minimal patchset and some *FLAGS

Following the Emscripten version used:

$ emcc --version
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.33-git
Copyright (C) 2014 the Emscripten authors (see AUTHORS.txt)
This is free and open source software under the MIT license.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Following the compile flags:

make CC=emcc CXX=em++ LLAMA_NO_ACCELERATE=1 CFLAGS=" -DNDEBUG -s MEMORY64" CXXFLAGS=" -DNDEBUG -s MEMORY64" LDFLAGS="-s MEMORY64 -s TOTAL_MEMORY=8589934592 -s STACK_SIZE=2097152 --preload-file models " main.html

Following the minimal patch:

diff --git a/Makefile b/Makefile
index 1601079..12a1a80 100644
--- a/Makefile
+++ b/Makefile
@@ -189,12 +189,16 @@ utils.o: utils.cpp utils.h
 	$(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o
 
 clean:
-	rm -f *.o main quantize
+	rm -f *.o main.{html,wasm,js,data,worker.js} main quantize
 
 main: main.cpp ggml.o utils.o
 	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main $(LDFLAGS)
 	./main -h
 
+main.html: main.cpp ggml.o utils.o
+	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main.html $(LDFLAGS)
+	go run server.go
+
 quantize: quantize.cpp ggml.o utils.o
 	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
 
diff --git a/ggml.c b/ggml.c
index 4813f74..3dc2cbc 100644
--- a/ggml.c
+++ b/ggml.c
@@ -6,6 +6,8 @@
 #include <alloca.h>
 #endif
 
+#define _POSIX_C_SOURCE 200809L
+
 #include <assert.h>
 #include <time.h>
 #include <math.h>
@@ -107,7 +109,7 @@ typedef void* thread_ret_t;
     do { \
         if (!(x)) { \
             fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
-            abort(); \
+            /*abort();*/ \
         } \
     } while (0)
 
diff --git a/main.cpp b/main.cpp
index e181056..afb0c53 100644
--- a/main.cpp
+++ b/main.cpp
@@ -785,7 +785,7 @@ int main(int argc, char ** argv) {
     const int64_t t_main_start_us = ggml_time_us();
 
     gpt_params params;
-    params.model = "models/llama-7B/ggml-model.bin";
+    params.model = "models/7B/ggml-model-q4_0.bin";
 
     if (gpt_params_parse(argc, argv, params) == false) {
         return 1;

Wow well done! Why did you had to remove abort(); from ggml?

thypon · 2023-03-20T17:11:43Z

The abort(); case was hit when out of memory, and before outputting the partial LLM output, so no string was shown.
Unfortunately, I did not find any way of using Memory64 WASM extension without incurring in some bugs on both Firefox and Chrome.

loretoparisi · 2023-03-20T17:25:33Z

The abort(); case was hit when out of memory, and before outputting the partial LLM output, so no string was shown. Unfortunately, I did not find any way of using Memory64 WASM extension without incurring in some bugs on both Firefox and Chrome.

So given the limit of WASM 64, you have to go for 3 and 4 bit quantization using GPTQ I think.

thypon · 2023-03-20T17:29:39Z

The abort(); case was hit when out of memory, and before outputting the partial LLM output, so no string was shown. Unfortunately, I did not find any way of using Memory64 WASM extension without incurring in some bugs on both Firefox and Chrome.

So given the limit of WASM 64, you have to go for 3 and 4 bit quantization using GPTQ I think.

It's already quantized 4bits when converting. 7b overflows 8GB of allocated WASM64 memory though.
Besides that it's quite slow since there is not both (pthread OR simd) AND memory64. Whenever I try to mix any of the two, the compiler or the linter fails to create or run the output.

okpatil4u · 2023-04-19T07:52:04Z

@thypon apparently memory64 is available in firefox nightly, did you check it ?

https://webassembly.org/roadmap/#feature-note-2

lapo-luchini · 2023-05-09T22:01:26Z

The new RedPajama-3B seems like a nice tiny model that could probably fit without memory64.

IsaacRe · 2023-05-17T17:59:05Z

@thypon @loretoparisi I'm curious, what sort of performance drop did you notice running in browser from running natively? How many toks/sec were you getting?

thypon · 2023-06-07T12:22:27Z

@IsaacRe Did not make a performance comparison since it was not 100% stable and needed to be refined. As mentioned it was single core since multithreaded + memory64 on firefox nightly was not working properly together, and crashing the experiment.

@okpatil4u already running with experimental memory64

okpatil4u · 2023-07-04T03:27:49Z

Hey @thypon, did you make any progress on this experiment ?

thypon · 2023-07-04T17:20:35Z

I'm not actively working on this at the current stage.

lukestanley · 2023-07-04T19:55:30Z

@okpatil4u
I broadly followed the very useful steps above by @loretoparisi and was able to run really small models only with latest Emscripten and a fairly recent master commit. I am way out of my comfort zone with C++ or WASM (since I spent most of my time with Typescript and Python). I didn't get around to installing Firefox nightly and stopped for now.
I last had the tiny Shakespeare models running in the browser.
The diff by loretoparisi was made a while ago so I had to make some significant changes and I am a complete C++ noob, so take this with a bag of salt, but if it helps someone, great:
lukestanley@41cbd2b

rahuldshetty · 2023-07-21T15:40:58Z

I've tried the approach suggested by @lukestanley and @loretoparisi and got starcoder.cpp to run on browser.
Published a demo project on this: https://github.com/rahuldshetty/starcoder.js

I tried with tiny_starcoder_py model as the weight size were quite small to fit without mem64, and tried to see the performance/accuracy. It seems like the output of the model without mem64 is gibberish while mem64 version results in meaningful output. Not sure if memory addressing in 32bit vs 64bit has to do with it.

mindplay-dk · 2023-09-16T15:13:12Z

Without https://github.com/WebAssembly/memory64 implemented in web assembly you are going to run into show stopping memory issues with the current 4GB limit due to 32bit addressing.

Do you have a plan for getting around that?

How about WebGPU? Probably better to run it off-CPU where possible anyhow?

(full disclosure: I have no idea what I'm talking about.)

mohamedmansour · 2023-09-24T02:44:38Z

Without https://github.com/WebAssembly/memory64 implemented in web assembly you are going to run into show stopping memory issues with the current 4GB limit due to 32bit addressing.

Do you have a plan for getting around that?

The implementation status is complete for emscripten:
https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md

github-actions · 2024-04-10T01:08:09Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

loretoparisi · 2024-04-10T09:55:17Z

Not sure what is the progress here, apparently there are overlapping or related opened issues.

ggerganov · 2024-04-10T10:07:33Z

There is this project that might be relevant: https://github.com/ngxson/wllama

flatsiedatsie · 2024-04-12T15:28:52Z

@ggerganov Thanks for sharing that. I'm already using https://github.com/tangledgroup/llama-cpp-wasm as the basis of a big project.

So far llama-cpp-wasm has allowed me to run pretty much any .gguf that is less than 2GB in size in the browser (and that limitation seems to be related to the caching mechanism of that project, so I suspect the real limit would be 4GB).

People talk about bringing AI to the masses, but the best way to do that is with browser-based technology. My mom is never going to install Ollama and the like.

ggerganov · 2024-04-12T15:37:09Z

My mom is never going to install Ollama and the like.

But would your mom be OK with her browser downloading gigabytes of weights upon page visit? To me this seems like the biggest obstacle for browser-based LLM inference

flatsiedatsie · 2024-04-12T20:38:57Z

She's already doing it :-)

Sneak preview:

(100% browser based)

loretoparisi · 2024-04-24T16:34:34Z

My mom is never going to install Ollama and the like.

But would your mom be OK with her browser downloading gigabytes of weights upon page visit? To me this seems like the biggest obstacle for browser-based LLM inference

Agreed, the best example so far is LLM MLC, web version:
https://webllm.mlc.ai/

you can see that it can download 4GB in shards like 20 shards or so for a Llama-2 7B weights, 4-bit quantized. Of course this means that you can wait from tens seconds to few minutes to start the inference. And this is not going to change soon unless the quantization at 3,2bits works better and the accuracy is good as the 4bit...

By example, if we take Llama-2, 8B we have 108 shards

and it took 114 seconds to complete on my fiber channel:

before being ready to infer

on Mac M1 Pro I get

prefill: 13.5248 tokens/sec, decoding: 7.9857 tokens/sec
Models with “-1k” suffix signify 1024 context length, lowering ~2-3GB VRAM requirement compared to their counterparts. Feel free to start trying with those.

flatsiedatsie · 2024-04-24T19:03:26Z

Huggingface has recently released a streaming option for GGUF, where you can already start inference even though the model is noy fully loaded yet. At least, that's my understanding from a recent Youtube video by Yannic Kilcher.

For my project I'm trying to use a less than 2Gb quant of Phi 2 with 128K context. I think that model will become the best model for browser-based used for a while.

slaren · 2024-04-24T19:18:51Z

You may be thinking of a library that Huggingface released that can read GGUF metadata without downloading the whole file. You wouldn't gain much from streaming the model for inference, generally the entire model is necessary to generate every token.

flatsiedatsie · 2024-04-24T19:44:50Z

@slaren Ah, thanks for clarifying that. It sounded a little too good to be true :-)

This was referenced Mar 13, 2023

Raspberry Pi 4 4GB #58

Closed

GPTQ Quantization (3-bit and 4-bit) #9

Closed

gjmulder added the enhancement New feature or request label Mar 15, 2023

westurner mentioned this issue Sep 1, 2023

Instructions on how to build a wasm ggml. ggerganov/ggml#419

Closed

Jdo300 mentioned this issue Oct 23, 2023

Trouble running llama.cpp compiled for OpenMPI #3752

Closed

ei-grad mentioned this issue Oct 30, 2023

[Feature Request] Dawn C++ WebGPU backend #837

Closed

lxe mentioned this issue Dec 10, 2023

[Build] Offer WASM in release? #4399

Closed

github-actions bot added the stale label Mar 25, 2024

github-actions bot closed this as completed Apr 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WebAssembly and emscripten headers #97

WebAssembly and emscripten headers #97

loretoparisi commented Mar 13, 2023

MarkSchmidty commented Mar 13, 2023 •

edited

Loading

Dicklesworthstone commented Mar 13, 2023

MarkSchmidty commented Mar 13, 2023 •

edited

Loading

thypon commented Mar 20, 2023 •

edited

Loading

loretoparisi commented Mar 20, 2023

thypon commented Mar 20, 2023 •

edited

Loading

loretoparisi commented Mar 20, 2023

thypon commented Mar 20, 2023 •

edited

Loading

okpatil4u commented Apr 19, 2023

lapo-luchini commented May 9, 2023

IsaacRe commented May 17, 2023

thypon commented Jun 7, 2023

okpatil4u commented Jul 4, 2023

thypon commented Jul 4, 2023

lukestanley commented Jul 4, 2023 •

edited

Loading

rahuldshetty commented Jul 21, 2023

mindplay-dk commented Sep 16, 2023

mohamedmansour commented Sep 24, 2023

github-actions bot commented Apr 10, 2024

loretoparisi commented Apr 10, 2024

ggerganov commented Apr 10, 2024

flatsiedatsie commented Apr 12, 2024

ggerganov commented Apr 12, 2024

flatsiedatsie commented Apr 12, 2024 •

edited

Loading

loretoparisi commented Apr 24, 2024

flatsiedatsie commented Apr 24, 2024

slaren commented Apr 24, 2024

flatsiedatsie commented Apr 24, 2024

WebAssembly and emscripten headers #97

WebAssembly and emscripten headers #97

Comments

loretoparisi commented Mar 13, 2023

MarkSchmidty commented Mar 13, 2023 • edited Loading

Dicklesworthstone commented Mar 13, 2023

MarkSchmidty commented Mar 13, 2023 • edited Loading

thypon commented Mar 20, 2023 • edited Loading

loretoparisi commented Mar 20, 2023

thypon commented Mar 20, 2023 • edited Loading

loretoparisi commented Mar 20, 2023

thypon commented Mar 20, 2023 • edited Loading

okpatil4u commented Apr 19, 2023

lapo-luchini commented May 9, 2023

IsaacRe commented May 17, 2023

thypon commented Jun 7, 2023

okpatil4u commented Jul 4, 2023

thypon commented Jul 4, 2023

lukestanley commented Jul 4, 2023 • edited Loading

rahuldshetty commented Jul 21, 2023

mindplay-dk commented Sep 16, 2023

mohamedmansour commented Sep 24, 2023

github-actions bot commented Apr 10, 2024

loretoparisi commented Apr 10, 2024

ggerganov commented Apr 10, 2024

flatsiedatsie commented Apr 12, 2024

ggerganov commented Apr 12, 2024

flatsiedatsie commented Apr 12, 2024 • edited Loading

loretoparisi commented Apr 24, 2024

flatsiedatsie commented Apr 24, 2024

slaren commented Apr 24, 2024

flatsiedatsie commented Apr 24, 2024

MarkSchmidty commented Mar 13, 2023 •

edited

Loading

MarkSchmidty commented Mar 13, 2023 •

edited

Loading

thypon commented Mar 20, 2023 •

edited

Loading

thypon commented Mar 20, 2023 •

edited

Loading

thypon commented Mar 20, 2023 •

edited

Loading

lukestanley commented Jul 4, 2023 •

edited

Loading

flatsiedatsie commented Apr 12, 2024 •

edited

Loading