Run Llama 3.1 8B directly in a browser

With state-of-the-art 2-bit compression

How to try

Simply visit the homepage of this project.

How this works

TL;DR

I implemented multithreaded CPU inference for Llama 3.1 8B with AQLM+PV and LLM.int8() quantization from scratch in Rust. I then deployed it in the browser using WebAssembly and Web Workers.

How it started

In May 2024, my colleagues and I published a paper introducing a new compression method called PV-tuning. This method enhances existing compression techniques by refining compressed models without altering their representations. When combined with AQLM — a method developed in January 2024 — PV-tuning achieves state-of-the-art results in 2-bit compression.

Around the time we published our paper, Meta released Llama 3.1 8B. I thought it would be fun to run this model in a browser using our method, so I began working on it. Implementing LLM inference in a low-level language like Rust proved to be a fascinating challenge — one I highly recommend. It forces you to understand all the intricate details that you might overlook when using high-level libraries like Transformers.

The Power of Rust

During my bachelor's degree, I learned Rust at the Yandex School of Data Analysis. I liked the language but hadn't had a chance to use it in real projects. This project provided an excellent opportunity, as Rust is ideal for compiling to WebAssembly.

To my pleasant surprise, many fundamental ML technologies I needed were already implemented in Rust. These included Safetensors, Hugging Face's simple format for storing tensors, and tiktoken, the BPE tokenizer created by OpenAI that Llama 3.1 uses.
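As a rough illustration, reading a checkpoint with the safetensors crate looks something like the sketch below. The file name is a placeholder, and in the browser build the bytes would arrive from a fetch rather than the filesystem.

```rust
use safetensors::SafeTensors;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In a native build the weights can be read from disk; in the browser
    // the same bytes would come from a fetch. "model.safetensors" is a
    // placeholder name, not the project's actual file.
    let buffer = std::fs::read("model.safetensors")?;
    let tensors = SafeTensors::deserialize(&buffer)?;

    // Each entry exposes its dtype, shape, and a raw byte view of the data.
    for (name, view) in tensors.tensors() {
        println!("{name}: {:?} {:?}", view.dtype(), view.shape());
    }
    Ok(())
}
```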

Multithreading with Web Workers

I implemented multithreading using Web Workers, which communicate with the main thread bidirectionally through message passing.

My data-parallel solution splits all matrices by output dimension, with each worker receiving a single slice of each matrix.
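To make the split concrete, here is a minimal sketch of dividing a row-major weight matrix by its output dimension, with each worker computing its own slice of the output vector. The function names and chunking scheme are illustrative, not the project's actual API.

```rust
// Worker `k` of `n_workers` owns a contiguous block of output rows.
fn worker_rows(rows: usize, n_workers: usize, k: usize) -> std::ops::Range<usize> {
    let chunk = (rows + n_workers - 1) / n_workers;
    let start = (k * chunk).min(rows);
    start..rows.min(start + chunk)
}

// Each worker multiplies only its rows by the full input vector and
// returns the matching slice of the output.
fn partial_matvec(
    weights: &[f32], // row-major slice owned by this worker's rows
    cols: usize,
    input: &[f32],
    rows: std::ops::Range<usize>,
) -> Vec<f32> {
    rows.map(|r| {
        weights[r * cols..(r + 1) * cols]
            .iter()
            .zip(input)
            .map(|(w, x)| w * x)
            .sum::<f32>()
    })
    .collect()
}
```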

The trickiest part was facilitating communication between workers and the main thread. To achieve this, I developed a custom RPC stack with Rust-JavaScript interoperability.

When a client needs to multiply a vector by a matrix, it creates a request enum instance, serializes it, and sends it to JavaScript. JavaScript forwards the request to the worker, which deserializes it, processes it, and serializes the result before sending it back. JavaScript then relays the response to the client, which deserializes it.
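Here is a rough sketch of what such a round trip could look like on the Rust side. The enum variants and the choice of serde with bincode for the wire format are my assumptions for illustration, not necessarily what the project uses.

```rust
use serde::{Deserialize, Serialize};

// Hypothetical request/response types; the real protocol may differ.
#[derive(Serialize, Deserialize)]
enum Request {
    // Multiply this worker's matrix slice by the input vector.
    MatVec { layer: usize, input: Vec<f32> },
}

#[derive(Serialize, Deserialize)]
enum Response {
    // The worker's slice of the output vector.
    MatVec { layer: usize, output: Vec<f32> },
}

// Client side: serialize a request into bytes that JavaScript can forward
// to a worker via postMessage (bincode 1.x API assumed here).
fn encode_request(req: &Request) -> Vec<u8> {
    bincode::serialize(req).expect("serialization should not fail")
}

// Worker side: decode the bytes received from JavaScript.
fn decode_request(bytes: &[u8]) -> Request {
    bincode::deserialize(bytes).expect("malformed request")
}
```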

This approach boosted performance by approximately 2x on my M1 MacBook Pro.

Quantization kernels

I implemented quantization kernels for AQLM and LLM.int8(). While I managed to implement int8 quantization entirely in safe Rust, AQLM required some unsafe code.
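For illustration, a simplified int8 matrix-vector kernel in safe Rust might look like the following. It keeps only per-output-row scales and omits LLM.int8()'s activation quantization and outlier handling, so treat it as a sketch rather than the project's kernel.

```rust
// Row-wise int8 weight kernel: `weights` is row-major with
// rows = output features and cols = input features; `scales`
// holds one dequantization scale per output row.
fn int8_matvec(
    weights: &[i8],
    scales: &[f32],
    input: &[f32],
    rows: usize,
    cols: usize,
) -> Vec<f32> {
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(scales.len(), rows);
    assert_eq!(input.len(), cols);

    let mut out = vec![0.0f32; rows];
    for r in 0..rows {
        let row = &weights[r * cols..(r + 1) * cols];
        let mut acc = 0.0f32;
        for (w, x) in row.iter().zip(input) {
            // Dequantize on the fly and accumulate in f32.
            acc += *w as f32 * x;
        }
        out[r] = acc * scales[r];
    }
    out
}
```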

I also conducted a small study on optimal data layout for AQLM quantization by code-generating different layouts and measuring their performance. It turned out that reordering data in memory can yield about a 10% performance improvement.
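For context, an AQLM-style decode reconstructs each small group of weights as a sum of codebook vectors, scaled per output row. The sketch below assumes one illustrative layout (codes grouped per output row, codebooks as arrays of fixed-size vectors); presumably the layout study varied exactly how arrays like these are ordered in memory.

```rust
// Hypothetical AQLM-style decode of one weight group. The group size,
// code width, and layout here are illustrative, not the project's exact ones.
const GROUP: usize = 8;

fn decode_group(
    codes: &[u16],                       // one code per codebook for this group
    codebooks: &[Vec<[f32; GROUP]>],     // each codebook is a table of GROUP-length vectors
    scale: f32,                          // per-output-row scale
) -> [f32; GROUP] {
    let mut out = [0.0f32; GROUP];
    for (codebook, &code) in codebooks.iter().zip(codes) {
        let vector = &codebook[code as usize];
        for i in 0..GROUP {
            out[i] += vector[i];
        }
    }
    for value in &mut out {
        *value *= scale;
    }
    out
}
```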

Open-source

I published the source code on GitHub. You can find it here.

Performance

Performance varies from machine to machine. On my M1 MacBook Pro, the inference speed is 1.4 tokens/s.

Credits

This project was done during my work at Yandex Research. I would also like to thank Daniil Pavlov for helping me with the project.

If you have any questions, feel free to contact me via email or Telegram. You can find my contact information here.