Local LLM Chat



Runs entirely in your browser via WebAssembly — no data is sent to any server. Generation speed depends on your device; first load downloads the model (~0.1 – 0.7 GB, cached after).

How It Works

This tool runs a large language model entirely inside your browser tab using WebAssembly. No text you type is ever sent to a server — inference happens locally on your CPU.

Powered by Transformers.js

Transformers.js is the official JavaScript port of Hugging Face Transformers. It uses ONNX Runtime Web to run pre-converted ONNX model files inside the browser's WebAssembly engine — no plugins or native installs required.
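As a rough sketch, loading a model with Transformers.js looks something like the following. The package name and model id shown here (`@huggingface/transformers`, `Xenova/TinyLlama-1.1B-Chat-v1.0`) are illustrative assumptions; the page may use a different model id or package version.

```javascript
// Minimal Transformers.js sketch: create a text-generation pipeline
// and run it. On first call this downloads the ONNX weights; later
// calls reuse the browser cache.
import { pipeline } from '@huggingface/transformers';

const generator = await pipeline(
  'text-generation',
  'Xenova/Tinya-1.1B-Chat-v1.0'.replace('Tinya', 'TinyLlama'), // model id is an assumption
);

const output = await generator('What is WebAssembly?', { max_new_tokens: 64 });
console.log(output[0].generated_text);
```

Because everything runs through ONNX Runtime Web's WASM backend, the same code works in any modern browser with no server component.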

Available Models
Model                Size (quantized)  Notes
TinyLlama 1.1B Chat  ~640 MB           Instruction-tuned chat model; best results for Q&A and conversation.
GPT-2                ~120 MB           Lightweight text-completion model. Fast to load; not instruction-tuned.
Implementation Details
  1. On first use the browser downloads the selected model's ONNX weights from the Hugging Face Hub CDN and caches them in the browser's Cache Storage. Subsequent visits reuse the cached files with no re-download.
  2. Tokenization (text → token IDs) and de-tokenization (token IDs → text) run in the same WASM thread using the model's bundled vocabulary.
  3. The TextStreamer API streams generated tokens back to the page one at a time, so you see the response appear incrementally — just like server-side streaming APIs.
  4. For chat models (TinyLlama), the full conversation history is passed on every turn using the model's built-in chat template, keeping the assistant aware of prior context.
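Step 4 above amounts to maintaining an array of role-tagged messages and resending the whole array each turn. A minimal sketch, with illustrative names (`history`, `addTurn`) that are not part of the Transformers.js API:

```javascript
// Conversation history as an array of { role, content } messages,
// the shape expected by chat templates.
const history = [
  { role: 'system', content: 'You are a helpful assistant.' },
];

// Append one turn and return the updated history.
function addTurn(history, role, content) {
  history.push({ role, content });
  return history;
}

addTurn(history, 'user', 'Hi there!');
addTurn(history, 'assistant', 'Hello! How can I help?');
addTurn(history, 'user', 'What did I just say?');

// On each turn the WHOLE array is passed to the generator, which
// applies the model's chat template before tokenizing, so the
// assistant can answer questions about earlier turns.
console.log(history.length);
```

Because the full history is re-tokenized every turn, long conversations grow the prompt and slow generation; that cost is inherent to this approach.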
Performance Expectations

WASM inference is CPU-bound and noticeably slower than GPU-accelerated APIs. Typical generation speed on a modern laptop is 1 – 10 tokens/second depending on the model and CPU. The first generation in a session may be slower as the WASM engine JIT-compiles the model graph.
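To put the quoted range in concrete terms, throughput is just tokens generated divided by elapsed time. A trivial helper (the function name is illustrative, not from any API):

```javascript
// Tokens-per-second from a token count and elapsed milliseconds.
function tokensPerSecond(tokenCount, elapsedMs) {
  return tokenCount / (elapsedMs / 1000);
}

// e.g. 64 tokens generated in 16 seconds:
console.log(tokensPerSecond(64, 16000)); // 4 tokens/s, within the 1-10 range
```

At 4 tokens/s, a 200-token reply takes about 50 seconds, which is why the shorter GPT-2 model can feel much snappier despite weaker output quality.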

Privacy

All inference runs locally in your browser tab. Your messages, model weights, and any generated text never leave your device.