WebLLM: A High-Performance In-Browser LLM Inference Engine
We are excited to share a new chapter of the WebLLM project: the WebLLM engine, a high-performance in-browser LLM inference engine.
As we see promising opportunities for running capable models locally, web browsers stand out as a universally accessible platform that lets users engage with web applications without any installation. Therefore, we integrate LLMs directly into the browser, enabling people to run LLMs just by opening a webpage.
The WebLLM engine:
- Is accelerated by local GPU (via WebGPU) and optimized by machine learning compilation techniques (via MLC-LLM and TVM)
- Offers a fully OpenAI-compatible API for both chat completion and structured JSON generation, allowing developers to treat WebLLM as a drop-in replacement for the OpenAI API, but with any open-source model running locally (see the API sketch after this list)
- Provides built-in support for Web Workers and Service Workers, separating backend execution from the UI flow so that developers can treat the engine as an endpoint (see the worker sketch below)
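To give a concrete feel for the OpenAI-compatible API, here is a minimal TypeScript sketch. It assumes the `@mlc-ai/web-llm` npm package and the prebuilt `Llama-3-8B-Instruct-q4f32_1-MLC` model ID; check the documentation for the exact names in the current release.

```typescript
// A minimal sketch, assuming the @mlc-ai/web-llm package and the model ID below.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Download (or load from cache) the weights and compile the model for WebGPU.
  const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  // Chat completion with the familiar OpenAI request shape.
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Explain WebGPU in one sentence." },
    ],
    temperature: 0.7,
  });
  console.log(reply.choices[0].message.content);

  // Structured JSON generation via the OpenAI-style response_format field.
  const structured = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Return a JSON object with fields color and size." }],
    response_format: { type: "json_object" },
  });
  console.log(structured.choices[0].message.content);
}

main();
```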
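The worker integration exposes the same interface from a background thread. The sketch below assumes the `WebWorkerMLCEngineHandler` and `CreateWebWorkerMLCEngine` exports; the Service Worker variant is analogous.

```typescript
// worker.ts: runs the engine off the UI thread and answers requests from the page.
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
```

```typescript
// main.ts: the UI thread gets an engine object backed by the worker.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Llama-3-8B-Instruct-q4f32_1-MLC",
);
// engine.chat.completions.create(...) now works exactly as in the sketch above,
// while all model execution stays inside the worker.
```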
Try out Llama 3, Phi-3, Hermes-2-Pro, Mistral v0.3, Qwen2, and more at https://chat.webllm.ai/ (first ensure your browser supports WebGPU at https://webgpureport.org/).
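For a quick programmatic check, a page can also feature-detect WebGPU before loading any model; `navigator.gpu` is the standard entry point (a generic sketch, not WebLLM-specific code):

```typescript
// Feature-detect WebGPU before attempting to load a model.
if (!("gpu" in navigator)) {
  console.warn("WebGPU is not available in this browser; WebLLM cannot run here.");
} else {
  const adapter = await navigator.gpu.requestAdapter();
  console.log(adapter ? "WebGPU adapter found." : "WebGPU is present but no adapter is available.");
}
```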
Please check out the blog post to see how to use WebLLM to build web applications and in-browser agents:
https://blog.mlc.ai/2024/06/13/webllm-a-high-performance-in-browser-llm-inference-engine
Let us know your feedback, and we look forward to working with the community to bring open foundation models to everyone!