Lior Alexander (@LiorOnAI)

2025-05-01 | โค๏ธ 696 | ๐Ÿ” 125


Must read: how to build an LLM inference engine from scratch in C++ and CUDA, without libraries.

Andrew Chan shows how to match and beat llama.cpp with a custom Mistral-v0.2 backend.

It runs at 63.8 tok/s on short prompts and 58.8 tok/s on long prompts on a single RTX 4090. https://x.com/LiorOnAI/status/1917958210016727248/photo/1
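To give a flavor of what "C++ and CUDA without libraries" means in practice, here is a minimal illustrative sketch of a GEMV kernel, the matrix-vector product that dominates single-batch token decode. This is not code from Chan's post; the kernel name and the layout assumptions (fp32, row-major weights, one warp per output row) are mine.

```cuda
// Sketch only: in single-batch LLM decode, each weight matrix is applied
// as out = W @ x (a GEMV). One warp computes one output row: lanes read
// the row with a coalesced stride of 32, then reduce via warp shuffles.
#include <cuda_runtime.h>

__global__ void gemv_fp32(const float* __restrict__ W,  // [rows, cols], row-major
                          const float* __restrict__ x,  // [cols]
                          float* __restrict__ out,      // [rows]
                          int rows, int cols) {
    int warps_per_block = blockDim.x / 32;
    int row  = blockIdx.x * warps_per_block + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int c = lane; c < cols; c += 32)  // coalesced across the warp
        acc += W[(size_t)row * cols + c] * x[c];

    // Tree reduction within the warp.
    for (int off = 16; off > 0; off >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, off);

    if (lane == 0) out[row] = acc;
}

// Example launch: 128 threads = 4 warps per block, one warp per output row.
// gemv_fp32<<<(rows + 3) / 4, 128>>>(d_W, d_x, d_out, rows, cols);
```

A real engine quantizes the weights (e.g. fp16 or int8) and fuses bias, activation, and residual adds into such kernels, but the warp-per-row GEMV pattern is the common starting point.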

🔗 Original link

Media

image



Tags

domain-ai-ml domain-visionos