Lior Alexander (@LiorOnAI)
2025-05-01 | ❤️ 696 | 🔁 125
Must read: How to build an LLM inference engine using C++ and CUDA from scratch without libraries.
Andrew Chan shows how to match and beat llama.cpp with a custom Mistral-v0.2 backend.
It runs at 63.8 tok/s (short prompts) and 58.8 tok/s (long prompts) on a single RTX 4090. https://x.com/LiorOnAI/status/1917958210016727248/photo/1
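
The post builds the kernels by hand instead of calling cuBLAS or other libraries. At batch size 1, decode time is dominated by matrix-vector products against the weight matrices, so below is a minimal sketch of that style of kernel: one warp per output row, fp32 accumulation, and a warp-shuffle reduction. All names and sizes are illustrative assumptions, not code from Chan's post.

```cuda
// Sketch of a decode-time matvec kernel, the hot loop of a single-batch
// LLM inference engine: y = W x, one warp per output row.
// Dimensions and names are illustrative, not taken from the post.
#include <cuda_fp16.h>
#include <cstdio>

__global__ void matvec_fp16(const half* __restrict__ W,  // [rows, cols], row-major
                            const half* __restrict__ x,  // [cols]
                            float* __restrict__ y,       // [rows]
                            int rows, int cols) {
  int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;  // warp -> row
  int lane = threadIdx.x % 32;
  if (row >= rows) return;

  // Each lane accumulates a strided slice of the dot product in fp32.
  float acc = 0.0f;
  for (int c = lane; c < cols; c += 32)
    acc += __half2float(W[(size_t)row * cols + c]) * __half2float(x[c]);

  // Warp-level tree reduction via shuffles; lane 0 holds the row sum.
  for (int off = 16; off > 0; off >>= 1)
    acc += __shfl_down_sync(0xffffffff, acc, off);
  if (lane == 0) y[row] = acc;
}

int main() {
  const int rows = 4096, cols = 4096;  // e.g. one transformer weight matrix
  half *W, *x; float *y;
  cudaMallocManaged(&W, (size_t)rows * cols * sizeof(half));
  cudaMallocManaged(&x, cols * sizeof(half));
  cudaMallocManaged(&y, rows * sizeof(float));
  for (size_t i = 0; i < (size_t)rows * cols; ++i) W[i] = __float2half(0.001f);
  for (int i = 0; i < cols; ++i) x[i] = __float2half(1.0f);

  const int warps_per_block = 4;  // 128 threads per block
  int blocks = (rows + warps_per_block - 1) / warps_per_block;
  matvec_fp16<<<blocks, warps_per_block * 32>>>(W, x, y, rows, cols);
  cudaDeviceSynchronize();
  printf("y[0] = %f (expect ~%f)\n", y[0], 0.001f * cols);
  return 0;
}
```

A real engine fuses dequantization, bias, and activation into kernels like this and streams the weights through them layer by layer; the sketch only shows the reduction pattern.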
🔗 Original link
Media
