Find the Best Local LLM for Your Hardware
whichllm: Find the best local LLM that actually runs on your hardware
Find the best local large language model (LLM) that can genuinely operate on your own machine. whichllm auto-detects your GPU, CPU, and RAM, and then ranks the top models from HuggingFace that fit your system. It’s built around real benchmark comparisons rather than simple size heuristics, so you get usable recommendations tailored to your hardware profile. The project is active, draws on live data that reflects current models and runtimes, and links to a Japanese-language version (日本語版はこちら, "Japanese version here") to broaden accessibility.
Introduction to whichllm
whichllm is more than a simple model chooser. It is a comprehensive tool designed for developers, researchers, and enthusiasts who want to experiment with local models without guessing which one will actually run efficiently on their hardware. Instead of offering a static list of the largest models, whichllm uses an evidence-based ranking system to pick the best-fitting choice for you, considering VRAM, memory usage, speed, and benchmark quality. It also simulates hardware, helps you plan upgrades, and provides scripts to integrate model usage into your workflows. The project is open source and free to use under the MIT license.
Capabilities at a glance
- The tool automatically detects hardware: NVIDIA, AMD, Apple Silicon, or CPU-only, and then selects suitable models based on your actual hardware constraints.
- You can simulate a GPU before buying new hardware, or install and use whichllm in a matter of minutes.
- The project includes ready-to-use commands, a rich set of features, and an API-friendly JSON output for scripting and automation.
Quick start: one command to begin
You don’t need a pre-configured project to begin exploring models. A single command lays the groundwork:
- Quick start (no project setup): uvx whichllm@latest
- GPU simulation (planning for hardware): uvx whichllm@latest --gpu "RTX 4090"
If you want to install and use it regularly, you can install and upgrade via the provided toolchain:
- Install using the tool: uv tool install whichllm
- Upgrade an existing install: uv tool upgrade whichllm
There are other install paths for different environments:
- brew install andyyyy64/whichllm/whichllm
- pip install whichllm
Visuals show the experience
- See how the tool looks in action with a demonstration GIF: assets/demo.gif
- For a run-time demonstration of a chat session, there is a dedicated GIF: assets/demo-run.gif
Common workflows after installation
Once whichllm is installed, you can run it directly from the command line. For one-off experiments, replace whichllm with uvx whichllm@latest in any of the commands below to run the latest release without installing it.
- Best models for this machine: whichllm
- Pretend you have a specific GPU: whichllm --gpu "RTX 4090"
- Compare upgrade candidates: whichllm upgrade "RTX 4090" "RTX 5090" "H100"
- Find the GPU needed for a model: whichllm plan "llama 3 70b"
- Start a chat with a model: whichllm run "qwen 2.5 1.5b gguf"
- Print a copy-paste Python snippet: whichllm snippet "qwen 7b"
- Return JSON for scripts: whichllm --top 1 --json
See it in action
- A sample output illustrates ranking and performance:
- 1) Qwen/Qwen3.6-27B with high score and strong speed
- 2) Qwen/Qwen3-32B with slightly different characteristics
- 3) Qwen/Qwen3-30B-A3B with impressive speed on larger configurations
- The 32B model may fit your card, but whichllm ranks the 27B as the top choice based on real benchmarks, recency, and overall suitability. This demonstrates the core philosophy: model selection is about actual performance, not just size.
What can I run? Real top picks and live data
whichllm maintains a snapshot of top model recommendations (2026-05), but your results stay in sync with live HuggingFace data. It’s not a static list; it reflects the field as it evolves. The rankings depend on three pillars: actual VRAM fit, benchmark quality, and runtime characteristics, all blended with a cautious confidence model to avoid overclaiming results.
Hardware considerations (illustrative examples)
- RTX 5090 with 32 GB VRAM often yields top performers such as Qwen3.6-27B in high-quality configurations and strong throughput (about 40 tokens per second in some cases).
- RTX 4090 or 3090 with 24 GB VRAM commonly gets Qwen3.6-27B as a solid top pick, with a strong balance between quality and speed (around 27 tokens per second).
- RTX 4060 with 8 GB VRAM may favor lighter models like Qwen3-14B, achieving solid speed even with limited memory.
- Apple M3 Max with 36 GB of unified memory can run similar high-quality models but with lower throughput due to architectural differences; the ranking still prioritizes quality and compatibility.
- CPU-only configurations rely on optimized MoE or smaller variants and typically yield lower throughput but are still usable in constrained environments.
Why whichllm? Core philosophy and design
- Evidence-based ranking, not merely a size heuristic: The top pick is chosen from merged real benchmarks (LiveBench, Artificial Analysis, Aider, multimodal/vision, Chatbot Arena ELO, Open LLM Leaderboard) rather than simply the biggest model that fits.
- Recency-aware: whichllm demotes older generations as new ones appear, with the snapshot date clearly printed under each ranking to reveal freshness.
- Evidence-graded and guarded: Each score is annotated with its reliability context (direct / variant / base / interpolated / self-reported) and dampened by confidence, so you can gauge how much to trust a particular rating.
- Architecture-aware estimates: VRAM calculations consider weights, KV cache, activation, and overhead. Speed is modeled with bandwidth, quantization, backend differences, and whether MoE active elements are in use.
- One-command, scriptable: The CLI provides outputs suitable for parsing. You can pipe to tools like jq for automation in pipelines.
- Live data: Models are fetched directly from HuggingFace, with curated offline fallbacks if needed.
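To make the architecture-aware VRAM estimate more concrete, here is a minimal sketch of the "weights + KV cache + activation + overhead" sum. It is an illustration under stated assumptions, not whichllm's actual code: the function name, the fp16 KV cache, and the flat activation allowance are all hypothetical, and the example values do not describe any specific model.

```python
def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes: int = 2,           # assumption: fp16 KV cache entries
    activation_gb: float = 0.5,  # hypothetical flat activation allowance
    overhead_gb: float = 0.5,    # assumption: roughly 500 MB framework overhead
) -> float:
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes)."""
    # Quantized weights: parameter count times bits per weight, converted to bytes.
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: one K and one V tensor per layer, each kv_heads * head_dim per token.
    kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes / 1e9
    return weights_gb + kv_cache_gb + activation_gb + overhead_gb


# Illustrative 7B-class model at about 4.5 bits per weight with a 4096-token context.
example_gb = estimate_vram_gb(7, 4.5, n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)
print(f"{example_gb:.2f} GB")
```

The point of the sketch is that context length and KV-head count matter alongside raw parameter count, which is why two models of similar size can need noticeably different amounts of memory.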
Features that set whichllm apart
- Auto-detect hardware: Works with NVIDIA, AMD, Apple Silicon, or CPU-only environments.
- Smart ranking: Balances VRAM fit, speed, and benchmark quality to produce practical recommendations.
- One-command chat: Start a chat session rapidly with whichllm run, which automatically selects the best available variant for your hardware.
- Code snippets: Print ready-to-run Python for any model with whichllm snippet.
- Live data: Fetch models directly from HuggingFace, with caching for performance.
- Benchmark-aware scoring: Integrates evaluation scores with a confidence-based dampening mechanism.
- Task profiles: Filter results by general, coding, vision, or math use cases.
- GPU simulation: Test a hypothetical GPU configuration via whichllm --gpu "RTX 4090".
- Hardware planning: Reverse lookup to determine what GPU you need for a given model.
- Upgrade planning: Compare your current machine with candidate GPUs to plan future improvements.
- JSON output: Output is friendly to pipelines, enabling scripting, logging, or integration into other systems.
Run & Snippet: practical examples
Try a guided session with a single command to chat with a model and observe how the tool handles selection and conversation. The demo illustrates the end-to-end flow: discovery, selection, loading, and interaction with a model.
- Basic chat with auto-pick: whichllm run "qwen 2.5 1.5b gguf"
- CPU-only mode: whichllm run # CPU-only mode
- Copy-paste Python snippet: whichllm snippet "qwen 7b"
- Copy-paste Python for a specific model: whichllm snippet "llama 3 8b gguf" --quant Q5_K_M
A concrete Python snippet example
- The following demonstrates how to instantiate a model using a copy-paste snippet, enabling quick integration into scripts:
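from llama_cpp import Llama

# Download the GGUF file from HuggingFace (cached after the first run) and load it.
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,       # context window in tokens
    n_gpu_layers=-1,  # offload all layers to the GPU
    verbose=False,
)

# Send a single-turn chat request and print the reply text.
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])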
Usage: flexible commands and filters
The command-line surface supports a wide array of options to tailor results:
- Auto-detect hardware and show best models: whichllm
- Simulate a GPU for planning: whichllm --gpu "RTX 4090"
- Specify a particular GPU variant (here the trailing 16 appears to select the 16 GB version of the card): whichllm --gpu "RTX 5060 16"
- CPU-only mode: whichllm --cpu-only
- Extended results: whichllm --top 20
- Quantization control: whichllm --quant Q4_K_M
- Speed filtering: whichllm --min-speed 30
- Evidence strictness: whichllm --evidence strict
- Output in JSON for pipelines: whichllm --json
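When these options are driven from a script, it can help to assemble the command line in one place. The helper below is a hypothetical sketch: the function and its parameter names are invented for illustration, it only uses flags that appear in this post, and it assumes (without confirmation from the project) that the flags can be combined freely.

```python
import shlex
from typing import List, Optional


def build_whichllm_args(
    gpu: Optional[str] = None,
    cpu_only: bool = False,
    top: Optional[int] = None,
    quant: Optional[str] = None,
    min_speed: Optional[float] = None,
    evidence: Optional[str] = None,
    profile: Optional[str] = None,
    json_output: bool = False,
) -> List[str]:
    """Build an argv list for whichllm from the flags shown in this post."""
    args = ["whichllm"]
    if gpu is not None:
        args += ["--gpu", gpu]
    if cpu_only:
        args.append("--cpu-only")
    if top is not None:
        args += ["--top", str(top)]
    if quant is not None:
        args += ["--quant", quant]
    if min_speed is not None:
        args += ["--min-speed", str(min_speed)]
    if evidence is not None:
        args += ["--evidence", evidence]
    if profile is not None:
        args += ["--profile", profile]
    if json_output:
        args.append("--json")
    return args


argv = build_whichllm_args(gpu="RTX 4090", top=20, json_output=True)
# shlex.join quotes the GPU name so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list directly to subprocess.run avoids shell-quoting problems with GPU names that contain spaces.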
Integrations and ecosystem
Ollama integration
- Use JSON output to map HuggingFace IDs to local Ollama model names:
- Pick the top HuggingFace model ID: whichllm --top 1 --json | jq -r '.models[0].model_id'
- Best coding model: whichllm --profile coding --top 1 --json | jq -r '.models[0].model_id'
- Note: Ollama model names may deviate from HuggingFace IDs, so a small mapping step is often required.
Shell aliases for convenience
- Add to your shell profile to streamline usage: alias bestllm='whichllm --top 1 --json | jq -r ".models[0].model_id"'
- Usage example: ollama run $(bestllm)
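The same lookup can be done from Python instead of jq, which also gives a natural place for the HuggingFace-to-Ollama mapping step. This is a sketch under assumptions: the only JSON structure relied on is the models[0].model_id path used in the jq examples, and the mapping table is a hypothetical, hand-maintained example that you would need to verify against the Ollama library.

```python
import json
import subprocess
from typing import Dict, Optional

# Hypothetical mapping from HuggingFace IDs to Ollama names; verify each entry yourself.
OLLAMA_NAMES: Dict[str, str] = {
    "Qwen/Qwen2.5-7B-Instruct": "qwen2.5:7b",
}


def top_model_id(json_text: str) -> str:
    """Extract models[0].model_id from whichllm's JSON output."""
    data = json.loads(json_text)
    models = data.get("models") or []
    if not models:
        raise ValueError("whichllm returned no models")
    return models[0]["model_id"]


def to_ollama_name(model_id: str, mapping: Dict[str, str] = OLLAMA_NAMES) -> Optional[str]:
    """Return the mapped Ollama name, or None when no mapping is known."""
    return mapping.get(model_id)


def fetch_top_model_id() -> str:
    """Run `whichllm --top 1 --json` and return the top model ID."""
    result = subprocess.run(
        ["whichllm", "--top", "1", "--json"],
        capture_output=True, text=True, check=True,
    )
    return top_model_id(result.stdout)
```

Returning None for unmapped IDs makes the missing-mapping case explicit, rather than passing a HuggingFace ID to ollama run and failing there.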
Scoring: how models are judged
Each model is assigned a 0-100 score driven by multiple factors:
- Benchmark quality: Core weight from LiveBench, Artificial Analysis, Aider, Vision, Arena ELO, and Open LLM Leaderboard.
- Model size: Log2-scaled proxy for world knowledge and capacity (MoE models use total parameters in some cases).
- Quantization: Discounts applied for lower-bit quantization.
- Evidence confidence: A multiplier based on whether the score is direct, variant, base, interpolated, or self-reported.
- Runtime fit: Discounts for partial offload or CPU-only configurations.
- Speed: A token-per-second metric adjusted by confidence and range data.
- Source trust: Adjustments for official vs repackaged sources.
- Popularity: A tie-breaker that fades as evidence strengthens.
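As a rough illustration of how an evidence-confidence multiplier and the discounts might interact, consider the sketch below. The multiplier values, the function name, and the purely multiplicative combination are all assumptions made for this example; the post does not state whichllm's actual weights or formula.

```python
# Hypothetical confidence multipliers per evidence level (not whichllm's real values).
EVIDENCE_CONFIDENCE = {
    "direct": 1.0,
    "variant": 0.9,
    "base": 0.8,
    "interpolated": 0.6,
    "self_reported": 0.5,
}


def dampened_score(
    benchmark: float,
    evidence: str,
    quant_discount: float = 1.0,  # e.g. below 1.0 for lower-bit quantization
    fit_discount: float = 1.0,    # e.g. below 1.0 for partial offload or CPU-only
) -> float:
    """Dampen a 0-100 benchmark score by evidence confidence and discounts."""
    score = benchmark * EVIDENCE_CONFIDENCE[evidence] * quant_discount * fit_discount
    # Clamp to the 0-100 range used for final scores.
    return max(0.0, min(100.0, score))


# A directly benchmarked model can outrank a nominally stronger one whose score is only interpolated.
direct = dampened_score(70, "direct")
interpolated = dampened_score(80, "interpolated")
print(direct > interpolated)
```

This shows why weakly evidenced scores do not automatically win: the less direct the evidence, the more the raw number is pulled down before ranking.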
Special score markers indicate data provenance and confidence
- ~ or yellow: No direct benchmark; score inherited or interpolated within the family
- !sr or bright yellow: Uploader-reported benchmark only, not independently verified
- ? or red: No benchmark data available
Speed markers explain the reliability of tok/s estimates
Documentation and how-to resources
whichllm provides a comprehensive documentation suite to aid both new users and advanced integrators:
- CLI reference
- How it works
- Scoring
- Hardware detection and simulation
- Run and snippet
- Troubleshooting
How it works: data pipelines, rankings, and structure
Data pipeline
- Model fetching: whichllm fetches popular models from HuggingFace including text-generation models and special GGUF-filtered results, with a separate path for vision models when appropriate.
- Benchmark sources: Live data (LiveBench, Artificial Analysis Index, Aider) merged when available, with a curated multimodal index. A frozen tier (Open LLM Leaderboard v2, Chatbot Arena ELO) provides stability for offline or slower connections.
- Benchmark evidence: Five resolution levels (direct, variant, base_model, line_interp, self_reported) to maintain transparency and guard against misleading claims. Inheritance is rejected when params diverge beyond a threshold.
- Cache: A local cache in ~/.cache/whichllm/ stores models.json (6h TTL) and benchmark.json (24h TTL) for faster repeated queries.
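The TTL-based cache described above can be sketched as a simple freshness check. The directory and TTL values come from the description; using the file's modification time as the cache timestamp, and the helper name is_fresh, are assumptions for illustration rather than whichllm's confirmed behavior.

```python
import time
from pathlib import Path
from typing import Optional

CACHE_DIR = Path.home() / ".cache" / "whichllm"

# TTLs from the description above: 6 hours for models, 24 hours for benchmarks.
TTL_SECONDS = {
    "models.json": 6 * 3600,
    "benchmark.json": 24 * 3600,
}


def is_fresh(path: Path, ttl_seconds: int, now: Optional[float] = None) -> bool:
    """Return True if the file exists and is younger than its TTL (by mtime)."""
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < ttl_seconds


for name, ttl in TTL_SECONDS.items():
    state = "fresh" if is_fresh(CACHE_DIR / name, ttl) else "stale or missing"
    print(f"{name}: {state}")
```

A stale or missing file would trigger a refetch from HuggingFace or the benchmark sources; deleting the cache directory has the same effect.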
Ranking engine
- Hardware detection: Uses NVIDIA, AMD, Apple Silicon detection, as well as CPU and RAM metrics.
- VRAM estimation: A model's VRAM usage is computed as weights plus KV cache, activation, and framework overhead (~500MB baseline).
- Compatibility: Fully GPU-accelerated, partial offload, or CPU-only modes; compatibility checks ensure stable operation across environments.
- Speed: tok/s estimation derived from memory bandwidth, quantization, backend, and MoE activity.
- Scoring: Combines benchmark quality, size considerations, quantization penalties, fit type, speed, popularity, and source trust to yield a final score.
- Backend filters: Apple Silicon and CPU-only environments are conservative, often locking to GGUF for stability; Linux+NVIDIA setups allow AWQ and GPTQ variants.
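The bandwidth-based speed estimate can be approximated with a common rule of thumb: token generation is usually limited by how fast the weights can be read from memory, so tok/s is roughly memory bandwidth divided by the bytes read per token. The sketch below applies that rule; the function name and the efficiency factor are hypothetical, and whichllm's real model also accounts for backend differences that are not modeled here.

```python
from typing import Optional


def estimate_tok_per_s(
    bandwidth_gb_s: float,
    total_params_b: float,
    bits_per_weight: float,
    active_params_b: Optional[float] = None,  # set for MoE models
    efficiency: float = 0.6,                  # hypothetical fraction of peak bandwidth achieved
) -> float:
    """Bandwidth-bound decode speed estimate in tokens per second."""
    # MoE models only read the active experts' weights for each token.
    params_b = active_params_b if active_params_b is not None else total_params_b
    # Billions of parameters times bytes per weight gives GB read per token.
    gb_per_token = params_b * bits_per_weight / 8
    return efficiency * bandwidth_gb_s / gb_per_token


dense = estimate_tok_per_s(1000, total_params_b=30, bits_per_weight=4.5)
moe = estimate_tok_per_s(1000, total_params_b=30, bits_per_weight=4.5, active_params_b=3)
print(f"dense: {dense:.1f} tok/s, MoE: {moe:.1f} tok/s")
```

Under this approximation, an MoE model with few active parameters is estimated to generate tokens much faster than a dense model of the same total size, which is consistent with the sample ranking where the 30B-A3B model stands out for speed.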
Project structure (high-level overview)
- src/whichllm/
- cli.py: Typer-based command-line interface with main, plan, run, snippet, hardware commands
- constants.py: Hardware-related constants such as bandwidth estimates and quantization bytes
- hardware/: detectors and simulators for GPU/CPU/RAM
- models/: HuggingFace fetchers, benchmarks, family grouping, and cache
- engine/: core VRAM, compatibility, performance, quantization, ranking, and type definitions
- output/: display logic for rich tables, JSON, and hardware/plan displays
- assets/: demo visuals for the blog and demonstration GIFs
Project culture and development
Contributing: whichllm welcomes community contributions. Documentation and guidelines are available in the repository. You can set up a local development environment, run tests, and contribute improvements to ranking, benchmarks, or integrations.
Development steps (quick reference)
- Clone the repository
- Install dependencies (using uv sync --dev)
- Run unit tests with uv run pytest
- Iterate on features, benchmarks, or integrations
- Contribute via pull requests and engage with maintainers
Support, licensing, and community
Support for users and contributors is encouraged. If whichllm helps you pick the right model or avoid a poor hardware guess, sponsorships help sustain ongoing maintenance, benchmark updates, packaging, and broader hardware coverage. The project remains open-source under the MIT license.
License and requirements
- License: MIT
- Requirements: Python 3.11+, NVIDIA GPU detection via nvidia-ml-py (included by default), automatic detection for AMD and Apple Silicon
- The project emphasizes live data integration with HuggingFace while offering offline fallbacks when needed.
Images and visuals in this post
- Top badges (license, Python version, tests, etc.) provide immediate status cues and project details.
- A demo GIF showcases run-time behavior and user experience: assets/demo.gif
- A run-through GIF demonstrates fast interaction and the one-command experience: assets/demo-run.gif
- A Star History chart visually communicates community adoption over time.
Conclusion: a practical tool for local LLM exploration
whichllm is designed for users who want practical, data-backed guidance on which local LLM to run on their hardware. It blends hardware auto-detection with live benchmarking data, ensuring that recommendations reflect current reality rather than a static, size-based heuristic. The combination of a robust data pipeline, rigorous scoring, flexible command-line options, and friendly integrations makes whichllm a compelling companion for anyone exploring local LLMs—from hobbyists to professionals building AI-powered applications.
If you want to dive in, start with a quick setup, explore the “whichllm” command, and watch how the tool adapts to your machine. The project remains openly accessible, inviting you to contribute, sponsor, or simply star the project to help others discover what works best on their rigs.
Together, the Star History chart, the demo GIFs, and the README’s visual shields give you a quick sense of the project’s momentum, capabilities, and practical value. Whether you’re planning a hardware upgrade, evaluating a new GPU, or just curious about the most effective local models, whichllm guides you with evidence-based, real-world data and a streamlined, scriptable interface.
Repository: https://github.com/Andyyyy64/whichllm
