Find the Best Local LLM for Your Hardware

whichllm: Find the best local LLM that actually runs on your hardware

whichllm auto-detects your GPU, CPU, and RAM, then ranks the HuggingFace models that fit your system. Rankings are built on real benchmark results rather than simple size heuristics, and model data is fetched live, so recommendations reflect current models and runtimes. A Japanese version of the README is also available.

Introduction to whichllm

whichllm is aimed at developers, researchers, and enthusiasts who want to experiment with local models without guessing which one will actually run well on their hardware. Instead of listing the largest models that fit, it ranks candidates on evidence: VRAM fit, memory usage, speed, and benchmark quality. It can also simulate hardware you do not own, help plan upgrades, and print ready-to-paste Python snippets and script-friendly JSON for your workflows. The project is open source under the MIT license and free to use.

Capabilities at a glance

The tool automatically detects hardware: NVIDIA, AMD, Apple Silicon, or CPU-only, and then selects suitable models based on your actual hardware constraints.
You can simulate a GPU before buying new hardware, or install and use whichllm in a matter of minutes.
The project includes ready-to-use commands, a rich set of features, and an API-friendly JSON output for scripting and automation.

Quick start: one command to begin

You don’t need a pre-configured project to begin exploring models. A single command lays the groundwork:

Quick start (no project setup): uvx whichllm@latest
GPU simulation (planning for hardware): uvx whichllm@latest --gpu "RTX 4090"

If you plan to use it regularly, install it as a uv tool and upgrade it the same way:

Install: uv tool install whichllm
Upgrade an existing install: uv tool upgrade whichllm

There are other install paths for different environments:

brew install andyyyy64/whichllm/whichllm
pip install whichllm

Visuals show the experience

See how the tool looks in action with a demonstration GIF: assets/demo.gif
For a run-time demonstration of a chat session, there is a dedicated GIF: assets/demo-run.gif

Common workflows after installation

Once whichllm is installed, you can run it directly from the command line. For one-off runs, replace whichllm with uvx whichllm@latest in any of the commands below; uvx runs the latest release in a temporary environment without installing it.

Best models for this machine: whichllm
Pretend you have a specific GPU: whichllm --gpu "RTX 4090"
Compare upgrade candidates: whichllm upgrade "RTX 4090" "RTX 5090" "H100"
Find the GPU needed for a model: whichllm plan "llama 3 70b"
Start a chat with a model: whichllm run "qwen 2.5 1.5b gguf"
Print a copy-paste Python snippet: whichllm snippet "qwen 7b"
Return JSON for scripts: whichllm --top 1 --json
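
For scripting from Python rather than the shell, the JSON mode can be wrapped in a small helper. This is a minimal sketch, not part of whichllm: it assumes the payload has a top-level models list whose entries carry a model_id field (the shape implied by the jq examples in the Ollama section), and top_model_id and run_whichllm are hypothetical helper names.

```python
import json
import subprocess
from typing import Optional


def top_model_id(payload: str) -> Optional[str]:
    """Return the first model ID from whichllm's JSON output, or None if empty."""
    data = json.loads(payload)
    # Assumed shape: {"models": [{"model_id": "..."}, ...]}
    models = data.get("models", [])
    return models[0]["model_id"] if models else None


def run_whichllm(*args: str) -> str:
    """Run the whichllm CLI with --json and return stdout (needs whichllm on PATH)."""
    result = subprocess.run(
        ["whichllm", *args, "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

With whichllm installed, top_model_id(run_whichllm("--top", "1")) would return the top-ranked model ID for the current machine. Keeping the parsing separate from the subprocess call makes the parser easy to test against a saved payload.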

See it in action

A sample output illustrates ranking and performance:
1) Qwen/Qwen3.6-27B with high score and strong speed
2) Qwen/Qwen3-32B with slightly different characteristics
3) Qwen/Qwen3-30B-A3B with impressive speed on larger configurations
The 32B model may fit your card, but whichllm ranks the 27B as the top choice based on real benchmarks, recency, and overall suitability. This demonstrates the core philosophy: model selection is about actual performance, not just size.

What can I run? Real top picks and live data

whichllm maintains a snapshot of top model recommendations (2026-05), but your results stay in sync with live HuggingFace data. It’s not a static list; it reflects the field as it evolves. The rankings depend on three pillars: actual VRAM fit, benchmark quality, and runtime characteristics, all blended with a cautious confidence model to avoid overclaiming results.

Hardware considerations (illustrative examples)

RTX 5090 with 32 GB VRAM often yields top performers such as Qwen3.6-27B in high-quality configurations and strong throughput (about 40 tokens per second in some cases).
RTX 4090 or 3090 with 24 GB VRAM commonly gets Qwen3.6-27B as the top pick, a strong balance between quality and speed (around 27 tokens per second).
RTX 4060 with 8 GB VRAM may favor lighter models like Qwen3-14B, which still run at solid speed within the limited memory.
Apple M3 Max with 36 GB of unified memory can run similar high-quality models, but with lower throughput due to architectural differences; the ranking still prioritizes quality and compatibility.
CPU-only configurations rely on optimized MoE or smaller variants and typically yield lower throughput but are still usable in constrained environments.

Why whichllm? Core philosophy and design

Evidence-based ranking, not merely a size heuristic: The top pick is chosen from merged real benchmarks (LiveBench, Artificial Analysis, Aider, multimodal/vision, Chatbot Arena ELO, Open LLM Leaderboard) rather than simply the biggest model that fits.
Recency-aware: whichllm demotes older generations as new ones appear, with the snapshot date clearly printed under each ranking to reveal freshness.
Evidence-graded and guarded: Each score is annotated with its reliability context (direct / variant / base / interpolated / self-reported) and dampened by confidence, so you can gauge how much to trust a particular rating.
Architecture-aware estimates: VRAM calculations consider weights, KV cache, activation, and overhead. Speed is modeled with bandwidth, quantization, backend differences, and whether MoE active elements are in use.
One-command, scriptable: The CLI provides outputs suitable for parsing. You can pipe to tools like jq for automation in pipelines.
Live data: Models are fetched directly from HuggingFace, with curated offline fallbacks if needed.
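
The architecture-aware VRAM estimate described above (weights plus KV cache, activations, and overhead) can be sketched as follows. This is an illustration under stated assumptions, not whichllm's actual code: the KV-cache term uses the standard transformer formula, the activation figure is a placeholder, and the model shape in the example is an illustrative 7B-class configuration. Only the ~500 MB overhead baseline comes from this post.

```python
def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes_per_elem: int = 2,   # assumption: fp16 KV cache
    activation_gb: float = 0.5,   # placeholder value, not whichllm's
    overhead_gb: float = 0.5,     # ~500 MB framework baseline from this post
) -> float:
    """Rough VRAM need in GB: weights + KV cache + activations + overhead."""
    # Weights: parameter count times bytes per (quantized) weight.
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: keys and values (factor 2) for every layer, KV head, and position.
    kv_gb = (
        2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes_per_elem / 1e9
    )
    return weights_gb + kv_gb + activation_gb + overhead_gb


# Illustrative 7B-class model at ~4.5 bits per weight with a 4096-token context.
needed = estimate_vram_gb(7, 4.5, n_layers=28, n_kv_heads=4, head_dim=128, context_len=4096)
print(f"{needed:.2f} GB")  # about 5.17 GB
```

The point of separating the terms is that the KV cache grows with context length while the weights do not, so the same model can fit at 4K context and fail at 32K.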

Features that set whichllm apart

Auto-detect hardware: Works with NVIDIA, AMD, Apple Silicon, or CPU-only environments.
Smart ranking: Balances VRAM fit, speed, and benchmark quality to produce practical recommendations.
One-command chat: Start a chat session rapidly with whichllm run, which automatically selects the best available variant for your hardware.
Code snippets: Print ready-to-run Python for any model with whichllm snippet.
Live data: Fetch models directly from HuggingFace, with caching for performance.
Benchmark-aware scoring: Integrates evaluation scores with a confidence-based dampening mechanism.
Task profiles: Filter results by general, coding, vision, or math use cases.
GPU simulation: Test a hypothetical GPU configuration via whichllm --gpu "RTX 4090".
Hardware planning: Reverse lookup to determine what GPU you need for a given model.
Upgrade planning: Compare your current machine with candidate GPUs to plan future improvements.
JSON output: Output is friendly to pipelines, enabling scripting, logging, or integration into other systems.

Run & Snippet: practical examples

Try a guided session with a single command to chat with a model and observe how the tool handles selection and conversation. The demo illustrates the end-to-end flow: discovery, selection, loading, and interaction with a model.

Basic chat with auto-pick: whichllm run "qwen 2.5 1.5b gguf"
CPU-only mode: whichllm run # CPU-only mode
Copy-paste Python snippet: whichllm snippet "qwen 7b"
Copy-paste Python for a specific model: whichllm snippet "llama 3 8b gguf" --quant Q5_K_M

A concrete Python snippet example

The following demonstrates how to instantiate a model using a copy-paste snippet, enabling quick integration into scripts:

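# Requires llama-cpp-python; from_pretrained also needs huggingface-hub to download the file.
from llama_cpp import Llama

# Download (or reuse from the local cache) the GGUF file and load it.
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,        # context window in tokens
    n_gpu_layers=-1,   # offload all layers to the GPU
    verbose=False,
)

# Send one chat turn and print the assistant's reply.
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])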

Usage: flexible commands and filters

The command-line surface supports a wide array of options to tailor results:

Auto-detect hardware and show best models: whichllm
Simulate a GPU for planning: whichllm --gpu "RTX 4090"
Specify a particular variant of a GPU: whichllm --gpu "RTX 5060 16"
CPU-only mode: whichllm --cpu-only
Extended results: whichllm --top 20
Quantization control: whichllm --quant Q4_K_M
Speed filtering: whichllm --min-speed 30
Evidence strictness: whichllm --evidence strict
Output in JSON for pipelines: whichllm --json
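
When these options are driven from a script, it can help to assemble the argument list programmatically instead of concatenating strings. The helper below is a hypothetical sketch (build_whichllm_args is not part of whichllm); it uses only flags that appear in this post and does not validate their values.

```python
import shlex
from typing import List, Optional


def build_whichllm_args(
    gpu: Optional[str] = None,
    cpu_only: bool = False,
    top: Optional[int] = None,
    quant: Optional[str] = None,
    min_speed: Optional[float] = None,
    evidence: Optional[str] = None,
    profile: Optional[str] = None,
    json_output: bool = False,
) -> List[str]:
    """Build an argv list for whichllm from the flags listed in this post."""
    args = ["whichllm"]
    if gpu is not None:
        args += ["--gpu", gpu]
    if cpu_only:
        args.append("--cpu-only")
    if top is not None:
        args += ["--top", str(top)]
    if quant is not None:
        args += ["--quant", quant]
    if min_speed is not None:
        args += ["--min-speed", str(min_speed)]
    if evidence is not None:
        args += ["--evidence", evidence]
    if profile is not None:
        args += ["--profile", profile]
    if json_output:
        args.append("--json")
    return args


# shlex.join quotes arguments containing spaces, so the result is safe to paste.
print(shlex.join(build_whichllm_args(gpu="RTX 4090", top=5, json_output=True)))
# whichllm --gpu 'RTX 4090' --top 5 --json
```

Passing the list form to subprocess.run avoids shell quoting problems with GPU names that contain spaces.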

Integrations and ecosystem

Ollama integration

Use JSON output to map HuggingFace IDs to local Ollama model names:
Pick the top HuggingFace model ID: whichllm --top 1 --json | jq -r '.models[0].model_id'
Best coding model: whichllm --profile coding --top 1 --json | jq -r '.models[0].model_id'
Note: Ollama model names may deviate from HuggingFace IDs, so a small mapping step is often required.
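
The mapping step mentioned in the note can be as small as a lookup table. The sketch below is an assumption-laden illustration: the table is hand-maintained, its two entries are examples only, and to_ollama_name is a hypothetical helper. Check each tag against the Ollama library or your local ollama list before relying on it.

```python
from typing import Optional

# Hypothetical, hand-maintained mapping from HuggingFace model IDs to Ollama tags.
# The entries are illustrative; extend and verify them for the models you use.
HF_TO_OLLAMA = {
    "Qwen/Qwen2.5-7B-Instruct": "qwen2.5:7b",
    "meta-llama/Meta-Llama-3-8B-Instruct": "llama3:8b",
}


def to_ollama_name(hf_id: str) -> Optional[str]:
    """Return the Ollama tag for a HuggingFace ID, or None if it is unmapped."""
    return HF_TO_OLLAMA.get(hf_id)


print(to_ollama_name("Qwen/Qwen2.5-7B-Instruct"))  # qwen2.5:7b
```

Returning None for unmapped IDs lets the caller decide whether to skip the model or fall back to another runtime, rather than passing an invalid name to ollama run.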

Shell aliases for convenience

Add to your shell profile to streamline usage: alias bestllm='whichllm --top 1 --json | jq -r ".models[0].model_id"'
Usage example: ollama run $(bestllm)

Scoring: how models are judged

Each model is assigned a 0-100 score driven by multiple factors:

Benchmark quality: Core weight from LiveBench, Artificial Analysis, Aider, Vision, Arena ELO, and Open LLM Leaderboard.
Model size: Log2-scaled proxy for world knowledge and capacity (MoE models use total parameters in some cases).
Quantization: Discounts applied for lower-bit quantization.
Evidence confidence: A multiplier based on whether the score is direct, variant, base, interpolated, or self-reported.
Runtime fit: Discounts for partial offload or CPU-only configurations.
Speed: A token-per-second metric adjusted by confidence and range data.
Source trust: Adjustments for official vs repackaged sources.
Popularity: A tie-breaker that fades as evidence strengthens.
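
To make the dampening idea concrete, here is a toy sketch of how a benchmark score might be scaled by evidence confidence, quantization, and runtime fit. Every number and name in it is made up for illustration; whichllm's real weights and formula are not given in this post, and the sketch leaves out model size, speed, source trust, and popularity.

```python
# All multipliers below are invented for illustration, not whichllm's values.
EVIDENCE_CONFIDENCE = {
    "direct": 1.0,
    "variant": 0.9,
    "base": 0.8,
    "interpolated": 0.65,
    "self_reported": 0.5,
}
QUANT_DISCOUNT = {"Q8_0": 1.0, "Q5_K_M": 0.97, "Q4_K_M": 0.95}
FIT_DISCOUNT = {"full_gpu": 1.0, "partial_offload": 0.8, "cpu_only": 0.6}


def toy_score(benchmark: float, evidence: str, quant: str, fit: str) -> float:
    """Scale a 0-100 benchmark score by confidence and discounts, clamped to 0-100."""
    raw = (
        benchmark
        * EVIDENCE_CONFIDENCE[evidence]
        * QUANT_DISCOUNT[quant]
        * FIT_DISCOUNT[fit]
    )
    return max(0.0, min(100.0, raw))


# A directly benchmarked 70 outranks a self-reported 90 once confidence is applied.
print(toy_score(70, "direct", "Q8_0", "full_gpu") > toy_score(90, "self_reported", "Q8_0", "full_gpu"))  # True
```

The design point is that weaker evidence shrinks a score instead of excluding the model, so unverified claims can still appear in the list but cannot outrank well-measured models on their own say-so.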

Special score markers indicate data provenance and confidence

~ or yellow: No direct benchmark; score inherited or interpolated within the family
!sr or bright yellow: Uploader-reported benchmark only, not independently verified
? red: No benchmark data available

Speed markers explain the reliability of tok/s estimates

Documentation and how-to resources

whichllm ships documentation for both new users and advanced integrators:

CLI reference
How it works
Scoring
Hardware detection and simulation
Run and snippet
Troubleshooting

How it works: data pipelines, rankings, and structure

Data pipeline

Model fetching: whichllm fetches popular models from HuggingFace including text-generation models and special GGUF-filtered results, with a separate path for vision models when appropriate.
Benchmark sources: Live data (LiveBench, Artificial Analysis Index, Aider) merged when available, with a curated multimodal index. A frozen tier (Open LLM Leaderboard v2, Chatbot Arena ELO) provides stability for offline or slower connections.
Benchmark evidence: Five resolution levels (direct, variant, base_model, line_interp, self_reported) keep provenance transparent and guard against misleading claims. Inheritance is rejected when parameter counts diverge beyond a threshold.
Cache: A local cache in ~/.cache/whichllm/ stores models.json (6h TTL) and benchmark.json (24h TTL) for faster repeated queries.
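
The TTL-based cache can be illustrated with a simple freshness check. This is a sketch under assumptions: it treats a file as fresh when its modification time is within the TTL, which may differ from how whichllm actually tracks cache age, and is_fresh is a hypothetical helper. The directory, file names, and TTLs are the ones stated above.

```python
import time
from pathlib import Path
from typing import Optional

CACHE_DIR = Path.home() / ".cache" / "whichllm"
# TTLs from this post: models.json for 6 hours, benchmark.json for 24 hours.
TTL_SECONDS = {"models.json": 6 * 3600, "benchmark.json": 24 * 3600}


def is_fresh(path: Path, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the file exists and was modified within the last ttl_seconds."""
    if not path.exists():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) < ttl_seconds


for name, ttl in TTL_SECONDS.items():
    state = "fresh" if is_fresh(CACHE_DIR / name, ttl) else "stale or missing"
    print(f"{name}: {state}")
```

The optional now argument exists only to make the check deterministic in tests; normal callers can omit it.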

Ranking engine

Hardware detection: Uses NVIDIA, AMD, Apple Silicon detection, as well as CPU and RAM metrics.
VRAM estimation: A model's VRAM usage is computed as weights plus KV cache, activations, and framework overhead (~500 MB baseline).
Compatibility: Fully GPU-accelerated, partial offload, or CPU-only modes; compatibility checks ensure stable operation across environments.
Speed: The tok/s estimate is derived from memory bandwidth, quantization, backend, and MoE activity.
Scoring: Combines benchmark quality, size considerations, quantization penalties, fit type, speed, popularity, and source trust to yield a final score.
Backend filters: Apple Silicon and CPU-only environments are conservative, often locking to GGUF for stability; Linux+NVIDIA setups allow AWQ and GPTQ variants.
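
The compatibility and speed steps above can be sketched with two small functions. Both are simplified illustrations, not whichllm's implementation: the fit classifier compares only total memory figures, and the speed estimate uses the common bandwidth-bound approximation for decoding (each generated token reads the active weights once), with a made-up efficiency factor. All function names are hypothetical.

```python
def classify_fit(required_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Classify how a model of required_gb would run on the given memory budget."""
    if vram_gb >= required_gb:
        return "full_gpu"
    if vram_gb > 0 and vram_gb + ram_gb >= required_gb:
        return "partial_offload"  # some layers on the GPU, the rest in system RAM
    if ram_gb >= required_gb:
        return "cpu_only"
    return "does_not_fit"


def estimate_tokens_per_second(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bits_per_weight: float,
    efficiency: float = 0.6,  # made-up factor for backend and kernel overhead
) -> float:
    """Bandwidth-bound decode estimate: usable bandwidth / bytes read per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return efficiency * bandwidth_gb_s / gb_read_per_token


# Dense 8B model versus an MoE with 3B active parameters, both at 4 bits per weight.
print(round(estimate_tokens_per_second(1000, 8, 4.0)))  # 150
print(round(estimate_tokens_per_second(1000, 3, 4.0)))  # 400
```

Using active rather than total parameters is what lets a sparse MoE model post a higher estimate than a smaller dense one, even though all of its weights must still fit in memory.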

Project structure (high-level overview)

src/whichllm/
cli.py: Typer-based command-line interface with main, plan, run, snippet, hardware commands
constants.py: Hardware-related constants such as bandwidth estimates and quantization bytes
hardware/: detectors and simulators for GPU/CPU/RAM
models/: HuggingFace fetchers, benchmarks, family grouping, and cache
engine/: core VRAM, compatibility, performance, quantization, ranking, and type definitions
output/: display logic for rich tables, JSON, and hardware/plan displays
assets/: demo visuals for the blog and demonstration GIFs

Project culture and development

Contributing: whichllm welcomes community contributions. Documentation and guidelines are available in the repository. You can set up a local development environment, run tests, and contribute improvements to ranking, benchmarks, or integrations.

Development steps (quick reference)

Clone the repository
Install dependencies (using uv sync --dev)
Run unit tests with uv run pytest
Iterate on features, benchmarks, or integrations
Contribute via pull requests and engage with maintainers

Support, licensing, and community

If whichllm helps you pick the right model or avoid a poor hardware guess, consider sponsoring the project: sponsorships sustain ongoing maintenance, benchmark updates, packaging, and broader hardware coverage. The project remains open source under the MIT license.

License and requirements

License: MIT
Requirements: Python 3.11+, NVIDIA GPU detection via nvidia-ml-py (included by default), automatic detection for AMD and Apple Silicon
The project emphasizes live data integration with HuggingFace while offering offline fallbacks when needed.

Images and visuals in this post

Top badges (license, Python version, tests, etc.) provide immediate status cues and project details.
A demo GIF showcases run-time behavior and user experience: assets/demo.gif
A run-through GIF demonstrates fast interaction and the one-command experience: assets/demo-run.gif
A Star History chart shows community adoption over time.

Conclusion: a practical tool for local LLM exploration

whichllm is designed for users who want practical, data-backed guidance on which local LLM to run on their hardware. It blends hardware auto-detection with live benchmarking data, ensuring that recommendations reflect current reality rather than a static, size-based heuristic. The combination of a robust data pipeline, rigorous scoring, flexible command-line options, and friendly integrations makes whichllm a compelling companion for anyone exploring local LLMs—from hobbyists to professionals building AI-powered applications.

If you want to dive in, start with a quick setup, explore the “whichllm” command, and watch how the tool adapts to your machine. The project remains openly accessible, inviting you to contribute, sponsor, or simply star the project to help others discover what works best on their rigs.

The Star History chart, the demo GIFs, and the README badges together give a quick sense of the project's momentum and capabilities. Whether you are planning a hardware upgrade, evaluating a new GPU, or just curious about the most effective local models, whichllm guides you with evidence-based, real-world data and a streamlined, scriptable interface.


GitHub - Andyyyy64/whichllm: Find the Best Local LLM for Your Hardware
