DwarfStar 4: A Native Inference Engine for DeepSeek V4 Flash
DwarfStar 4: A DeepSeek V4 Flash Native Inference Engine
Introduction DwarfStar 4 (DS4) is a compact, purpose-built native inference engine tailored specifically for DeepSeek V4 Flash. It is not a generic GGUF runner, nor a wrapper around another runtime. DS4 is deliberately self-contained, focusing on delivering a fast, correct, and cohesive experience for loading, prompting, tool calling, KV state management (both RAM and on-disk), server APIs, and an integrated coding agent. The project also provides tooling for GGUF and imatrix generation, as well as quality and speed testing. The architecture centers on a tightly integrated stack designed to work with coding agents or the provided CLI interface, ensuring end-to-end usability in high-end personal machines with substantial memory.
Motivations: Why a standalone engine for DeepSeek v4 Flash? DS4 was born from a clear set of observations about DeepSeek v4 Flash relative to other models. Its creators argue that the model is uniquely suited to a dedicated engine for several reasons:
- Speed with fewer active parameters: DeepSeek v4 Flash trades some parameter count for speed, delivering faster inference in practice.
- Thinking mode efficiency: By avoiding “max thinking,” the model frequently produces a thinking section that’s shorter than other models, often around a fifth or less of the length for similar problems. Crucially, the thinking section scales with problem complexity, allowing feasible usage with thinking enabled even when other models become impractical.
- Exceptional context window: The model supports a context window of one million tokens, enabling long-context reasoning and recall.
- Knowledge at edge: As a large model, it benefits from sampling near the edge of knowledge, yielding nuanced responses in topics like Italian culture or politics that reveal parameter advantages over smaller models.
- Language quality: It delivers notably strong English and Italian prose, giving the feel of a frontier model.
- Compressed KV cache and on-disk persistence: The KV cache is highly compressed, enabling long-context inference on local machines and on-disk persistence for the KV store.
- Quantization readiness: It performs well with a 2-bit quantization scheme, when quantized in a targeted way (details described later). This supports running on machines with 128 GB RAM (and anecdotal reports of 96 GB working, even with extremely large contexts).
- Ongoing evolution: The project anticipates updated v4 Flash releases that improve performance and capabilities.
A few pragmatic notes accompany these motivations:
- The local-inference landscape includes many capable projects, but DS4 commits to a narrow, model-at-a-time approach with official-vector validation (logits aligned to the official implementation), long-context testing, and meaningful agent integration.
- The project credits GPT-5.5-assisted development and human-led testing, acknowledging that AI-assisted code is a collaboration, not a presentable substitute for human engineering.
- The KV cache concept is central: the cache is treated as a first-class disk citizen rather than a RAM-only artifact, enabling practical long-context inference on consumer hardware.
- The end-to-end vision emphasizes three components working well out of the box: an inference engine with HTTP API, GGUF files finely tuned to the engine and its assumptions, and testing/validation with coding-agent implementations.
- The engine is designed to work specifically with the provided GGUFs; it is not a general-purpose loader for arbitrary GGUF files with differing tensor layouts or quantization schemes.
- The project acknowledges beta quality status and encourages traceable debugging via tracing flags and issue reporting.
Acknowledgments The DS4 project owes significant debt to LLama.cpp and GGML. Although ds4.c does not directly link against GGML, it exists thanks to the path opened by the llama.cpp project, its kernels, quantization formats, GGUF ecosystem, and the engineering work in that space. The team retains GGML copyright notices where appropriate and relies on the GGUF quant layouts, CPU quant/dot logic, and certain kernels as essential references.
Model Weights and Quantization This implementation is tightly coupled with the DeepSeek V4 Flash GGUFs published for this project. It is not a universal GGUF loader; the tensor layout, quantization mix, metadata, and optional MTP state are all expectations of the engine, not guarantees for arbitrary GGUFs.
- Quantization strategy: The 2-bit quantizations are asymmetric and selective. Routed MoE experts are quantized with an up-gate at IQ2XXS and a down at Q2K. The majority of the model’s parameters fall into this quantized bucket, while other components (shared experts, projections, routing) are left untouched to preserve quality.
- Imatrix preference: Imatrix variants are recommended for download and use, with legacy non-imatrix quants available if needed.
- Download workflow: A helper script downloads the GGUFs from HuggingFace, stores them under ./gguf/, and updates ds4flash.gguf to point to the chosen model variant. Commands look like:
- sh ./download_model.sh q2-imatrix
- sh ./download_model.sh q4-imatrix
- sh ./download_model.sh q2
- sh ./download_model.sh q4
- Model variants:
- q2-imatrix and q4-imatrix: imatrix-tuned variants for larger RAM footprints (96–128 GB for q2-imatrix; 256+ GB for q4-imatrix).
- q2 and q4: legacy non-imatrix quantizations.
- Optional speculative decoding: The MTP (Speculative Decoding) support GGUF can be fetched with sh ./download_model.sh mtp, enabled via --mtp. The speculative path is experimental, gated for correctness, and currently offers at most modest speedups.
Building and Running
- Build steps vary by platform:
- macOS: make
- Linux (CUDA target): make cuda-spark for DGX Spark / GB10, or make cuda-generic for standard CUDA GPUs
- CPU-only path for correctness checks (not production): make cpu or run ds4 with --cpu
- Run ds4flash.gguf as the default model path. Use -m to select different GGUFs under ./gguf/.
- A note about macOS stability: current macOS virtual memory implementations can crash the kernel when running CPU code. This is a platform-specific caveat.
Speed and Benchmarks Speed benchmarking demonstrates how DS4 performs under different configurations. The project includes a speed table and a representative image that captures performance for different hardware configurations. A key image is embedded here to illustrate t/s performance across machines:
- Image:
Sample benchmarking setup (single-run Metal CLI numbers with specific context and decoding settings):
- Context: --ctx 32768
- Thinking: --nothink (or think enabled with default thinking)
- Greedy decoding: -n 256
- Short prompts vs long prompts test chunked prefill and long-context decoding
- Note: Q4 benchmarks require larger memory classes; some entries show N/A for machines that cannot meet requirements.
Native Agent: A Session-Centric Approach DS4 includes a native coding agent that diverges from conventional architectures. Rather than communicating via sockets or HTTP boundaries, the inference flow is controlled from within the agent itself, and the session is represented by the on-disk KV cache. This design yields several advantages:
- Low-latency experience: Rendering and tool calls occur with minimal boundary overhead, bounded primarily by prefill speed limits.
- Live prefill progress: A visible progress bar during prefill provides real-time feedback on how the session is advancing.
- Natively handled tools: DSML tool calls are managed inside the LLM pathway, eliminating the need for translation layers.
- Guaranteed KV consistency: The current on-disk state is the truth, reducing risk of KV-cache mismatch.
- Model-tuned behavior: The native agent is optimized for this particular model and its idiosyncrasies.
- Seamless session switching: You can switch sessions with /list and /switch without requiring a prefill, enabling flexible multi-session workflows.
Server and API: HTTP Interface with OpenAI/Anthropic Compatibility DS4 ships with a local server that emulates interfaces familiar to developers and tools:
- Endpoints and capabilities:
- GET /v1/models
- GET /v1/models/deepseek-v4-flash
- POST /v1/chat/completions
- POST /v1/responses
- POST /v1/completions
- POST /v1/messages
- Interaction models:
- Chat completions accept standard OpenAI messages plus tool usage information, streaming option, and various generation controls.
- Responses endpoint is aligned with Codex expectations, returning a structured event lifecycle suitable for downstream clients.
- Anthropic endpoints (Claude Code style) are supported, with thinking blocks streamed in a structured DSML form.
- Tooling and canonicalization:
- DS4 emits tool calls as DSML text blocks, but client agents translate these into canonical OpenAI/Anthropic JSON tool-call objects for compatibility.
- An unguessable tool ID is used to map back to the exact DSML block previously sampled, enabling robust exact replay across server restarts and restreams.
- Canonicalization acts as a fallback: if exact replay is unavailable, a deterministic DSML form is generated from the JSON tool object. This maintains transcript integrity even when exact DSML blocks are missing.
- Replay and synchronization:
- If the model’s live token stream diverges from the anticipated rendered prompt, the server can rewrite the live checkpoint or revert to a disk KV snapshot and replay the suffix to maintain transcript fidelity.
- Deterministic decoding is applied to DSML syntax (tags, parameter headers, JSON punctuation) to ensure DSML tooling remains parseable, while payloads (including file bodies) may still reflect the model’s normal sampling dynamics.
- Cross-origin and deployment options:
- For browser clients served from a different origin, the server can be launched with --cors to emit Access-Control-Allow-* headers.
- Use --host 0.0.0.0 to allow external machines to connect, if desired.
Tool Call Handling and Canonicalization: A Key to Reproducibility Canonicalization and exact replay are central to DS4’s reliability in multi-client or long-running use cases. The server action plan includes:
- Exact replay safety: Each tool call gets a unique API tool ID; the server remembers the exact DSML block associated with that ID in an in-memory map, backed by radix trees for efficient lookups.
- Replay on next turn: When a client re-sends a tool call, the server replays the exact DSML block to synchronize with the model’s prior sampling.
- Disk-backed replay memory: The map can be persisted into the KV cache to survive server restarts and session transitions.
- Deterministic fallbacks: If exact DSML replay is disabled or the block is missing, the server renders a deterministic DSML block from the JSON tool object, preserving transcript alignment with the client’s perspective.
Agent-Client Usage: Integrating with OpenAI/Codex/Claude-style Clients DS4-server can be consumed by local coding agents that implement OpenAI-compatible chat completions. The recommended workflow is:
- Start the server with a sensible context window:
- sh ./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
- Context sizing and memory considerations:
- A 1M token context can demand significant RAM (roughly 26 GB for the compressed indexer alone).
- For RAM-constrained machines (e.g., 128 GB or less), a 100k–300k token context is a pragmatic target, with 2-bit quantization helping stay within memory budgets.
- Integrating OpenAI-based clients:
- opencode integration via ~/.config/opencode/opencode.json, providing base URLs and model entries for deepseek-v4-flash with generous context and output budgets.
- Pi integration via ~/.pi/agent/models.json and related settings to wire up the local DS4 model into a Pi agent environment.
- Codex CLI integration using the Responses wire API with a ds4 base URL, enabling a local, OpenAI-compatible Codex environment.
- Claude Code integration via a wrapper around Anthropic-style endpoints, enabling local Claude Code usage with the DS4 local model.
- Thinking modes and controls:
- DeepSeek V4 Flash supports thinking, non-thinking, and Think Max modes. The server defaults to thinking mode, with Think Max requested via reasoning_effort to scale thinking depth for very large contexts.
- Direct responses can be issued by using thinking: {"type":"disabled"} or thinking-off aliases like deepseek-chat.
Inference, Debugging, and Validation
- Thinking modes and controls:
- The thinking mode structure helps manage the model’s reasoning process. In thinking mode, reasoning is streamed separately from text output, enabling a more interpretable and controllable generation flow.
- Debugging aids:
- dump-tokens and dump-logprobs tooling help diagnose tokenization, prompt rendering, and logit behavior.
- The trace facility in ds4-server records prompts, cache decisions, generated text, and tool-parser events across sessions, aiding post-mortem debugging.
- Validation and testing:
- ds4-bench measures instantaneous prefill and generation throughput at context frontiers, using a fixed token sequence to examine performance at escalating context sizes.
- ds4-eval provides a real-model integration benchmark with embedded questions, a TUI, and a per-question report, focusing on a mix of scientific, mathematical, and security-related tasks. The evaluation suite is designed to stress both thinking and non-thinking capabilities, with a mixture of GPQA Diamond, SuperGPQA, AIME 2025, and COMPSEC-style questions.
Test Vectors and Vectors-Replay
- Test vectors live in tests/test-vectors and are generated by ds4 with -dump-logprobs settings. They enable precise, token-level comparisons against official DeepSeek V4 Flash continuations.
- The test suite is driven by the C runner (make test), with ds4_test and logprob-vectors capturing regression-sensitive information to guard against tokenizer or attention regressions.
CLI: One-Shot and Interactive Prompts
- One-shot prompt examples:
- sh ./ds4 -p "Explain Redis streams in one paragraph."
- Interactive CLI:
- Without -p, the ds4 CLI enters an interactive prompt session (ds4>), maintaining a live transcript and a KV checkpoint visualization for each turn.
- Useful commands include /help, /think, /think-max, /nothink, /ctx N, /read FILE, /quit.
- The CLI defaults to thinking mode, with options to disable thinking as needed.
Disk KV Cache: A Robust Persistence Layer Disk-based KV caching enables session continuity across restarts and cross-session reuse of prefixes, dramatically improving the user experience for long-form conversations and recurring tasks. Key features:
- A single live in-memory KV cache tracks the active session; on-disk KV stores prefixes that can be reloaded later for session resumption.
- Cache keys are SHA1 hashes of the rendered byte prefixes; the cache stores both the rendered text and the associated DS4 session payload for precise replay.
- The on-disk KVC file format includes:
- A fixed header with fields like magic string KVC, version, quant bits, save reason, extension flags, context size, and creation/unix times.
- Rendered text bytes and optional tool-id maps that connect tool IDs to exact DSML blocks.
- A DS4 session payload, which describes the model state, including:
- Token IDs, logits, layer counts, and KV contents per layer.
- Live KV rows, both raw and compressed (including ratio-4 compressor data for some layers).
- The layout is designed to be portable only across compatible ds4.c builds and model layout.
- The tool-id map section in KVC files provides exact mappings from tool IDs to the precise DSML blocks the model sampled, enabling exact replay during restart and client resends.
- Replays and cache management:
- A KVC hit triggers restoration of the session payload first, followed by loading the tool-id map if present.
- The server can load only the relevant tool mappings by scanning cache files, improving performance and resilience in restart scenarios.
- The cache includes markers for different states (cold, continued, evict, shutdown) and uses alignment rules to minimize token retokenization issues.
- Cache controls:
- Parameters like --kv-cache-min-tokens, --kv-cache-cold-max-tokens, --kv-cache-continued-interval-tokens, and token-boundary alignment controls allow tuning of how aggressively the system persists and resumes state.
- Reuse considerations:
- By default, checkpoints may be reused across 2-bit and 4-bit routed-expert variants if the rendered prefix matches. A strict reuse policy can be enforced with --kv-cache-reject-different-quant.
Backends: Graph Engines and Hardware Targets
- Native graph backends:
- Metal on macOS is the default target for speed and efficiency.
- CUDA on Linux provides a high-performance GPU path (DGX Spark / GB10 configurations available via make cuda-spark).
- CPU path exists for correctness checks and diagnostics; it is not intended for production inference due to performance constraints.
- How to select backends:
- macOS: ds4 -p "Hello" --metal
- Linux CUDA: ds4 -p "Hello" --cuda
- Linux CUDA (DGX Spark): make cuda-spark
- Linux CUDA (generic): make cuda-generic
- CPU-only: ds4 -p "Hello" --cpu
Steering: Controlling Model Behavior with Vector Directions DS4 supports steering with single-vector activation directions. The steering framework, found in the dir-steering directory, follows the core idea of Refusal in Language Models Is Mediated by a Single Direction (as cited in the project documentation). Steering can influence:
- Verbosity: Make the model more or less verbose.
- Task focus: Suppress or emphasize certain types of queries (e.g., programming questions in a chatbot for a car rental site).
- Speed: steer towards faster responses at the cost of some depth.
- Cybersecurity posture: reduce willingness to provide dual-use or offensive guidance, useful for safety research and hardened deployments. This steering capability is designed to be fast and lightweight compared to model fine-tuning.
Test Vectors and Debugging Toolkit
- Test vectors and test harnesses live under tests/test-vectors and a dedicated ds4_test runner.
- The test suite is designed to catch regressions early, especially around tokenizer boundaries and attention behavior.
- Debugging toolchain includes:
- ds4 --dump-tokens for exact token-level views of the input prompt.
- ds4 --dump-logprobs for log-probability distributions and top-k sampling views.
- ds4-server --trace to capture a complete session trace for offline analysis.
Acknowledgments and Community DS4 openly acknowledges the broader inference-engine ecosystem and the influence of the llama.cpp and GGML communities. It emphasizes that the project would not exist without the foundational work, kernels, and formats from these projects, and it encourages readers to consult the linked repositories for deeper context and related tooling (GGUF, imatrix, and related calibration and testing pipelines).
Conclusion: A Focused, End-to-End Local Inference Experience DwarfStar 4 represents a focused effort to deliver a finished, end-to-end local inference experience tailored to DeepSeek V4 Flash. By combining a dedicated inference engine, a robust disk-based KV cache, a native coding agent, and a complete server API with OpenAI/Anthropic compatibility, DS4 aims to make high-end local inference practical on personal hardware. Its design choices—a compact, model-specific engine; explicit disk-backed KV state; exact replay for tool calls; and a strong emphasis on practical testing and validation—reflect a belief that local, self-contained inference can be both powerful and maintainable.
If you are exploring DeepSeek V4 Flash on a MacBook, a Mac Studio, or a Linux workstation with ample RAM, DS4 offers a coherent path to leverage the model’s large context window, efficient 2-bit quantization, dense English and Italian generation capabilities, and the practical benefits of a local, end-to-end inference stack. The project’s ongoing work and beta-quality status invite community collaboration: users are encouraged to run tests, provide traces, contribute to the packaging and tooling, and help push toward a stable, production-ready experience that remains faithful to the model’s strengths and the engineering ethos of a purpose-built, local AI stack.
Enjoying this project?
Discover more amazing open-source projects on TechLogHub. We curate the best developer tools and projects.
Repository:https://github.com/antirez/ds4
GitHub - antirez/ds4: DwarfStar 4: A Native Inference Engine for DeepSeek V4 Flash
DwarfStar 4 (DS4) is a compact, purpose-built native inference engine tailored specifically for DeepSeek V4 Flash....
github - antirez/ds4