cuda-oxide

cuda-oxide: A Rust-native CUDA kernel compiler backend

Introduction: a new path for GPU programming in Rust

cuda-oxide is a custom rustc backend designed for compiling GPU kernels in pure Rust. It represents a bold approach to GPU development: host and device code can live together in a single file and be built with a single cargo oxide build command. The project blends multiple layers into a cohesive workflow—from the Rust language down to the NVPTX code that runs on CUDA GPUs.
The overarching goal is to provide a safe, ergonomic, and high-performance path for Rustaceans to author SIMT kernels without relying on external DSLs or bindings to other languages. The project is intentionally experimental and in alpha. Expect bugs, evolving APIs, and ongoing refinements as the team experiments with the architecture, optimizations, and user feedback.

Project Overview: what cuda-oxide brings to the table

A unified single-source compilation model
Host and device code share the same source file, enabling seamless intermixing and easier reasoning about data movement and kernel logic.
A straightforward workflow uses cargo oxide build to produce both host and device artifacts in a single, cohesive pipeline.
A rustc codegen backend that targets CUDA PTX
Functions annotated with a #[kernel] attribute are lowered to PTX, enabling CUDA execution on NVIDIA GPUs.
The system demonstrates how high-level Rust abstractions translate into GPU-executable constructs.
Rich device-side abstractions
Type-safe indexing, shared memory, and barriers to synchronize work within and across thread blocks.
Advanced features such as TMA (Tensor Memory Access), warp and cluster level operations, and scoped atomics to support modern CUDA hardware.
A host-side runtime for memory management and kernel launching
Safe wrappers around context, streams, device buffers, and pinned host transfers.
A runtime that makes it feasible to manage data movement, stream synchronization, and kernel launches in Rust.
A Rust-native compilation pipeline powered by Pliron
The pipeline traverses Rust MIR, passes through a dialect representation (dialect-mir), flows into Pliron IR, then LLVM IR, and finally emits PTX.
Pliron is an MLIR-like IR framework in Rust, enabling modular and extensible transformations that map Rust semantics to GPU instructions.
Cross-crate kernels and device-side LTO
Device-side LTO enables cross-crate optimizations and better code reuse for kernels.
A pathway for device FFI and interop between Rust and compiled device code.

Project Status: an active research and development project

cuda-oxide is an experimental compiler that shows how CUDA SIMT kernels can be authored natively in pure Rust, without DSLs or foreign bindings.
The project operates in an ongoing alpha phase, with frequent changes as core ideas are tested and refined.
Contributions are welcome. If you’re interested in contributing, the project provides clear guidelines and a path to participate via the CONTRIBUTING.md document.

Quick Start: getting hands-on with cuda-oxide

A practical entry point is to compile and run a kernel using a single command that demonstrates the end-to-end path from Rust to PTX to GPU execution.
The basic flow involves defining a kernel, preparing host data, compiling via cargo oxide, launching the kernel, and retrieving results from the device.
For asynchronous workflows, the CUDA ecosystem and the cuda-async crate enable composing operations that can be synchronized or awaited as needed, enabling a modern approach to GPU work.

Setup: prerequisites and installation

Requirements
cargo-oxide: a cargo subcommand that drives the build pipeline (commands like cargo oxide run, build, debug, etc.).
Rust nightly with rust-src and rustc-dev components (pinned in rust-toolchain.toml).
CUDA Toolkit (12.x or newer) for device execution support.
LLVM 21+ with a NVPTX backend; llc must be in your PATH. This is important because newer CUDA targets and hardware require the latest backend capabilities.
Clang plus libclang development headers (clang-21 and libclang-dev) for the host cuda-bindings crate. bindgen depends on the full clang toolchain to generate FFI code.
Linux (Ubuntu 24.04 tested; other distros may work with matching package names and versions).
Why LLVM 21? The project relies on intrinsics and codegen paths (such as TMA, tcgen05, and Warp/SM-level features) that are not reliably supported by LLVM 20 or earlier. Hopper/Blackwell-era targets may require LLVM 21 to function correctly.
Dev Container option: The repository includes a devcontainer setup for a reproducible CUDA, LLVM, Clang, and Rust environment. This can simplify onboarding and ensure consistency across machines.

Install: a practical path to getting started

Inside the cuda-oxide repository, cargo oxide is available via a workspace alias and can be installed directly.
For external projects, you can install the tool globally:
cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide
Toolchain setup
Toolchain is installed via rust-toolchain.toml for automatic setup; if needed, you can manually install a nightly toolchain and the necessary components (rust-src, rustc-dev) for the chosen nightly version.
CUDA and LLVM setup
Ensure nvcc is accessible in your PATH.
Install LLVM 21 (or use the LLVM apt helper if your distro supports it) and verify the presence of llc-21 with an NVPTX target:
- llc-21 --version | grep nvptx
For Clang, install clang-21 or libclang-common-21-dev to fulfill host binding generation needs.
Verification: a quick health check
Run cargo oxide doctor to verify your Rust toolchain, CUDA toolkit, LLVM, and codegen backend are configured correctly.
Build and run a small example, such as vecadd, to confirm end-to-end operation: cargo oxide run vecadd

Verifying Installation: confirming a working toolchain

The verification flow typically includes:
cargo oxide doctor to validate toolchain health
cargo oxide run vecadd to compile a Rust kernel to PTX, launch it on the GPU, and verify results
A successful run prints a clear confirmation, such as “✓ SUCCESS: All 1024 elements correct!” indicating that the data produced on the device matches expectations.

Examples: a rich set of use cases

cuda-oxide ships with a broad collection of examples that showcase a variety of features and patterns:
vecadd: the canonical vector addition example, used to demonstrate the end-to-end path from Rust to PTX execution.
host_closure: generic kernels that illustrate passing closures from the host to the device.
generic: kernel templates that rely on monomorphization (e.g., scale) to demonstrate generic programming on the GPU.
gemm_sol: a GEMM (matrix multiplication) solution showcasing high-performance kernel organization and multiple phases.
tcgen05: Blackwell tensor cores (sm_100a) focusing on advanced memory and compute features like TMEM and MMA.
atomics: a suite of GPU atomic operations covering multiple types, scopes, and orderings.
cluster: thread-block clustering and dynamic DSMEM (distributed shared memory) ring exchanges, focusing on Hopper+ capabilities.
async_mlp: an asynchronous MLP pipeline that demonstrates concurrent computation across GEMM, matvec, and ReLU.
mathdxffitest: integration with cuFFTDx thread-level FFT and cuBLASDx block-level GEMM for advanced numerical workloads.
async_vecadd: asynchronous GPU execution using the cuda-async layer and DeviceOperation primitives.
crosscratekernel: demonstrating library crates that define kernels bundled into binaries.
Running examples
Typical commands follow the pattern: cargo oxide run vecadd or cargo oxide run gemm_sol
Crate overview: what you’ll find in the ecosystem
User-facing crates
- cuda-device: device intrinsics (thread, warp), barriers, and related primitives.
- cuda-host: host-side loading of modules, launching helpers, and LTOIR loading.
- cuda-macros: procedural macros for cudamodule, kernel, and gpuprintf inlining.
- cuda-bindings: raw FFI bindings to the CUDA C headers through bindgen.
- cuda-core: safe RAII wrappers around core runtime elements like CudaContext, CudaStream, DeviceBuffer, and PinnedHostBuffer.
- cuda-async: an asynchronous execution layer with DeviceOperation, DeviceFuture, and related types.
- libnvvm-sys: dlopen bindings to libNVVM used by the host’s LTOIR path.
- nvjitlink-sys: dlopen bindings to nvJitLink for device-side linking.
Compiler crates
- rustc-codegen-cuda: the custom rustc backend that emits CUDA code.
- mir-importer, mir-lower, dialect-mir, dialect-llvm, dialect-nvvm: a suite of crates that model and transform Rust MIR into a pipeline that targets LLVM and NVVM intrinsics.
Build tooling
- cargo-oxide: the cargo subcommand that drives the full toolchain (run, build, debug, etc.).
Documentation
- cuda-oxide-book: the project book (the main reference for compiler internals, kernel authoring, and API usage) built with Sphinx + MyST.
The roadmap includes LTOIR generation for Blackwell+ devices, device FFI interop via LTOIR, and deeper integration with MathDx and cuFFTDx ecosystems.

Architecture and workflow: how the pieces fit together

The architecture centers on a unified pipeline that bridges Rust code and CUDA execution. The journey begins with Rust code annotated for device compilation and ends with PTX that the NVIDIA toolchain can execute on the GPU.
A key architectural promise is to enable safe, expressive GPU kernels that leverage Rust’s type system, ownership model, and error handling semantics while still delivering the raw performance demanded by CUDA-level workloads.
The Pliron-based pipeline acts as the intermediary representation (IR) that enables robust lowering and optimization opportunities before emitting PTX. This approach allows more sophisticated transformations and potential cross-language interactions to be explored in future iterations.
Device-side abstractions and host-side runtimes are designed to be ergonomic without sacrificing control. Developers can rely on type-safe indexing and memory management primitives on the device while using high-level, RAII-style constructs on the host.

Documentation and learning resources

The primary reference is the cuda-oxide book, which documents:
SIMT kernel authoring in Rust
Synchronous and asynchronous GPU programming patterns
The compiler architecture and its internal representations
Practical examples and recipes for common workloads
The book is designed to be built and served locally. The repository provides a README that guides you through building and running the book locally, including instructions for a development environment that matches the project’s toolchain requirements.
Ecosystem and relationships with other Rust+GPU projects are discussed in the Ecosystem appendix, situating cuda-oxide within a broader effort to move GPU computing forward in Rust.

Ecosystem and collaboration

CUDA + Rust is a space with multiple efforts, each addressing different aspects of the problem—graphics pipelines, offloading strategies, language bindings, and code generation.
cuda-oxide is positioned to complement and interact with other Rust GPU initiatives, offering a native path for writing CUDA kernels in Rust and integrating with a Rust-first toolchain and ecosystem.
The project invites feedback, experimentation, and collaboration as part of its growth, with a clear emphasis on user experience, correctness, and performance.

Licensing and governance

Licensing information clarifies the open nature of the project and the licensing terms for its crates:
The cuda-bindings crate is licensed under the NVIDIA Software License: LICENSE-NVIDIA.
All other crates are licensed under the Apache License, Version 2.0: LICENSE-APACHE.
Clear licensing and governance help contributors understand how code from the project can be used in other contexts, and how contributions will be integrated.

What to expect as you experiment

Given its alpha status, you should anticipate API changes, refinements, and occasional breakage as the development team experiments with features, optimizations, and user feedback.
Early adopters can gain valuable insights by running examples, trying different workloads, and reporting anything that seems unclear, inefficient, or incorrect.
The project’s architecture is designed to welcome experimentation and iteration. The combination of a Rust-based kernel authoring experience, a robust host runtime, and a forward-looking IR pipeline provides a fertile ground for exploring GPU computing patterns entirely within Rust.

Conclusion: a forward-looking path for Rust GPU programming

cuda-oxide represents a bold step toward enabling native Rust programming for CUDA GPUs. By merging single-source compilation, a Rust-centric codegen backend, device-side abstractions, and a host runtime, the project offers a cohesive, end-to-end workflow for GPU kernels written in pure Rust.
While the project is in early stages, its ambition—end-to-end Rust-to-PTX, rich device abstractions, cross-crate kernel support, and an MLIR-like IR pipeline—paints a compelling vision for the future of GPU computing in the Rust ecosystem.
If you’re curious about pushing the boundaries of Rust on GPUs or contributing to a novel compiler backend, cuda-oxide invites you to explore, experiment, and contribute. The combined power of CUDA, LLVM, Rust, and an IR-driven pipeline holds the potential to unlock new paradigms for performance and safety in GPU programming.

Appendix: quick links and resources

Learn more about the project book and internal architecture at the cuda-oxide book (documented guides, internal API references, and tutorials).
Explore the ecosystem appendix to understand how cuda-oxide fits within the broader Rust+GPU landscape.
See LICENSE-NVIDIA and LICENSE-APACHE for licensing details.

Images and branding

The blog post begins with a row of project badges and the cuda-oxide logo, reinforcing the project identity and its continuous integration footprint. The badges (clippy, unit-tests, cargo-deny, CodeQL) reflect the automated checks that accompany development, while the logo anchors the brand visually for readers new to the project.

If you’d like, I can tailor this post to a particular audience (core contributors, potential users, or Rust enthusiasts) or expand any section with more in-depth explanations, diagrams, or example narratives.

Enjoying this project?

GitHub - NVlabs/cuda-oxide: cuda-oxide

Stay Updated

Product

Learn

Company

Legal

Stay Updated

Browse by Category