Apache 2.0
April 19, 2026 at 12:01 PM
BarraCUDA: Open-Source CUDA C++ Compiler
@Zaneham (Project Author)
- Overview
- BarraCUDA is described as an open-source CUDA C++ compiler, written from scratch in C99, designed to convert .cu CUDA source files into multiple target formats including AMD RDNA (AMD GPU binaries), NVIDIA PTX, and Tenstorrent Tensix C++, with ambitions for additional architectures.
- It explicitly states no LLVM dependency, no external dependencies, and no permission gating. The project frames itself as a response to perceived restrictions within NVIDIA’s ecosystem, highlighting the intention to create a self-contained toolchain.
- The emphasis is on practicality and independence: BarraCUDA implements its own instruction encoding rather than relying on an existing toolchain. The Changelog tracks recent progress.
- Visual Diagram (ASCII Illustration)
- The README includes an ASCII diagram titled BarraCUDA Pipeline showing the compiler’s internal stages and backend targets. It traces the flow from source input through preprocessing, lexing, parsing, semantic analysis, the intermediate representation, register promotion, and instruction selection to backend-specific generation for AMD, NVIDIA, or Tenstorrent. The BarraCUDA IR (BIR) sits between parsing and code emission, with a separate path to each backend.
- What It Does
- BarraCUDA accepts CUDA C source code (.cu files) and aims to produce target-specific output for several GPUs and accelerators:
- AMD RDNA 2/3/4 binaries (GFX architectures)
- NVIDIA PTX text
- Tenstorrent Tensix Metalium C++
- The pipeline is designed around a compact, C99-based implementation. It includes:
- Preprocessing (handling includes, defines, macros, and conditionals)
- Lexing (tokenization)
- Parsing (Recursive Descent)
- Semantic analysis (type checking and scope resolution)
- BarraCUDA IR (BIR) in SSA form with typed instructions
- mem2reg pass to promote allocas to SSA registers
- Instruction selection across the three backends (AMD, NVIDIA, Tenstorrent)
- Backend-specific emission: VGPR/SGPR register allocation, isel and emission for PTX, Tensix SFPU isel, and final binary encoding
- Final outputs: .hsaco (AMD), .ptx (NVIDIA), and Tensix C++ sources for Tenstorrent
- The diagram also notes the path to output and loader behavior, including how the GPU-specific binaries are produced and prepared for runtime.
- The project emphasizes that it does not rely on LLVM and handles its own instruction encoding.
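- The mem2reg step listed above can be illustrated with a minimal sketch in C99. The instruction layout, the opcode names, and the function mem2reg_block are hypothetical and are not taken from BarraCUDA's BIR. The sketch only handles one straight-line block in which every slot is stored before it is loaded; a real pass also has to place phi nodes across branches and skip allocas whose address escapes.

```c
#include <assert.h>

/* Hypothetical mini-IR for illustration; not BarraCUDA's actual BIR layout. */
typedef enum { OP_CONST, OP_ALLOCA, OP_STORE, OP_LOAD, OP_ADD } Op;

typedef struct {
    Op  op;
    int a;    /* OP_CONST: literal; OP_STORE/OP_LOAD: alloca id; OP_ADD: lhs value id */
    int b;    /* OP_STORE: stored value id; OP_ADD: rhs value id */
    int dead; /* set to 1 once the pass has made the instruction redundant */
} Inst;

enum { MAX_INST = 64 };

/* Follow forwarding links until we reach a value that was not replaced. */
static int resolve(const int *fwd, int id)
{
    while (fwd[id] != id)
        id = fwd[id];
    return id;
}

/*
 * Promote every alloca in one straight-line block.
 * Instruction i defines value id i. Returns the number of instructions
 * (allocas, stores, loads) that became redundant.
 */
static int mem2reg_block(Inst *code, int n)
{
    int fwd[MAX_INST];     /* value id -> replacement value id */
    int current[MAX_INST]; /* alloca id -> value id most recently stored */
    int removed = 0;

    assert(n >= 0 && n <= MAX_INST);
    for (int i = 0; i < n; i++) {
        fwd[i] = i;
        current[i] = -1;
    }

    for (int i = 0; i < n; i++) {
        Inst *in = &code[i];
        switch (in->op) {
        case OP_STORE:
            assert(in->a >= 0 && in->a < n);
            current[in->a] = resolve(fwd, in->b); /* remember the stored value */
            in->dead = 1;
            removed++;
            break;
        case OP_LOAD:
            assert(in->a >= 0 && in->a < n);
            assert(current[in->a] >= 0); /* load before store is out of scope */
            fwd[i] = current[in->a];     /* users of the load read the value directly */
            in->dead = 1;
            removed++;
            break;
        case OP_ALLOCA:
            in->dead = 1;
            removed++;
            break;
        case OP_ADD:
            in->a = resolve(fwd, in->a);
            in->b = resolve(fwd, in->b);
            break;
        case OP_CONST:
            break;
        }
    }
    return removed;
}
```

- After the pass, an add that used to read a loaded value refers to the originally stored value instead, and the alloca, store, and load are marked dead for a later cleanup step.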
- Building
- BarraCUDA is promoted as a lightweight build with minimal requirements:
- It is written in C99 and builds with a standard C toolchain
- The build system provides a single command: make
- There are no dependencies beyond a working C99 compiler
- The project states explicitly that there is no CMake, no Autoconf, and no long multi-step build process
- A bit of humor is injected: if the build doesn’t work, it’s suggested that the GCC toolchain is the problem, not the Makefile
- Practical requirements summarized:
- A C99 compiler (e.g., gcc, clang)
- A willingness to live with the project’s dependency-free design
- Note: LLVM is not required; BarraCUDA implements its own instruction encoding
- Usage
- BarraCUDA provides multiple invocation modes to target different backends:
- Compile to AMD GPU binaries (RDNA 3 by default)
- Example: ./barracuda --amdgpu-bin kernel.cu -o kernel.hsaco
- Compile for RDNA 2
- Example: ./barracuda --amdgpu-bin --gfx1030 kernel.cu -o kernel.hsaco
- Compile for RDNA 4
- Example: ./barracuda --amdgpu-bin --gfx1200 kernel.cu -o kernel.hsaco
- Compile to NVIDIA PTX
- Example: ./barracuda --nvidia-ptx kernel.cu -o kernel.ptx
- Compile to Tenstorrent Tensix
- Example: ./barracuda --tensix kernel.cu -o kernel_compute.cpp
- Dump the IR (for debugging or curiosity)
- Example: ./barracuda --ir kernel.cu
- Just parse and dump the AST
- Example: ./barracuda --ast kernel.cu
- Run semantic analysis
- Example: ./barracuda --sema kernel.cu
- Multilingual error messages
- Example: ./barracuda --lang lang/mi.txt --amdgpu-bin kernel.cu -o kernel.hsaco
- These usage patterns illustrate a flexible toolchain that supports both code generation and introspection of the compilation process, as well as localized error reporting.
- Runtime Launcher
- BarraCUDA ships with a minimal HSA runtime (src/runtime/) intended for dispatching compiled kernels onto real AMD hardware:
- There is zero compile-time dependency on ROCm; instead, it loads libhsa-runtime64.so at runtime using dlopen
- A compact example command sequence shows compiling the runtime and a launcher together:
- gcc -std=c99 -O2 -I src/runtime examples/launch_saxpy.c src/runtime/bc_runtime.c -ldl -lm -o launch_saxpy
- ./barracuda --amdgpu-bin -o test.hsaco tests/canonical.cu
- ./launch_saxpy test.hsaco
- Running on AMD hardware therefore still requires ROCm on the host: the launcher has no build-time dependency on it, but libhsa-runtime64.so must be present when dlopen is called. The example targets Linux and walks through compiling the launcher, compiling a kernel, and running it.
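- The runtime-loading approach described above can be sketched in C as follows. This is not BarraCUDA's launcher code: the helper names load_symbol and hsa_runtime_available are hypothetical. The only facts used are the library name from the text, the standard dlopen/dlsym/dlclose calls, and that the HSA runtime exports a function named hsa_init. The sketch only checks that the symbol resolves and does not call it.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Try to load a shared library and look up one symbol in it.
 * Returns the symbol address, or NULL if either step fails.
 * On success *handle_out holds the handle for a later dlclose().
 */
static void *load_symbol(const char *lib, const char *sym, void **handle_out)
{
    assert(lib != NULL && sym != NULL && handle_out != NULL);

    *handle_out = dlopen(lib, RTLD_NOW | RTLD_LOCAL);
    if (*handle_out == NULL)
        return NULL; /* library not installed or not on the loader path */

    void *addr = dlsym(*handle_out, sym);
    if (addr == NULL) {
        dlclose(*handle_out); /* library found, but symbol missing */
        *handle_out = NULL;
    }
    return addr;
}

/* Returns 1 if the HSA runtime can be loaded and exports hsa_init, else 0. */
static int hsa_runtime_available(void)
{
    void *handle = NULL;
    void *addr = load_symbol("libhsa-runtime64.so", "hsa_init", &handle);
    if (addr == NULL)
        return 0;
    dlclose(handle);
    return 1;
}
```

- Because the library is resolved only when the program runs, a binary built this way compiles on a machine without ROCm and can report a clear error instead of failing to link. On older glibc versions it needs -ldl, as in the gcc command above.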
- What Works
- BarraCUDA asserts support for a broad set of CUDA features and constructs, with working results on multiple architectures:
- Core language features:
- Function qualifiers: __global__, __device__, __host__
- Built-ins: threadIdx, blockIdx, blockDim, gridDim
- Structs, enums, typedefs, namespaces
- Pointers, arrays, pointer arithmetic
- Full C control flow constructs: if/else, for, while, do-while, switch/case, goto/label
- Short-circuit logical operators (&&, ||)
- Ternary operator
- Templates (basic instantiation)
- Multiple return paths, continue, break
- CUDA features:
- __shared__ memory and LDS-backed allocation
- __syncthreads() mapped to s_barrier
- Atomic operations: atomicAdd, atomicSub, atomicMin, atomicMax, atomicExch, atomicCAS, atomicAnd, atomicOr, atomicXor
- Warp intrinsics: __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync
- Warp votes: __ballot_sync, __any_sync, __all_sync
- Vector types: float2/float3/float4, int2/int3/int4 with component accessors
- Half precision: __half, __float2half(), __half2float(), __nv_bfloat16
- __launch_bounds__ (parsed, propagated, enforces VGPR caps)
- Cooperative groups: cooperative_groups::this_thread_block().sync(), .thread_rank(), .size()
- Operator overloading, math builtins (sqrtf, rsqrtf, expf, exp2f, logf, log2f, log10f, sinf, cosf, tanf, tanhf, powf, fabsf, floorf, ceilf, truncf, roundf, rintf, fmaxf, fminf, fmodf, copysignf)
- __constant__ memory and __device__ globals
- Compiler features:
- Full C preprocessor (include, define/undef, function-like macros, #ifdef/#ifndef/#if/#elif/#else/#endif, #pragma, #error, -I/-D flags)
- Error recovery that reports multiple errors without cascading
- Multilingual error messages via --lang, with language-neutral error codes
- Source location tracking in IR dumps
- Struct pass-by-value semantics
- The example provided compiles a vector_add kernel to a .hsaco file, a straightforward workflow that does not need LLVM.
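- To make the warp intrinsics in the list above concrete, here is a CPU-side model in C of what __shfl_xor_sync does when all lanes participate, and of the butterfly reduction it is commonly used for. It is an illustration of the CUDA semantics only, not BarraCUDA code. The names shfl_xor_all and warp_sum are hypothetical, and a 32-lane warp is assumed (AMD Wave64 targets would use 64).

```c
#include <assert.h>

enum { WARP_SIZE = 32 }; /* assumption: 32 lanes per warp */

/*
 * Model one __shfl_xor_sync step with every lane active:
 * lane i reads the value held by lane (i ^ lane_mask).
 */
static void shfl_xor_all(const int *in, int *out, int lane_mask)
{
    assert(lane_mask >= 0 && lane_mask < WARP_SIZE);
    for (int lane = 0; lane < WARP_SIZE; lane++)
        out[lane] = in[lane ^ lane_mask];
}

/*
 * Butterfly reduction: the exchange distance halves each step
 * (16, 8, 4, 2, 1), and after the five steps every lane holds the
 * sum of all 32 inputs.
 */
static void warp_sum(int *val)
{
    int other[WARP_SIZE];
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        shfl_xor_all(val, other, offset);
        for (int lane = 0; lane < WARP_SIZE; lane++)
            val[lane] += other[lane];
    }
}
```

- On a GPU the inner loops disappear: each lane executes one shuffle and one add per step, and no shared memory is needed for the exchange.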
- Example
- The canonical CUDA example is a vector addition kernel:
- Kernel code example:
- __global__ void vector_add(float *c, float *a, float *b, int n) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < n) c[idx] = a[idx] + b[idx]; }
- A sample execution flow shows compiling to AMD binary and producing a .hsaco file:
- Command: ./barracuda --amdgpu-bin vector_add.cu -o vector_add.hsaco
- Result: wrote vector_add.hsaco (528 bytes code, 1 kernels)
- The absence of an LLVM requirement is emphasized as a design feature.
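- The index computation in the kernel above can be checked against a serial CPU model. The sketch below is plain C, not generated or host code from BarraCUDA. The function names vector_add_reference and blocks_for are hypothetical, and the grid is assumed to be one-dimensional, as in the kernel.

```c
#include <assert.h>

/*
 * Serial CPU model of a 1-D launch: the two loops stand in for
 * blockIdx.x and threadIdx.x, and the body repeats the kernel's
 * own index computation and bounds guard.
 */
static void vector_add_reference(float *c, const float *a, const float *b,
                                 int n, int grid_dim, int block_dim)
{
    assert(grid_dim > 0 && block_dim > 0);
    for (int block_idx = 0; block_idx < grid_dim; block_idx++) {
        for (int thread_idx = 0; thread_idx < block_dim; thread_idx++) {
            int idx = thread_idx + block_idx * block_dim;
            if (idx < n) /* threads past the end of the array do nothing */
                c[idx] = a[idx] + b[idx];
        }
    }
}

/* Smallest number of blocks that covers n elements: ceil(n / block_dim). */
static int blocks_for(int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}
```

- With n = 10 and a block size of 4, blocks_for returns 3, so 12 threads are modeled and the last two are masked off by the idx < n guard. This is why the guard is needed whenever n is not a multiple of the block size.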
- Validated on Hardware
- BarraCUDA-compiled kernels have been tested on real hardware:
- AMD MI300X (CDNA3, GFX942) — 8/8 test kernels passing; Monte Carlo neutron transport achieving correct physics (k_eff = 0.995)
- AMD RDNA3 (GFX1100) — Full test suite passing via RDNA3 emulator CI
- NVIDIA RTX 4060 Ti — PTX backend, loaded via CUDA Driver API, JIT-compiled by the NVIDIA driver; Monte Carlo neutron transport benchmark shows speedup over CPU
- Tenstorrent Blackhole — Compiles to valid Metalium C++; hardware validation pending dev kit access
- The results underscore cross-vendor ambition and practical verification against reference physics benchmarks.
- What Doesn’t Work (Yet)
- The project is transparent about current gaps and ongoing work:
- Parameter reassignment in device functions (workaround: copy the parameter into a local variable)
- Textures and surfaces
- Dynamic parallelism (device-side kernel launches)
- Multiple translation units
- Host code generation (only device code is compiled)
- The note clarifies that these limitations are not fundamental blockers but items yet to be implemented.
- Test Suite
- A modest but meaningful test corpus accompanies the project:
- 14 test files, 35+ kernels, ~1,700 BIR instructions, ~27,000 bytes of machine code
- Examples include:
- vector_add.cu (the hello world kernel)
- cudafeatures.cu (atomics, warp ops, barriers, control flow constructs)
- testtier12.cu (vectors, shared memory, operator overloading)
- notgpt.cu (sarcastic AI-generated CUDA with diverse features)
- stress.cu (N-body simulation, complex control flows, bit manipulation)
- canonical.cu (NVIDIA sample patterns adapted for the parser)
- testerrors.cu (deliberate syntax errors for error recovery)
- testlaunchbounds.cu (__launch_bounds__ parsing and VGPR cap enforcement)
- testcoop_groups.cu (cooperative groups lowering)
- mymathhomework.cu (trig identities, Newton-Raphson, etc.)
- Additional tests for the preprocessor, templates, and unsigned integer handling
- The curated test set demonstrates a breadth of CUDA features and parser/lowerer coverage.
- Roadmap
- Near Term: Hardening
- Targeted fixes for known gaps (integer literal suffixes, const correctness, parameter reassignment)
- Objective: compile real-world .cu files without source-level modifications
- Medium Term: Optimization
- Current progress includes instruction scheduling, constant folding, dead code elimination, and divergence-aware SSA register allocation
- Priorities include loop-invariant code motion and occupancy tuning based on register pressure
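- Two of the optimizations named above, constant folding and dead code elimination, can be sketched on a toy SSA-style instruction array. The types and functions here (FoldInst, fold_constants, mark_dead) are hypothetical and do not reflect BarraCUDA's implementation. The sketch assumes operands always refer to earlier instructions and ignores integer overflow.

```c
#include <assert.h>

/* Hypothetical toy IR for illustration; instruction i defines value id i. */
typedef enum { F_CONST, F_PARAM, F_ADD, F_MUL } FoldOp;

typedef struct {
    FoldOp op;
    int a; /* F_CONST: literal; F_PARAM: parameter index; else lhs value id */
    int b; /* F_ADD/F_MUL: rhs value id */
} FoldInst;

/*
 * Rewrite every ADD/MUL whose operands are both constants into a CONST.
 * One forward sweep is enough because operands precede their users, so
 * a folded result is already a CONST when a later instruction reads it.
 * Returns the number of instructions folded.
 */
static int fold_constants(FoldInst *code, int n)
{
    int folded = 0;
    for (int i = 0; i < n; i++) {
        FoldInst *in = &code[i];
        if (in->op != F_ADD && in->op != F_MUL)
            continue;
        assert(in->a >= 0 && in->a < i && in->b >= 0 && in->b < i);
        const FoldInst *l = &code[in->a];
        const FoldInst *r = &code[in->b];
        if (l->op != F_CONST || r->op != F_CONST)
            continue;
        int v = (in->op == F_ADD) ? l->a + r->a : l->a * r->a;
        in->op = F_CONST;
        in->a = v;
        in->b = 0;
        folded++;
    }
    return folded;
}

/*
 * Mark the instructions that the root value depends on; the rest are dead.
 * live[] must have n entries. Returns the number of dead instructions.
 */
static int mark_dead(const FoldInst *code, int n, int root, int *live)
{
    assert(root >= 0 && root < n);
    for (int i = 0; i < n; i++)
        live[i] = 0;
    live[root] = 1;
    for (int i = root; i >= 0; i--) { /* backward sweep: users before operands */
        if (!live[i])
            continue;
        if (code[i].op == F_ADD || code[i].op == F_MUL) {
            live[code[i].a] = 1;
            live[code[i].b] = 1;
        }
    }
    int dead = 0;
    for (int i = 0; i < n; i++)
        dead += !live[i];
    return dead;
}
```

- The two passes reinforce each other: once 2 * 3 is folded to the constant 6, the instructions that produced 2 and 3 have no remaining users and the liveness sweep reports them as dead.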
- Long Term: More Architectures
- The IR (BIR) is architecture-agnostic, with a clean backend separation
- Adding a new target requires implementing a new isel and emit pair
- Proposed expansions:
- NVIDIA PTX (already done)
- Tenstorrent Tensix (already done)
- Intel Arc (Xe)
- RISC-V Vector Extension
- The roadmap conveys a pragmatic, staged approach to expansion and optimization.
- Contributing
- The project welcomes issues and pull requests in any language, with a requirement to include an English translation alongside.
- See CONTRIBUTING.md for guidelines on style, naming, and contribution processes.
- The project notes that some identifiers (HLASM-style short labels like ragc, mkhash, enc_vop3) are culturally neutral by accident, underscoring a language-agnostic naming approach.
- The contributor invitation emphasizes inclusive collaboration and the value of diverse linguistic backgrounds in debugging and improvement.
- Changelog
- A running log illustrates the project’s rapid development through 2026:
- 2026-03-18 — NVIDIA PTX backend (--nvidia-ptx): CUDA to PTX text, loaded via CUDA Driver API and JIT-compiled by NVIDIA; validated on RTX 4060 Ti; anonymous struct/union support added in parser, sema, and lowerer
- 2026-03-14 — Divergence-aware SSA register allocator (--ssa-ra): dramatically reduces VGPR spills on Monte Carlo kernels; substantial performance and memory savings; ~1,300 lines of C99
- 2026-03-09 — Post-isel verification pass (bc_vfy): validates instructions post-isel and post-RA; identified and fixed multiple encoding issues; improved diagnostics with post-mortem-style tooling
- 2026-03-08 — Error localization infrastructure: language-neutral IDs (E001–E111), external translation files, unified error structures
- 2026-03-05 — CDNA 3 additions: GFX942 hardening, MFMA, Wave64 divergence, Tinygrad compatibility; 8/8 tests on MI300X
- 2026-03-05 — Instruction scheduling
- 2026-03-03 — CDNA 2 support (--gfx90a, MI250) and Tinygrad compatibility
- 2026-02-28 — Tenstorrent Tensix backend (--tensix): CUDA to TT-Metalium C++; constant folding; dead code elimination
- 2026-02-25 — HSA runtime launcher; RDNA 2 support (--gfx1030); test suite
- 2026-02-20 — RDNA 4 support (--gfx1200)
- 2026-02-16 — Initial release: CUDA compiler targeting AMD RDNA 3 (gfx1100)
- The changelog documents a trajectory of backend expansion, correctness improvements, and runtime tooling enhancements.
- Contact
- If you encounter bugs or wish to discuss details of AMDGPU instruction encoding, the project provides contact options:
- Email: zanehambly@gmail.com
- Open an issue for discussion or collaboration
- The author identifies as based in New Zealand and emphasizes a candid, informal stance about development progress and collaboration.
- License
- BarraCUDA is released under the Apache 2.0 license, enabling broad use, modification, and distribution rights.
- Acknowledgements
- The project expresses gratitude to several groups and individuals:
- Fernando Magno Quintão Pereira and the Compilers Lab at UFMG for guidance and divergence analysis ideas
- The broader academic community (Cooper, Harvey & Kennedy; Braun & Hack; Sampaio, Souza, Collange & Pereira) for foundational compiler ideas
- Steven Muchnick for Advanced Compiler Design and Implementation as an influential reference
- Low Level for educational content that helped in learning C
- Abe Kornelis for teaching and the z390 Portable Mainframe Assembler project inspiration
- The supportive community for feedback and encouragement
- The author’s family for personal support
Repository: https://github.com/Zaneham/BarraCUDA