BarraCUDA - Open-Source CUDA C++ Compiler

Overview

BarraCUDA is described as an open-source CUDA C++ compiler, written from scratch in C99, designed to convert .cu CUDA source files into multiple target formats including AMD RDNA (AMD GPU binaries), NVIDIA PTX, and Tenstorrent Tensix C++, with ambitions for additional architectures.
It explicitly states no LLVM dependency, no external dependencies, and no permission gating. The project frames itself as a response to perceived restrictions within NVIDIA’s ecosystem, highlighting the intention to create a self-contained toolchain.
The core message emphasizes practicality and independence: it provides its own instruction encoding and lowers reliance on existing toolchains, and it invites readers to consult the Changelog for recent progress.

Visual Diagram (ASCII Illustration)

The document includes an ASCII diagram titled BarraCUDA Pipeline that shows the compiler's internal stages and backend targets. It traces the flow from source input through preprocessing, lexing, parsing, semantic analysis, an intermediate representation, register promotion, instruction selection, and backend-specific generation for AMD, NVIDIA, or Tenstorrent targets. The BarraCUDA IR (BIR) sits between parsing and code emission, with a separate path to each backend.

What It Does

BarraCUDA accepts CUDA C++ source code (.cu files) and aims to produce target-specific output for several GPUs and accelerators:
AMD RDNA 2/3/4 binaries (GFX architectures)
NVIDIA PTX text
Tenstorrent Tensix Metalium C++
The pipeline is designed around a compact, C99-based implementation. It includes:
Preprocessing (handling includes, defines, macros, and conditionals)
Lexing (tokenization)
Parsing (Recursive Descent)
Semantic analysis (type checking and scope resolution)
BarraCUDA IR (BIR) in SSA form with typed instructions
mem2reg pass to promote allocas to SSA registers
Instruction selection across the three backends (AMD, NVIDIA, Tenstorrent)
Backend-specific emission: VGPR/SGPR register allocation, isel and emission for PTX, Tensix SFPU isel, and final binary encoding
Final outputs: .hsaco (AMD), .ptx (NVIDIA), and Tensix C++ sources for Tenstorrent
The diagram also notes the path to output and loader behavior, including how the GPU-specific binaries are produced and prepared for runtime.
The project emphasizes that it does not rely on LLVM and handles its own instruction encoding.
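The mem2reg step listed above can be illustrated on a toy example. The following is a minimal sketch in C99, assuming a hypothetical five-opcode IR limited to a single basic block; the names inst, mem2reg_block, and the opcodes are invented for illustration and are not BarraCUDA's BIR. A real pass must also place phi nodes where control flow joins, which this sketch leaves out.

```c
#include <assert.h>

/* Hypothetical toy IR, not BarraCUDA's BIR.
 * Instruction i defines value %i; operands name earlier instructions. */
enum op { OP_CONST, OP_ALLOCA, OP_STORE, OP_LOAD, OP_ADD, OP_NOP };

struct inst {
    enum op op;
    int a;  /* CONST: literal; STORE/LOAD: alloca index; ADD: left operand  */
    int b;  /* STORE: value being stored; ADD: right operand                */
};

#define MAX_INSTS 64

/* Promote every alloca in one straight-line block to SSA values.
 * Loads are replaced by the most recently stored value, and the
 * alloca/store/load instructions become no-ops.
 * Returns the number of instructions removed. */
int mem2reg_block(struct inst *code, int n)
{
    int cur[MAX_INSTS];  /* cur[s]: value currently held by alloca s, or -1 */
    int fwd[MAX_INSTS];  /* fwd[i]: value that replaces uses of %i          */
    int removed = 0;

    assert(n >= 0 && n <= MAX_INSTS);
    for (int i = 0; i < n; i++) {
        cur[i] = -1;
        fwd[i] = i;
    }

    for (int i = 0; i < n; i++) {
        struct inst *in = &code[i];
        switch (in->op) {
        case OP_ALLOCA:
            in->op = OP_NOP;
            removed++;
            break;
        case OP_STORE:
            assert(in->a >= 0 && in->a < i && in->b >= 0 && in->b < i);
            cur[in->a] = fwd[in->b];    /* remember the stored value */
            in->op = OP_NOP;
            removed++;
            break;
        case OP_LOAD:
            assert(in->a >= 0 && in->a < i);
            assert(cur[in->a] >= 0);    /* no load before the first store */
            fwd[i] = cur[in->a];        /* later uses see the stored value */
            in->op = OP_NOP;
            removed++;
            break;
        case OP_ADD:
            in->a = fwd[in->a];         /* rewrite operands through fwd */
            in->b = fwd[in->b];
            break;
        default:
            break;
        }
    }
    return removed;
}
```

Given the sequence alloca, const 5, store, load, add (load, load), the pass removes the three memory instructions and leaves the add reading the constant directly.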

Building

BarraCUDA is promoted as a lightweight build with minimal requirements:
It is written in C99 and builds with a standard C toolchain
The build system provides a single command: make
There are no dependencies beyond a working C99 compiler
The project states explicitly that there is no CMake, no Autoconf, and no long multi-step build process
A bit of humor is injected: if the build doesn’t work, it’s suggested that the GCC toolchain is the problem, not the Makefile
Practical requirements summarized:
A C99 compiler (e.g., gcc, clang)
A willingness to live with the project’s dependency-free design
Note: LLVM is not required; BarraCUDA implements its own instruction encoding

Usage

BarraCUDA provides multiple invocation modes to target different backends:
Compile to AMD GPU binaries (RDNA 3 by default)
- Example: ./barracuda --amdgpu-bin kernel.cu -o kernel.hsaco
Compile for RDNA 2
- Example: ./barracuda --amdgpu-bin --gfx1030 kernel.cu -o kernel.hsaco
Compile for RDNA 4
- Example: ./barracuda --amdgpu-bin --gfx1200 kernel.cu -o kernel.hsaco
Compile to NVIDIA PTX
- Example: ./barracuda --nvidia-ptx kernel.cu -o kernel.ptx
Compile to Tenstorrent Tensix
- Example: ./barracuda --tensix kernel.cu -o kernel_compute.cpp
Dump the IR (for debugging or curiosity)
- Example: ./barracuda --ir kernel.cu
Just parse and dump the AST
- Example: ./barracuda --ast kernel.cu
Run semantic analysis
- Example: ./barracuda --sema kernel.cu
Multilingual error messages
- Example: ./barracuda --lang lang/mi.txt --amdgpu-bin kernel.cu -o kernel.hsaco
These usage patterns illustrate a flexible toolchain that supports both code generation and introspection of the compilation process, as well as localized error reporting.

Runtime Launcher

BarraCUDA ships with a minimal HSA runtime (src/runtime/) intended for dispatching compiled kernels onto real AMD hardware:
There is zero compile-time dependency on ROCm; instead, it loads libhsa-runtime64.so at runtime using dlopen
A compact example command sequence shows compiling the runtime and a launcher together:
- gcc -std=c99 -O2 -I src/runtime examples/launch_saxpy.c src/runtime/bc_runtime.c -ldl -lm -o launch_saxpy
- ./barracuda --amdgpu-bin -o test.hsaco tests/canonical.cu
- ./launch_saxpy test.hsaco
Although nothing from ROCm is needed at compile time, running kernels on AMD hardware still requires ROCm on the host, because libhsa-runtime64.so is resolved when the launcher starts. The example targets Linux and shows the full sequence: build the launcher, compile a kernel, and run it.
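The dlopen approach described above can be sketched as follows. This is a minimal illustration, not the code in src/runtime/: the function name probe_hsa and its return codes are assumptions made for the example. It only checks that the library and the hsa_init symbol can be found, and never calls into the HSA runtime.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* hsa_init takes no arguments and returns a status enum; it is modelled
 * as int here because the pointer is only looked up, never called. */
typedef int (*hsa_init_fn)(void);

/* Hypothetical helper: check whether an HSA runtime can be loaded.
 * Returns  0 if the library and its hsa_init symbol are both found,
 *         -1 if the library cannot be opened,
 *         -2 if the library opens but lacks hsa_init. */
int probe_hsa(const char *libname)
{
    hsa_init_fn init = NULL;
    void *lib;
    int rc;

    assert(libname != NULL);

    lib = dlopen(libname, RTLD_NOW | RTLD_LOCAL);
    if (lib == NULL)
        return -1;                       /* dlerror() has the reason */

    /* POSIX idiom for storing a dlsym result in a function pointer. */
    *(void **)(&init) = dlsym(lib, "hsa_init");
    rc = (init != NULL) ? 0 : -2;

    dlclose(lib);
    return rc;
}
```

Calling probe_hsa("libhsa-runtime64.so") on a machine without ROCm returns -1 instead of failing at link time, which is the property that lets the compiler and launcher build with no ROCm headers or libraries present.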

What Works

BarraCUDA asserts support for a broad set of CUDA features and constructs, with working results on multiple architectures:
Core language features:
- Function qualifiers: __global__, __device__, __host__
- Built-ins: threadIdx, blockIdx, blockDim, gridDim
- Structs, enums, typedefs, namespaces
- Pointers, arrays, pointer arithmetic
- Full C control flow constructs: if/else, for, while, do-while, switch/case, goto/label
- Short-circuit logical operators (&&, ||)
- Ternary operator
- Templates (basic instantiation)
- Multiple return paths, continue, break
CUDA features:
- __shared__ memory and LDS-backed allocation
- __syncthreads() mapped to s_barrier
- Atomic operations: atomicAdd, atomicSub, atomicMin, atomicMax, atomicExch, atomicCAS, atomicAnd, atomicOr, atomicXor
- Warp intrinsics: __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync
- Warp votes: __ballot_sync, __any_sync, __all_sync
- Vector types: float2/float3/float4, int2/int3/int4 with component accessors
- Half precision: __half, __float2half(), __half2float(), __nv_bfloat16
- __launch_bounds__ (parsed, propagated, enforces VGPR caps)
- Cooperative groups: cooperative_groups::this_thread_block().sync(), .thread_rank(), .size()
- Operator overloading, math builtins (sqrtf, rsqrtf, expf, exp2f, logf, log2f, log10f, sinf, cosf, tanf, tanhf, powf, fabsf, floorf, ceilf, truncf, roundf, rintf, fmaxf, fminf, fmodf, copysignf)
- __constant__ memory and __device__ globals
Compiler features:
- Full C preprocessor (#include, #define/#undef, function-like macros, #ifdef/#ifndef/#if/#elif/#else/#endif, #pragma, #error, -I/-D flags)
- Error recovery that reports multiple errors without cascading
- Multilingual error messages via --lang, with language-neutral error codes
- Source location tracking in IR dumps
- Struct pass-by-value semantics
The Example section below shows a vector_add kernel compiled to a .hsaco file, with no LLVM involved.
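The language-neutral error codes mentioned under compiler features lend themselves to a simple table lookup. The sketch below is an assumption about how such a scheme could work, not BarraCUDA's implementation: the struct msg layout, the function error_text, the message texts, and the meaning assigned to each code are all invented for illustration, and the format of the real translation files is not described in the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per error: a stable code plus its text in one language. */
struct msg {
    const char *code;
    const char *text;
};

/* Hypothetical built-in English table (codes and texts are invented). */
static const struct msg english[] = {
    { "E001", "unexpected token" },
    { "E002", "undeclared identifier" },
};

/* Hypothetical partial translation, as if loaded from an external file. */
static const struct msg sample_translation[] = {
    { "E001", "jeton inattendu" },
};

static const char *find(const struct msg *tab, size_t n, const char *code)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].code, code) == 0)
            return tab[i].text;
    return NULL;
}

/* Look a code up in the chosen language, fall back to English, and
 * finally fall back to the bare code so output is never empty. */
const char *error_text(const struct msg *lang, size_t n_lang, const char *code)
{
    const char *text;

    assert(code != NULL);
    text = find(lang, n_lang, code);
    if (text == NULL)
        text = find(english, sizeof english / sizeof english[0], code);
    return (text != NULL) ? text : code;
}
```

Because the code is the key, an incomplete translation degrades to English for the missing entries, and a report that quotes E001 can be matched to the same diagnostic regardless of the language it was printed in.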

Example

The canonical CUDA example is a vector addition kernel:
Kernel code example:
- __global__ void vector_add(float *c, float *a, float *b, int n) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < n) c[idx] = a[idx] + b[idx]; }
A sample execution flow shows compiling to AMD binary and producing a .hsaco file:
Command: ./barracuda --amdgpu-bin vector_add.cu -o vector_add.hsaco
Result: wrote vector_add.hsaco (528 bytes code, 1 kernels)
The absence of an LLVM requirement is emphasized as a design feature.
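When checking a compiled kernel like this one, a host-side reference is a convenient baseline. The following C99 sketch replays the kernel's index arithmetic on the CPU by looping over every block and thread. The names vector_add_ref and blocks_for are hypothetical helpers, not part of BarraCUDA.

```c
#include <assert.h>

/* CPU reference for vector_add: visit every (block, thread) pair the
 * GPU launch would create and apply the same bounds guard. */
void vector_add_ref(float *c, const float *a, const float *b, int n,
                    int grid_dim, int block_dim)
{
    assert(grid_dim > 0 && block_dim > 0);

    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            /* Same expression as threadIdx.x + blockIdx.x * blockDim.x */
            int idx = thread + block * block_dim;
            if (idx < n)
                c[idx] = a[idx] + b[idx];
        }
    }
}

/* Smallest grid size whose threads cover n elements (ceiling division). */
int blocks_for(int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}
```

With n = 5 and a block size of 4, blocks_for returns 2, so eight threads run and the idx < n guard keeps the last three from writing past the end of the arrays.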

Validated on Hardware

BarraCUDA-compiled kernels have been checked on the following targets; results come from real hardware except where an emulator or pending validation is noted:
AMD MI300X (CDNA3, GFX942) — 8/8 test kernels passing; Monte Carlo neutron transport achieving correct physics (k_eff = 0.995)
AMD RDNA3 (GFX1100) — Full test suite passing via RDNA3 emulator CI
NVIDIA RTX 4060 Ti — PTX backend, loaded via CUDA Driver API, JIT-compiled by the NVIDIA driver; Monte Carlo neutron transport benchmark shows speedup over CPU
Tenstorrent Blackhole — Compiles to valid Metalium C++; hardware validation pending dev kit access
The results underscore cross-vendor ambition and practical verification against reference physics benchmarks.

What Doesn’t Work (Yet)

The project is transparent about current gaps and ongoing work:
Parameter reassignment in device functions (workaround: copy the parameter into a local variable)
Textures and surfaces
Dynamic parallelism (device-side kernel launches)
Multiple translation units
Host code generation (only device code is compiled)
The note clarifies that these limitations are not fundamental blockers but items yet to be implemented.

Test Suite

A modest but meaningful test corpus accompanies the project:
14 test files, 35+ kernels, ~1,700 BIR instructions, ~27,000 bytes of machine code
Examples include:
- vector_add.cu (the hello world kernel)
- cudafeatures.cu (atomics, warp ops, barriers, control flow constructs)
- testtier12.cu (vectors, shared memory, operator overloading)
- notgpt.cu (sarcastic AI-generated CUDA with diverse features)
- stress.cu (N-body simulation, complex control flows, bit manipulation)
- canonical.cu (NVIDIA sample patterns adapted for the parser)
- testerrors.cu (deliberate syntax errors for error recovery)
- testlaunchbounds.cu (__launch_bounds__ parsing and VGPR cap enforcement)
- testcoop_groups.cu (cooperative groups lowering)
- mymathhomework.cu (trig identities, Newton-Raphson, etc.)
- plus tests for the preprocessor, templates, and unsigned integer handling
The curated test set demonstrates a breadth of CUDA features and parser/lowerer coverage.

Roadmap

Near Term: Hardening
Targeted fixes for known gaps (integer literal suffixes, const correctness, parameter reassignment)
Objective: compile real-world .cu files without source-level modifications
Medium Term: Optimization
Current progress includes instruction scheduling, constant folding, dead code elimination, and divergence-aware SSA register allocation
Priorities include loop-invariant code motion and occupancy tuning based on register pressure
Long Term: More Architectures
The IR (BIR) is architecture-agnostic, with a clean backend separation
Adding a new target requires implementing a new isel and emit pair
Proposed expansions:
- NVIDIA PTX (already done)
- Tenstorrent Tensix (already done)
- Intel Arc (Xe)
- RISC-V Vector Extension
The roadmap conveys a pragmatic, staged approach to expansion and optimization.
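Two of the optimizations named in the roadmap, constant folding and dead code elimination, can be shown on a toy example. This is a minimal sketch assuming a hypothetical straight-line IR in which operands always refer to earlier instructions; the types and function names are invented for illustration and do not reflect BIR or BarraCUDA's passes. Integer overflow is ignored for brevity.

```c
#include <assert.h>

/* Hypothetical toy IR: instruction i defines value %i. */
enum op { OP_CONST, OP_PARAM, OP_ADD, OP_MUL, OP_RET, OP_DEAD };

struct inst {
    enum op op;
    int a;  /* CONST: literal; PARAM: index; ADD/MUL/RET: operand value */
    int b;  /* ADD/MUL: second operand value                            */
};

#define MAX_INSTS 64

/* Replace ADD/MUL of two constants with a single constant.
 * A forward pass also folds chains, since operands come first.
 * Returns the number of instructions folded. */
int fold_constants(struct inst *code, int n)
{
    int folded = 0;

    for (int i = 0; i < n; i++) {
        struct inst *in = &code[i];
        if (in->op != OP_ADD && in->op != OP_MUL)
            continue;
        if (code[in->a].op != OP_CONST || code[in->b].op != OP_CONST)
            continue;

        int x = code[in->a].a;
        int y = code[in->b].a;
        in->a = (in->op == OP_ADD) ? x + y : x * y;
        in->b = 0;
        in->op = OP_CONST;
        folded++;
    }
    return folded;
}

/* Mark everything reachable from a RET as live, walking backwards,
 * and turn the rest into OP_DEAD. Returns the number removed. */
int eliminate_dead(struct inst *code, int n)
{
    unsigned char live[MAX_INSTS] = { 0 };
    int removed = 0;

    assert(n >= 0 && n <= MAX_INSTS);
    for (int i = n - 1; i >= 0; i--) {
        struct inst *in = &code[i];
        if (in->op == OP_RET)
            live[i] = 1;
        if (!live[i]) {
            if (in->op != OP_DEAD) {
                in->op = OP_DEAD;
                removed++;
            }
            continue;
        }
        if (in->op == OP_RET) {
            live[in->a] = 1;
        } else if (in->op == OP_ADD || in->op == OP_MUL) {
            live[in->a] = 1;
            live[in->b] = 1;
        }
    }
    return removed;
}
```

Running the folder first matters: once an add of two constants becomes a constant itself, its original operands may have no remaining uses, and the backward liveness walk then removes them.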

Contributing

The project welcomes issues and pull requests in any language, with a requirement to include an English translation alongside.
See CONTRIBUTING.md for guidelines on style, naming, and contribution processes.
The project notes that some identifiers (HLASM-style short labels like ragc, mkhash, enc_vop3) are culturally neutral by accident, underscoring a language-agnostic naming approach.
The contributor invitation emphasizes inclusive collaboration and the value of diverse linguistic backgrounds in debugging and improvement.

Changelog

A running log records development from the initial release on 2026-02-16 through 2026-03-18, newest entry first:
2026-03-18 — NVIDIA PTX backend (--nvidia-ptx): CUDA to PTX text, loaded via CUDA Driver API and JIT-compiled by NVIDIA; validated on RTX 4060 Ti; anonymous struct/union support added in parser, sema, and lowerer
2026-03-14 — Divergence-aware SSA register allocator (--ssa-ra): dramatically reduces VGPR spills on Monte Carlo kernels; substantial performance and memory savings; ~1,300 lines of C99
2026-03-09 — Post-isel verification pass (bc_vfy): validates instructions post-isel and post-RA; identified and fixed multiple encoding issues; improved diagnostics with post-mortem-style tooling
2026-03-08 — Error localization infrastructure: language-neutral IDs (E001–E111), external translation files, unified error structures
2026-03-05 — CDNA 3 additions: GFX942 hardening, MFMA, Wave64 divergence, Tinygrad compatibility; 8/8 tests on MI300X
2026-03-05 — Instruction scheduling
2026-03-03 — CDNA 2 support (--gfx90a, MI250) and Tinygrad compatibility
2026-02-28 — Tenstorrent Tensix backend (--tensix): CUDA to TT-Metalium C++; constant folding; dead code elimination
2026-02-25 — HSA runtime launcher; RDNA 2 support (--gfx1030); test suite
2026-02-20 — RDNA 4 support (--gfx1200)
2026-02-16 — Initial release: CUDA compiler targeting AMD RDNA 3 (gfx1100)
The changelog documents a trajectory of backend expansion, correctness improvements, and runtime tooling enhancements.

Contact

If you encounter bugs or wish to discuss details of AMDGPU instruction encoding, the project provides contact options:
Email: zanehambly@gmail.com
Open an issue for discussion or collaboration
The author identifies as based in New Zealand and emphasizes a candid, informal stance about development progress and collaboration.

License

BarraCUDA is released under the Apache 2.0 license, enabling broad use, modification, and distribution rights.

Acknowledgements

The project expresses gratitude to several groups and individuals:
Fernando Magno Quintão Pereira and the Compilers Lab at UFMG for guidance and divergence analysis ideas
The broader academic community (Cooper, Harvey & Kennedy; Braun & Hack; Sampaio, Souza, Collange & Pereira) for foundational compiler ideas
Steven Muchnick for Advanced Compiler Design and Implementation as an influential reference
Low Level for educational content that helped in learning C
Abe Kornelis for teaching and the z390 Portable Mainframe Assembler project inspiration
The supportive community for feedback and encouragement
The author’s family for personal support

GitHub - Zaneham/BarraCUDA: BarraCUDA: Open-Source CUDA C++ Compiler