CUDA vs. UDA for Laniakea OS — and the Road to QPU APIs
- Erick Rosado

- Sep 20

TL;DR
CUDA = vendor-specific (NVIDIA) GPU platform with best-in-class tooling and performance.
UDA (Unified Device Architecture) = vendor-neutral idea: one interface for many accelerators (CPU/GPU/FPGA/…QPU).
Laniakea OS supports both: CUDA for peak GPU speed, and a UDA-style abstraction for portability today and QPU integration tomorrow.
1) What is CUDA?
NVIDIA’s Compute Unified Device Architecture: a programming model (kernels, grids/blocks/threads), compiler toolchain (nvcc), driver/runtime, and libraries (cuBLAS, cuDNN, NCCL, Thrust…) that expose massive GPU parallelism for general-purpose compute.
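To make the model concrete, here is a minimal kernel sketch from Python, assuming the third-party Numba package and an NVIDIA GPU are available; the same kernel/grid/block/thread ideas carry over directly to native CUDA C++.

```python
# Minimal sketch of the CUDA execution model via Numba (assumed installed).
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)          # global thread index across the whole grid
    if i < out.size:          # guard: the grid may be larger than the data
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks_per_grid, threads_per_block](2.0, x, y, out)  # grid/block launch
```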
Why teams pick it
Mature ecosystem and profilers (Nsight), highly optimized libs, broad cloud/on-prem availability.
Tight control over memory hierarchy (global/shared/constant) and occupancy.
2) What is “UDA” (Unified Device Architecture)?
A concept rather than a single product: a unified API that can target diverse accelerators (CPU, NVIDIA/AMD GPU, FPGA, DSP—and eventually QPU). Think of it as the portability layer that keeps app code stable while the backend changes.
Concrete realizations of “UDA-like” portability include:
SYCL / oneAPI (Khronos/Intel): C++ single-source kernels, backends for CPU, Level-Zero, CUDA, HIP.
OpenCL: cross-vendor compute API (lower-level).
HIP/ROCm (AMD): CUDA-like model for AMD GPUs, with some CUDA translation.
ML framework runtimes that dispatch the same graph to multiple backends (sketched below).
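One way to see the idea in miniature is the array-namespace pattern: the same function runs on CPU (NumPy) or NVIDIA GPU (CuPy) because both libraries expose the same array API. A rough sketch, assuming the optional cupy package is installed:

```python
# UDA-style portability in miniature: pick a backend, keep the code identical.
import numpy as np

try:
    import cupy as cp      # CUDA backend, if present
    xp = cp
except ImportError:
    xp = np                # CPU fallback

def normalize(v):
    """Backend-agnostic: works on numpy or cupy arrays alike."""
    return v / xp.linalg.norm(v)

v = xp.asarray([3.0, 4.0])
print(normalize(v))        # [0.6 0.8] on either backend
```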
3) Why both matter to Laniakea OS
Performance now (CUDA): When NVIDIA GPUs are present, CUDA provides the fastest path.
Flexibility forever (UDA): A portability layer lets Laniakea schedule the same job on CPUs/GPUs/FPGAs—and future QPUs—without app rewrites.
Operations: A unified telemetry schema (utilization, memory, errors, energy) enables one scheduler to optimize placement across all devices.
4) Side-by-side comparison

| | CUDA | UDA (concept) |
|---|---|---|
| Vendor scope | NVIDIA GPUs only | Any accelerator (CPU/GPU/FPGA/…QPU) |
| Tooling | Mature (nvcc, Nsight, cuBLAS/cuDNN/NCCL) | Varies by realization (SYCL/oneAPI, OpenCL, HIP) |
| Peak performance | Best-in-class on NVIDIA hardware | Depends on the backend it targets |
| Portability | Locked to one vendor | App code stays stable as backends change |
| Role in Laniakea OS | Fast path when NVIDIA GPUs are present | Scheduling layer across heterogeneous devices, QPU-ready |
5) Laniakea OS: runtime & telemetry blueprint
Scheduler responsibilities
Discover devices (CPU/GPU/FPGA/QPU) → capabilities (SMs/CU, mem, drivers, features).
Normalize telemetry (utilization, mem pressure, temp, power, errors, queue depth).
Match workload → backend: dense_linear_algebra → CUDA/cuBLAS, bit-exact streaming → FPGA, combinatorial search → QPU (hybrid) + CPU (sketched below).
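A rough sketch of that matching step, with hypothetical names; a real scheduler would also consult live telemetry such as queue depth and memory pressure:

```python
# Illustrative workload → backend preference table and matcher.
PREFERENCES = {
    "dense_linear_algebra": ["cuda", "rocm", "cpu"],
    "bit_exact_streaming":  ["fpga", "cpu"],
    "combinatorial_search": ["qpu", "cpu"],   # hybrid: QPU + CPU control loop
}

def match_backend(workload_kind, available):
    """Return the first preferred backend that is actually present."""
    for backend in PREFERENCES.get(workload_kind, ["cpu"]):
        if backend in available:
            return backend
    return "cpu"

print(match_backend("dense_linear_algebra", {"cpu", "rocm"}))  # -> "rocm"
```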
Unified telemetry schema (sample)
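A minimal sketch of such a record in Python; the field names are illustrative, and the point is only that every backend reports the same shape so one scheduler can compare devices directly:

```python
# One normalized telemetry record per device per tick (illustrative fields).
from dataclasses import dataclass

@dataclass
class DeviceTelemetry:
    handle: str            # e.g. "cuda:0", "fpga:xcvu9p-0", "qpu:rigetti:Ankaa-9Q"
    utilization: float     # 0.0 - 1.0
    mem_used_bytes: int
    mem_total_bytes: int
    temperature_c: float
    power_watts: float
    error_count: int       # accumulated device errors since boot
    queue_depth: int       # jobs waiting on this device

    @property
    def mem_pressure(self) -> float:
        return self.mem_used_bytes / self.mem_total_bytes
```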
6) “Workload → Accelerator” quick guide

| Workload | Starting accelerator |
|---|---|
| Dense linear algebra (BLAS, convolutions) | NVIDIA GPU (CUDA/cuBLAS), else AMD GPU (ROCm/HIP) |
| Bit-exact / deterministic streaming | FPGA |
| Irregular graph traversal | FPGA or CPU |
| Combinatorial search / sampling | QPU (hybrid) + CPU |

Note: choose by measured performance; the above are starting heuristics.
7) The road to QPU APIs

Today’s QPUs are consumed as remote, queued services: a client submits a circuit (gate-model) or a problem (annealer), waits for the job to run, and post-processes the measurements classically. That maps naturally onto an async submit/await interface, and most near-term value comes from hybrid loops in which a classical optimizer repeatedly evaluates parameterized quantum programs. For Laniakea OS, a QPU is therefore just another asynchronous device handle (e.g., qpu:rigetti:Ankaa-9Q) with unusually high latency and a job queue.
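A sketch of what that hybrid loop could look like; every name here (submit_circuit, hybrid_optimize) is a hypothetical placeholder for whatever vendor SDK the backend wraps:

```python
# Hybrid classical/quantum loop: submit → await result → classical update.
import asyncio

async def submit_circuit(handle, params):
    """Placeholder: enqueue a parameterized circuit on a remote QPU."""
    await asyncio.sleep(0.1)                  # stands in for queue + execution
    return sum(p * p for p in params)         # stands in for a measured cost

async def hybrid_optimize(handle, params, steps=10, lr=0.1):
    for _ in range(steps):
        cost = await submit_circuit(handle, params)   # quantum evaluation
        params = [p - lr * 2 * p for p in params]     # classical gradient step
    return params, cost

params, cost = asyncio.run(hybrid_optimize("qpu:rigetti:Ankaa-9Q", [1.0, -0.5]))
```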
8) Example graph
Relative throughput (illustrative) across backends. Baseline CPU = 1.0; higher is better (not real benchmarks).
Vector Math (BLAS1)
CPU |████████████ (1.0)
CUDA |██████████████████████████████ (8.0)
ROCm/HIP |██████████████████████████ (6.5)
FPGA |███████████ (3.0)
QPU(h) |██ (0.2) [classical task; QPU not ideal]
Tensor Convolution
CPU |████████████ (1.0)
CUDA |██████████████████████████████████████ (10.0)
ROCm/HIP |███████████████████████████████ (7.5)
FPGA |██████████████ (4.0)
QPU(h) |██ (0.2)
Graph Traversal (irregular)
CPU |████████████ (1.0)
CUDA |████████████████████ (3.5)
ROCm/HIP |██████████████████ (3.0)
FPGA |███████████████ (3.8)
QPU(h) |███ (0.3)
Note: Values are illustrative to visualize trade-offs; always measure on your hardware.
9) Two tables you can paste into docs
A) Device inventory → backend handle

| Device | Backend handle (example) |
|---|---|
| CPU | cpu:0 |
| NVIDIA GPU | cuda:0 |
| AMD GPU | rocm:1 |
| FPGA | fpga:xcvu9p-0 |
| QPU (remote) | qpu:rigetti:Ankaa-9Q |
B) Scheduler decision hints

| Condition / workload | Hint |
|---|---|
| dense_linear_algebra | Prefer cuda:* (cuBLAS), then rocm:* |
| bit-exact streaming | Prefer fpga:* |
| combinatorial search | Prefer qpu:* (hybrid) with a cpu:* control loop |
| High queue depth or memory pressure on preferred device | Route to next-best backend |
10) Minimal pseudo-API (portable compute call)
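A minimal sketch, with hypothetical names throughout (LaniakeaRuntime, Job, submit): apps express intent (dense, bit-exact, hybrid-quantum) and await a job, while the runtime resolves a concrete device handle.

```python
# Portable compute call: intent in, awaitable job out (all names illustrative).
import asyncio
from dataclasses import dataclass

@dataclass
class Job:
    intent: str
    payload: dict
    handle: str = ""
    result: object = None

class LaniakeaRuntime:
    def __init__(self, devices):
        self.devices = devices                 # e.g. ["cpu:0", "cuda:0"]

    def _place(self, intent):
        order = {"dense": "cuda", "bit_exact": "fpga", "hybrid_quantum": "qpu"}
        want = order.get(intent, "cpu")
        return next((d for d in self.devices if d.startswith(want)),
                    self.devices[0])

    async def submit(self, intent, payload):
        job = Job(intent, payload, handle=self._place(intent))
        await asyncio.sleep(0)                 # stand-in for real dispatch
        job.result = f"ran {intent} on {job.handle}"
        return job

async def main():
    rt = LaniakeaRuntime(["cpu:0", "cuda:0", "fpga:xcvu9p-0"])
    job = await rt.submit("dense", {"op": "gemm", "m": 4096})
    print(job.result)                          # -> "ran dense on cuda:0"

asyncio.run(main())
```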
11) Practical guidance for Laniakea OS
Support CUDA directly where present; keep UDA-style path for everything else.
Unify telemetry early; scheduling wins come from shared metrics.
Abstract jobs (not devices): express intent (dense, sparse, bit-level, hybrid-quantum).
Prepare for QPUs via a clean async “submit/await” interface and hybrid loops.
Continuously benchmark; keep routing decisions empirical, not dogmatic (a tiny sketch follows this list).
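For instance, an empirical router can be as simple as timing the same job on each candidate backend and caching the winner; a rough sketch with illustrative names, where run_on() would wrap the real dispatch:

```python
# Empirical routing: measure, then pick the fastest backend for this job shape.
import time

def run_on(backend, job):
    """Placeholder for real dispatch; here it just burns proportional time."""
    time.sleep({"cpu:0": 0.03, "cuda:0": 0.005}.get(backend, 0.02))

def pick_fastest(backends, job, trials=3):
    timings = {}
    for b in backends:
        start = time.perf_counter()
        for _ in range(trials):
            run_on(b, job)
        timings[b] = (time.perf_counter() - start) / trials
    return min(timings, key=timings.get)

print(pick_fastest(["cpu:0", "cuda:0"], {"op": "gemm"}))   # -> "cuda:0"
```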
12) One-page glossary
CUDA: NVIDIA platform for general GPU compute.
UDA (concept): one API for many accelerators.
SYCL/oneAPI: portable C++ single-source model targeting multiple backends.
HIP/ROCm: AMD’s CUDA-like stack.
QPU: Quantum Processing Unit (gate-model/annealer).
Hybrid: classical control loop + quantum subroutines.
Telemetry: normalized device metrics for scheduling.