CUDA vs. UDA for Laniakea OS — and the Road to QPU APIs
- Erick Rosado 
- Sep 20
- 5 min read

TL;DR
- CUDA = vendor-specific (NVIDIA) GPU platform with best-in-class tooling and performance. 
- UDA (Unified Device Architecture) = vendor-neutral idea: one interface for many accelerators (CPU/GPU/FPGA/…QPU). 
- Laniakea OS supports both: CUDA for peak GPU speed, and a UDA-style abstraction for portability today and QPU integration tomorrow. 
1) What is CUDA?
NVIDIA’s Compute Unified Device Architecture: a programming model (kernels, grids/blocks/threads), compiler toolchain (nvcc), driver/runtime, and libraries (cuBLAS, cuDNN, NCCL, Thrust…) that expose massive GPU parallelism for general-purpose compute.
Why teams pick it
- Mature ecosystem and profilers (Nsight), highly optimized libs, broad cloud/on-prem availability. 
- Tight control over memory hierarchy (global/shared/constant) and occupancy. 
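To make the grid/block/thread vocabulary concrete, here is a plain-Python emulation of how a one-dimensional CUDA launch gives each thread a global index. It is only a sketch: real kernels are written in CUDA C/C++ and run in parallel on the GPU, and the helper names (`launch_1d`, `saxpy_thread`) are hypothetical.

```python
import math

def saxpy_thread(block_idx, block_dim, thread_idx, a, x, y, out):
    """Body of one emulated thread: out[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx  # global index, as in a CUDA kernel
    if i < len(x):                          # guard: the last block may overshoot
        out[i] = a * x[i] + y[i]

def launch_1d(kernel, n, block_dim, *args):
    """Emulate a 1-D launch sequentially; a GPU would run the threads in parallel."""
    grid_dim = math.ceil(n / block_dim)     # number of blocks needed to cover n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)
    return grid_dim

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)
blocks = launch_1d(saxpy_thread, len(x), 2, 2.0, x, y, out)
print(blocks, out)
```

With five elements and a block size of two, the launch needs three blocks, and the bounds check keeps the sixth emulated thread from writing past the end.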
2) What is “UDA” (Unified Device Architecture)?
A concept rather than a single product: a unified API that can target diverse accelerators (CPU, NVIDIA/AMD GPU, FPGA, DSP—and eventually QPU). Think of it as the portability layer that keeps app code stable while the backend changes.
Concrete realizations of “UDA-like” portability include:
- SYCL / oneAPI (Khronos/Intel): C++ single-source kernels, backends for CPU, Level Zero, CUDA, HIP. 
- OpenCL: cross-vendor compute API (lower-level). 
- HIP/ROCm (AMD): CUDA-like model for AMD GPUs, with some CUDA translation. 
- ML framework runtimes that dispatch to multiple backends. 
3) Why both matter to Laniakea OS
- Performance now (CUDA): When NVIDIA GPUs are present, CUDA provides the fastest path. 
- Flexibility forever (UDA): A portability layer lets Laniakea schedule the same job on CPUs/GPUs/FPGAs—and future QPUs—without app rewrites. 
- Operations: A unified telemetry schema (utilization, memory, errors, energy) enables one scheduler to optimize placement across all devices. 
4) Side-by-side comparison
| Axis | CUDA (NVIDIA) | UDA-style (e.g., SYCL/oneAPI/OpenCL/HIP wrappers) | 
| Vendor scope | NVIDIA only | Multi-vendor / multi-device | 
| Performance | Peak on NVIDIA GPUs via tuned libs | Competitive; depends on backend vendor libraries | 
| Tooling | Nsight Systems/Compute, CUPTI (nvprof is legacy), rich ecosystem | Improving (VTune/Advisor, Codeplay tools, ROCm tools), more variance | 
| Portability | Low (CUDA code) | High (single source, multiple targets) | 
| Maintenance | Duplicate paths for non-NVIDIA | One codebase; backend adapters | 
| Best use | NVIDIA-heavy fleets, latency-critical paths | Heterogeneous fleets, long-term portability, QPU runway | 
5) Laniakea OS: runtime & telemetry blueprint
Scheduler responsibilities
- Discover devices (CPU/GPU/FPGA/QPU) → capabilities (SMs/CUs, memory, drivers, features). 
- Normalize telemetry (utilization, mem pressure, temp, power, errors, queue depth). 
- Match workload → backend: dense_linear_algebra → CUDA/cuBLAS, bit-exact streaming → FPGA, combinatorial search → QPU(hybrid)+CPU. 
Unified telemetry schema (sample)
| Metric | CPU | GPU | FPGA | QPU(hybrid) | Notes | 
| Utilization (%) | ✓ | ✓ | ✓ | ✓ | normalized per backend | 
| Mem used / total | ✓ | ✓ (HBM/VRAM) | ✓ | ✓ (shots cache) | |
| Temperature (°C) | ✓ | ✓ | ✓ | — | QPU uses dilution-fridge metrics instead | 
| Power (W) | ✓ | ✓ | ✓ | — | QPU has fridge power; abstract separately | 
| Error counters | ✓ | ✓ (ECC) | ✓ | ✓ (T1/T2 drift, readout err) | |
| Queue depth | ✓ | ✓ | ✓ | ✓ | back-pressure signal | 
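One way to carry this schema in code is a single record type with optional fields for the metrics a backend does not report. The sketch below assumes illustrative field names and a made-up raw GPU reading; none of it is a fixed Laniakea OS specification.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DeviceTelemetry:
    """One normalized sample; field names are illustrative, not a fixed spec."""
    backend: str                            # "cpu", "gpu", "fpga" or "qpu"
    utilization_pct: float
    mem_used: int
    mem_total: int
    queue_depth: int
    temperature_c: Optional[float] = None   # None for QPU (fridge metrics instead)
    power_w: Optional[float] = None         # None for QPU (fridge power is separate)
    errors: dict = field(default_factory=dict)

    def mem_pressure(self) -> float:
        """Fraction of memory in use, from 0.0 to 1.0."""
        return self.mem_used / self.mem_total if self.mem_total else 0.0

def normalize_gpu(raw: dict) -> DeviceTelemetry:
    """Map a hypothetical raw GPU reading onto the shared schema."""
    return DeviceTelemetry(
        backend="gpu",
        utilization_pct=float(raw["util"]),
        mem_used=raw["vram_used_mib"],
        mem_total=raw["vram_total_mib"],
        queue_depth=raw.get("pending", 0),
        temperature_c=raw.get("temp_c"),
        power_w=raw.get("power_w"),
        errors={"ecc": raw.get("ecc_errors", 0)},
    )

gpu = normalize_gpu({"util": 75, "vram_used_mib": 6144, "vram_total_mib": 24576,
                     "temp_c": 61.0, "power_w": 210.0})
qpu = DeviceTelemetry(backend="qpu", utilization_pct=40.0, mem_used=0,
                      mem_total=0, queue_depth=3, errors={"readout_err": 0.02})
print(gpu.mem_pressure(), qpu.temperature_c)
```

Leaving temperature and power as `None` for the QPU mirrors the "—" cells in the table, so the scheduler can tell "not applicable" apart from a reading of zero.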
6) “Workload → Accelerator” quick guide
| Workload pattern | Best default | Portable fallback | 
| Dense BLAS (GEMM/conv) | CUDA (cuBLAS/cuDNN) | oneAPI MKL / ROCm rocBLAS / CPU MKL | 
| Sparse / graph traversals | GPU (CUDA) or CPU if branchy | SYCL/OpenCL backends | 
| Streaming bit-level pipelines | FPGA | CPU SIMD / GPU custom kernels | 
| Combinatorial optimization / QAOA-like | Hybrid QPU + CPU/GPU | Classical heuristic meta-solvers | 
Note: choose based on measured performance; the defaults above are only starting heuristics.
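The guide can be read as a lookup with an ordered fallback list. Here is a minimal sketch, assuming hypothetical workload keys and backend names, and collapsing each table cell to a single identifier:

```python
# Default backend and ordered fallbacks per workload pattern.
# Keys and backend names are hypothetical shorthand for the table rows.
ROUTES = {
    "dense_blas": ("cuda", ["oneapi", "rocm", "cpu"]),
    "sparse_graph": ("cuda", ["sycl", "opencl", "cpu"]),
    "bit_streaming": ("fpga", ["cpu", "cuda"]),
    "combinatorial": ("qpu_hybrid", ["cpu"]),
}

def pick_backend(workload: str, available: set) -> str:
    """Return the default backend if present, else the first available fallback."""
    default, fallbacks = ROUTES[workload]
    for candidate in [default, *fallbacks]:
        if candidate in available:
            return candidate
    raise LookupError(f"no available backend for {workload!r}")

print(pick_backend("dense_blas", {"rocm", "cpu"}))
```

In practice the static order would be a seed that measured results overwrite, in line with the note above.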
7) The road to QPU APIs
| Layer | What it does | Today’s analog | Laniakea plan | 
| Circuit DSL | Express circuits/ansätze | OpenQASM, QIR | Ingest standard DSLs | 
| IR & transpile | Map to hardware topology | MLIR-Quantum, QIR, TKET | | 
| Runtime | Submit circuits/jobs; manage shots | Cloud runtimes (gRPC/REST) | Unified “DeviceQueue” API | 
| Telemetry | Shots/s, queue depth, error rates, calib drift | Provider dashboards | Normalize to scheduler schema | 
| Hybrid control | Classical loop around quantum calls | Orchestration SDKs | Built-in hybrid tasks (async futures) | 
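The "DeviceQueue" and "async futures" rows can be sketched together as a classical loop that awaits quantum calls. Everything here is hypothetical: the `DeviceQueue` class, its `submit` signature, and the fake energy function `(theta - 1)^2` that stands in for a real provider runtime.

```python
import asyncio

class DeviceQueue:
    """Hypothetical unified queue; a real one would call a provider runtime."""
    def __init__(self, handle: str):
        self.handle = handle

    async def submit(self, circuit: dict, shots: int) -> dict:
        """Pretend to run a circuit and return a fake expectation value."""
        await asyncio.sleep(0)  # stand-in for network and queue latency
        return {"shots": shots, "energy": (circuit["theta"] - 1.0) ** 2}

async def hybrid_minimize(queue, theta=0.0, step=0.5, iters=12, shots=1000):
    """Classical search loop wrapped around asynchronous quantum calls."""
    best = (await queue.submit({"theta": theta}, shots))["energy"]
    for _ in range(iters):
        # Evaluate both neighbours concurrently, as two in-flight jobs.
        left, right = await asyncio.gather(
            queue.submit({"theta": theta - step}, shots),
            queue.submit({"theta": theta + step}, shots),
        )
        candidates = [(best, theta),
                      (left["energy"], theta - step),
                      (right["energy"], theta + step)]
        new_best, new_theta = min(candidates)
        if new_theta == theta:
            step /= 2  # no improvement: refine the search step
        best, theta = new_best, new_theta
    return theta, best

theta, energy = asyncio.run(hybrid_minimize(DeviceQueue("qpu://provider/systemA")))
print(theta, energy)
```

The point of the shape is that the optimizer never blocks on a single job: each iteration submits a batch and awaits the results, which is what a scheduler needs in order to interleave other work.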
8) Example graph
Relative throughput (illustrative) across backends. Baseline CPU = 1.0; higher is better (not real benchmarks).
Vector Math (BLAS1)
CPU      |████ (1.0)
CUDA     |████████████████████████████████ (8.0)
ROCm/HIP |██████████████████████████ (6.5)
FPGA     |████████████ (3.0)
QPU(h)   |█ (0.2)  [classical task; QPU not ideal]
Tensor Convolution
CPU      |████ (1.0)
CUDA     |████████████████████████████████████████ (10.0)
ROCm/HIP |██████████████████████████████ (7.5)
FPGA     |████████████████ (4.0)
QPU(h)   |█ (0.2)
Graph Traversal (irregular)
CPU      |████ (1.0)
CUDA     |██████████████ (3.5)
ROCm/HIP |████████████ (3.0)
FPGA     |███████████████ (3.8)
QPU(h)   |█ (0.3)
Note: Values are illustrative to visualize trade-offs; always measure on your hardware.
9) Two tables you can paste into docs
A) Device inventory → backend handle
| Device type | Probe | Backend handle | Notes | 
| CPU | /proc/cpuinfo, CPUID, sysfs | uda://cpu/0 | NUMA & SIMD flags | 
| NVIDIA GPU | CUDA runtime cudaGetDeviceProperties | cuda://gpu/0 | SMs, HBM size | 
| AMD GPU | ROCm SMI / HIP | hip://gpu/0 | CUs, HBM size | 
| FPGA | OpenCL / vendor SMI | ocl://fpga/0 | bitstream ID | 
| QPU | Provider gRPC/REST | qpu://provider/systemA | topology, calib time | 
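Because the handles in this table are URI-shaped, the standard library can split them. A small sketch follows; the `BackendHandle` type and its field names are assumptions made for illustration.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

class BackendHandle(NamedTuple):
    backend: str   # URI scheme, e.g. "cuda" or "uda"
    kind: str      # device class or provider, e.g. "gpu"
    device: str    # index or system name, e.g. "0" or "systemA"

def parse_handle(handle: str) -> BackendHandle:
    """Split a handle such as 'cuda://gpu/0' into its three parts."""
    parts = urlsplit(handle)
    device = parts.path.lstrip("/")
    if not (parts.scheme and parts.netloc and device):
        raise ValueError(f"malformed backend handle: {handle!r}")
    return BackendHandle(parts.scheme, parts.netloc, device)

print(parse_handle("cuda://gpu/0"))
print(parse_handle("qpu://provider/systemA"))
```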
B) Scheduler decision hints
| Hint | Source | Effect | 
| job.tags includes "dense-blas" | user/job meta | prefer CUDA/ROCm | 
| power_cap < X | energy policy | prefer CPU/FPGA | 
| qpu.calib_age > threshold | QPU telemetry | delay quantum jobs | 
| gpu.mem_free < model_mem | GPU telemetry | spill or split batch | 
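These hints translate directly into predicate checks. The sketch below assumes flat dictionaries for the job and telemetry, and takes the unspecified thresholds (`X` and `threshold` in the table) as explicit parameters; units are whatever the caller uses consistently.

```python
def decision_hints(job: dict, telemetry: dict, *,
                   power_cap_x: float, calib_age_threshold: float) -> list:
    """Return the effects whose conditions hold, in table order."""
    hints = []
    if "dense-blas" in job.get("tags", ()):
        hints.append("prefer CUDA/ROCm")
    if telemetry["power_cap"] < power_cap_x:
        hints.append("prefer CPU/FPGA")
    if telemetry["qpu.calib_age"] > calib_age_threshold:
        hints.append("delay quantum jobs")
    if telemetry["gpu.mem_free"] < job["model_mem"]:
        hints.append("spill or split batch")
    return hints

# Made-up example values, for illustration only.
job = {"tags": ["dense-blas"], "model_mem": 16}
telemetry = {"power_cap": 300, "qpu.calib_age": 7200, "gpu.mem_free": 8}
hints = decision_hints(job, telemetry, power_cap_x=250, calib_age_threshold=3600)
print(hints)
```

Returning hints rather than a final placement keeps the rules composable: the scheduler can weigh several of them against measured performance.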
10) Minimal pseudo-API (portable compute call)
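One way such a portable compute call could look is a registry of per-backend implementations behind a single entry point. This is a sketch under stated assumptions: `register`, `compute`, and the `prefer` order are hypothetical names, not an actual Laniakea OS API, and only a CPU implementation is registered here.

```python
from typing import Callable, Dict, Tuple

# (operation, backend) -> implementation
_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register(op: str, backend: str):
    """Decorator: register an implementation of `op` for `backend`."""
    def wrap(fn):
        _KERNELS[(op, backend)] = fn
        return fn
    return wrap

@register("dot", "cpu")
def _dot_cpu(x, y):
    """Reference CPU implementation of a dot product."""
    return sum(a * b for a, b in zip(x, y))

def compute(op: str, *args, prefer=("cuda", "hip", "cpu")):
    """Run `op` on the first preferred backend that has an implementation."""
    for backend in prefer:
        fn = _KERNELS.get((op, backend))
        if fn is not None:
            return backend, fn(*args)
    raise NotImplementedError(f"no backend implements {op!r}")

backend, value = compute("dot", [1, 2, 3], [4, 5, 6])
print(backend, value)
```

Application code calls `compute("dot", ...)` and never names a device; adding a CUDA or HIP implementation later only means registering another function.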
11) Practical guidance for Laniakea OS
- Support CUDA directly where present; keep UDA-style path for everything else. 
- Unify telemetry early; scheduling wins come from shared metrics. 
- Abstract jobs (not devices): express intent (dense, sparse, bit-level, hybrid-quantum). 
- Prepare for QPUs via a clean async “submit/await” interface and hybrid loops. 
- Continuously benchmark; keep routing decisions empirical, not dogmatic. 
12) One-page glossary
- CUDA: NVIDIA platform for general GPU compute. 
- UDA (concept): one API for many accelerators. 
- SYCL/oneAPI: portable C++ single-source model targeting multiple backends. 
- HIP/ROCm: AMD’s CUDA-like stack. 
- QPU: Quantum Processing Unit (gate-model/annealer). 
- Hybrid: classical control loop + quantum subroutines. 
- Telemetry: normalized device metrics for scheduling. 
| Topic | CUDA (Compute Unified Device Architecture) | UDA (Unified Device Architecture, vendor-agnostic idea) | 
| Scope | NVIDIA GPUs only | Abstracts CPUs/GPUs/FPGAs/accelerators (potentially QPUs) | 
| Programming model | SIMT, CUDA C/C++, kernels, grids/blocks/threads | Portable device model; map to CUDA, HIP, OpenCL, SYCL, FPGA HDL, future QPU APIs | 
| Tooling | nvcc, Nsight Compute/Systems, cuBLAS/cuDNN/cuFFT | Backend adapters; shared runtime; capability discovery | 
| Strengths | Deep libraries + mature perf | Flexibility across vendors and form factors | 
| Risks | Lock-in, single vendor | Lowest-common-denominator unless backends expose extensions | 
| Unified Telemetry (minimal schema) | Type | Example | 
| job_id | string | wkld-2025-09-17-001 | 
| backend | enum | `cpu`, … | 
| device_id | string | GPU0-AD102 | 
| ts_start, ts_end | ISO 8601 | 2025-09-17T18:22:03Z | 
| workload | string | tensor_convolution | 
| size_hint | string | N=8192, K=1024 | 
| throughput_x | float | 10.0 | 
| latency_ms | float | 14.2 | 
| energy_j | float | 5.8 | 
| errors | array | [] | 
| Workload → Accelerator Quick Guide | Best | Also viable | 
| Dense linear algebra (BLAS/GEMM) | CUDA/ROCm | FPGA (pipelined), CPU (MKL/BLIS) | 
| Convolutions | CUDA/ROCm (cuDNN/MIOpen) | FPGA (streaming) | 
| Irregular graphs | FPGA≈CUDA/ROCm | CPU (NUMA-aware) | 
| Low-latency control | CPU/FPGA | — | 
| Combinatorial search | QPU-hybrid (future) | CPU/FPGA heuristics | 
| Device Inventory → Backend Handle | Example | 
| CPU | cpu:0 | 
| NVIDIA | cuda:0 | 
| AMD | rocm:1 | 
| FPGA | fpga:xcvu9p-0 | 
| QPU (remote) | qpu:rigetti:Ankaa-9Q | 
| Scheduler Decision Hints | Signal | Use | 
| throughput_x | higher is better | primary for batch | 
| latency_ms | lower is better | interactive jobs | 
| energy_j | lower is better | mobile/edge | 
| queue_depth | lower is better | avoid contention | 
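The signal directions above can drive a simple ranking: pick the primary signal for the job class, then break ties on queue depth. The job-class names and all sample numbers below are made up for illustration.

```python
# Primary signal per job class: (field name, higher_is_better).
PRIMARY_SIGNAL = {
    "batch": ("throughput_x", True),
    "interactive": ("latency_ms", False),
    "edge": ("energy_j", False),
}

def rank_devices(candidates: list, job_class: str) -> list:
    """Sort candidates best-first by the primary signal, then by queue depth."""
    signal, higher_is_better = PRIMARY_SIGNAL[job_class]
    sign = -1.0 if higher_is_better else 1.0
    return sorted(candidates, key=lambda d: (sign * d[signal], d["queue_depth"]))

candidates = [
    {"device": "cuda:0", "throughput_x": 10.0, "latency_ms": 14.2,
     "energy_j": 5.8, "queue_depth": 4},
    {"device": "cpu:0", "throughput_x": 1.0, "latency_ms": 3.0,
     "energy_j": 2.1, "queue_depth": 0},
    {"device": "fpga:xcvu9p-0", "throughput_x": 4.0, "latency_ms": 1.5,
     "energy_j": 0.9, "queue_depth": 1},
]
for job_class in PRIMARY_SIGNAL:
    print(job_class, rank_devices(candidates, job_class)[0]["device"])
```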
| Road to QPU APIs (layers) | Purpose | 
| High-level SDK | circuits, annealing, hybrid loop | 
| IR (QIR/OpenQASM) | portable quantum program | 
| Runtime/Orchestrator | qubit map, transpile, error-mitigation | 
| Hardware Driver | pulses, calibration, job submit | 
| CUDA (what it is) | Notes | 
| NVIDIA parallel platform & API | Kernels on GPU SMs, libraries (cuBLAS/cuDNN) | 
| Memory model | Unified Virtual Addressing, streams, events | 
| Why it matters | Peak throughput for dense numeric workloads | 















