CUDA vs. UDA for Laniakea OS — and the Road to QPU APIs
- Erick Rosado

- Sep 20

TL;DR
CUDA = vendor-specific (NVIDIA) GPU platform with best-in-class tooling and performance.
UDA (Unified Device Architecture) = vendor-neutral idea: one interface for many accelerators (CPU/GPU/FPGA/…QPU).
Laniakea OS supports both: CUDA for peak GPU speed, and a UDA-style abstraction for portability today and QPU integration tomorrow.
1) What is CUDA?
NVIDIA’s Compute Unified Device Architecture: a programming model (kernels, grids/blocks/threads), compiler toolchain (nvcc), driver/runtime, and libraries (cuBLAS, cuDNN, NCCL, Thrust…) that expose massive GPU parallelism for general-purpose compute.
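To make the model concrete, here is a minimal kernel sketch from Python, assuming the third-party Numba package and an NVIDIA GPU are available; the same kernel/grid/block/thread ideas carry over directly to native CUDA C++.

```python
# Minimal sketch of the CUDA execution model via Numba (assumed installed).
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)          # global thread index across the whole grid
    if i < out.size:          # guard: the grid may be larger than the data
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks_per_grid, threads_per_block](2.0, x, y, out)  # grid/block launch
```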
Why teams pick it
Mature ecosystem and profilers (Nsight), highly optimized libs, broad cloud/on-prem availability.
Tight control over memory hierarchy (global/shared/constant) and occupancy.
2) What is “UDA” (Unified Device Architecture)?
A concept rather than a single product: a unified API that can target diverse accelerators (CPU, NVIDIA/AMD GPU, FPGA, DSP—and eventually QPU). Think of it as the portability layer that keeps app code stable while the backend changes.
Concrete realizations of “UDA-like” portability include:
SYCL / oneAPI (Khronos/Intel): C++ single-source kernels, backends for CPU, Level-Zero, CUDA, HIP.
OpenCL: cross-vendor compute API (lower-level).
HIP/ROCm (AMD): CUDA-like model for AMD GPUs, with some CUDA translation.
ML framework runtimes that dispatch the same graph to multiple backends (sketched below).
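One way to see the idea in miniature is the array-namespace pattern: the same function runs on CPU (NumPy) or NVIDIA GPU (CuPy) because both libraries expose the same array API. A rough sketch, assuming the optional cupy package is installed:

```python
# UDA-style portability in miniature: pick a backend, keep the code identical.
import numpy as np

try:
    import cupy as cp      # CUDA backend, if present
    xp = cp
except ImportError:
    xp = np                # CPU fallback

def normalize(v):
    """Backend-agnostic: works on numpy or cupy arrays alike."""
    return v / xp.linalg.norm(v)

v = xp.asarray([3.0, 4.0])
print(normalize(v))        # [0.6 0.8] on either backend
```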
3) Why both matter to Laniakea OS
Performance now (CUDA): When NVIDIA GPUs are present, CUDA provides the fastest path.
Flexibility forever (UDA): A portability layer lets Laniakea schedule the same job on CPUs/GPUs/FPGAs—and future QPUs—without app rewrites.
Operations: A unified telemetry schema (utilization, memory, errors, energy) enables one scheduler to optimize placement across all devices.
4) Side-by-side comparison

| | CUDA | UDA (concept) |
|---|---|---|
| Vendor scope | NVIDIA GPUs only | Any accelerator (CPU/GPU/FPGA/…QPU) |
| Tooling | Mature (nvcc, Nsight, cuBLAS/cuDNN/NCCL) | Varies by realization (SYCL/oneAPI, OpenCL, HIP) |
| Peak performance | Best-in-class on NVIDIA hardware | Depends on the backend it targets |
| Portability | Locked to one vendor | App code stays stable as backends change |
| Role in Laniakea OS | Fast path when NVIDIA GPUs are present | Scheduling layer across heterogeneous devices, QPU-ready |
5) Laniakea OS: runtime & telemetry blueprint
Scheduler responsibilities
Discover devices (CPU/GPU/FPGA/QPU) → capabilities (SMs/CU, mem, drivers, features).
Normalize telemetry (utilization, mem pressure, temp, power, errors, queue depth).
Match workload → backend: dense_linear_algebra → CUDA/cuBLAS, bit-exact streaming → FPGA, combinatorial search → QPU (hybrid) + CPU (sketched below).
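A rough sketch of that matching step, with hypothetical names; a real scheduler would also consult live telemetry such as queue depth and memory pressure:

```python
# Illustrative workload → backend preference table and matcher.
PREFERENCES = {
    "dense_linear_algebra": ["cuda", "rocm", "cpu"],
    "bit_exact_streaming":  ["fpga", "cpu"],
    "combinatorial_search": ["qpu", "cpu"],   # hybrid: QPU + CPU control loop
}

def match_backend(workload_kind, available):
    """Return the first preferred backend that is actually present."""
    for backend in PREFERENCES.get(workload_kind, ["cpu"]):
        if backend in available:
            return backend
    return "cpu"

print(match_backend("dense_linear_algebra", {"cpu", "rocm"}))  # -> "rocm"
```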
Unified telemetry schema (sample)
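A minimal sketch of such a record in Python; the field names are illustrative, and the point is only that every backend reports the same shape so one scheduler can compare devices directly:

```python
# One normalized telemetry record per device per tick (illustrative fields).
from dataclasses import dataclass

@dataclass
class DeviceTelemetry:
    handle: str            # e.g. "cuda:0", "fpga:xcvu9p-0", "qpu:rigetti:Ankaa-9Q"
    utilization: float     # 0.0 - 1.0
    mem_used_bytes: int
    mem_total_bytes: int
    temperature_c: float
    power_watts: float
    error_count: int       # accumulated device errors since boot
    queue_depth: int       # jobs waiting on this device

    @property
    def mem_pressure(self) -> float:
        return self.mem_used_bytes / self.mem_total_bytes
```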
6) “Workload → Accelerator” quick guide

| Workload | Starting accelerator |
|---|---|
| Dense linear algebra (BLAS, convolutions) | NVIDIA GPU (CUDA/cuBLAS), else AMD GPU (ROCm/HIP) |
| Bit-exact / deterministic streaming | FPGA |
| Irregular graph traversal | FPGA or CPU |
| Combinatorial search / sampling | QPU (hybrid) + CPU |

Note: choose by measured performance; the above are starting heuristics.
7) The road to QPU APIs

Today’s QPUs are consumed as remote, queued services: a client submits a circuit (gate-model) or a problem (annealer), waits for the job to run, and post-processes the measurements classically. That maps naturally onto an async submit/await interface, and most near-term value comes from hybrid loops in which a classical optimizer repeatedly evaluates parameterized quantum programs. For Laniakea OS, a QPU is therefore just another asynchronous device handle (e.g., qpu:rigetti:Ankaa-9Q) with unusually high latency and a job queue.
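A sketch of what that hybrid loop could look like; every name here (submit_circuit, hybrid_optimize) is a hypothetical placeholder for whatever vendor SDK the backend wraps:

```python
# Hybrid classical/quantum loop: submit → await result → classical update.
import asyncio

async def submit_circuit(handle, params):
    """Placeholder: enqueue a parameterized circuit on a remote QPU."""
    await asyncio.sleep(0.1)                  # stands in for queue + execution
    return sum(p * p for p in params)         # stands in for a measured cost

async def hybrid_optimize(handle, params, steps=10, lr=0.1):
    for _ in range(steps):
        cost = await submit_circuit(handle, params)   # quantum evaluation
        params = [p - lr * 2 * p for p in params]     # classical gradient step
    return params, cost

params, cost = asyncio.run(hybrid_optimize("qpu:rigetti:Ankaa-9Q", [1.0, -0.5]))
```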
8) Example graph
Relative throughput (illustrative) across backends. Baseline CPU = 1.0; higher is better (not real benchmarks).
Vector Math (BLAS1)
CPU |████████████ (1.0)
CUDA |██████████████████████████████ (8.0)
ROCm/HIP |██████████████████████████ (6.5)
FPGA |███████████ (3.0)
QPU(h) |██ (0.2) [classical task; QPU not ideal]
Tensor Convolution
CPU |████████████ (1.0)
CUDA |██████████████████████████████████████ (10.0)
ROCm/HIP |███████████████████████████████ (7.5)
FPGA |██████████████ (4.0)
QPU(h) |██ (0.2)
Graph Traversal (irregular)
CPU |████████████ (1.0)
CUDA |████████████████████ (3.5)
ROCm/HIP |██████████████████ (3.0)
FPGA |███████████████ (3.8)
QPU(h) |███ (0.3)
Note: Values are illustrative to visualize trade-offs; always measure on your hardware.
9) Two tables you can paste into docs
A) Device inventory → backend handle

| Device | Backend handle (example) |
|---|---|
| CPU | cpu:0 |
| NVIDIA GPU | cuda:0 |
| AMD GPU | rocm:1 |
| FPGA | fpga:xcvu9p-0 |
| QPU (remote) | qpu:rigetti:Ankaa-9Q |
B) Scheduler decision hints

| Condition / workload | Hint |
|---|---|
| dense_linear_algebra | Prefer cuda:* (cuBLAS), then rocm:* |
| bit-exact streaming | Prefer fpga:* |
| combinatorial search | Prefer qpu:* (hybrid) with a cpu:* control loop |
| High queue depth or memory pressure on preferred device | Route to next-best backend |
10) Minimal pseudo-API (portable compute call)
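A minimal sketch, with hypothetical names throughout (LaniakeaRuntime, Job, submit): apps express intent (dense, bit-exact, hybrid-quantum) and await a job, while the runtime resolves a concrete device handle.

```python
# Portable compute call: intent in, awaitable job out (all names illustrative).
import asyncio
from dataclasses import dataclass

@dataclass
class Job:
    intent: str
    payload: dict
    handle: str = ""
    result: object = None

class LaniakeaRuntime:
    def __init__(self, devices):
        self.devices = devices                 # e.g. ["cpu:0", "cuda:0"]

    def _place(self, intent):
        order = {"dense": "cuda", "bit_exact": "fpga", "hybrid_quantum": "qpu"}
        want = order.get(intent, "cpu")
        return next((d for d in self.devices if d.startswith(want)),
                    self.devices[0])

    async def submit(self, intent, payload):
        job = Job(intent, payload, handle=self._place(intent))
        await asyncio.sleep(0)                 # stand-in for real dispatch
        job.result = f"ran {intent} on {job.handle}"
        return job

async def main():
    rt = LaniakeaRuntime(["cpu:0", "cuda:0", "fpga:xcvu9p-0"])
    job = await rt.submit("dense", {"op": "gemm", "m": 4096})
    print(job.result)                          # -> "ran dense on cuda:0"

asyncio.run(main())
```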
11) Practical guidance for Laniakea OS
Support CUDA directly where present; keep UDA-style path for everything else.
Unify telemetry early; scheduling wins come from shared metrics.
Abstract jobs (not devices): express intent (dense, sparse, bit-level, hybrid-quantum).
Prepare for QPUs via a clean async “submit/await” interface and hybrid loops.
Continuously benchmark; keep routing decisions empirical, not dogmatic (a tiny sketch follows this list).
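For instance, an empirical router can be as simple as timing the same job on each candidate backend and caching the winner; a rough sketch with illustrative names, where run_on() would wrap the real dispatch:

```python
# Empirical routing: measure, then pick the fastest backend for this job shape.
import time

def run_on(backend, job):
    """Placeholder for real dispatch; here it just burns proportional time."""
    time.sleep({"cpu:0": 0.03, "cuda:0": 0.005}.get(backend, 0.02))

def pick_fastest(backends, job, trials=3):
    timings = {}
    for b in backends:
        start = time.perf_counter()
        for _ in range(trials):
            run_on(b, job)
        timings[b] = (time.perf_counter() - start) / trials
    return min(timings, key=timings.get)

print(pick_fastest(["cpu:0", "cuda:0"], {"op": "gemm"}))   # -> "cuda:0"
```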
12) One-page glossary
CUDA: NVIDIA platform for general GPU compute.
UDA (concept): one API for many accelerators.
SYCL/oneAPI: portable C++ single-source model targeting multiple backends.
HIP/ROCm: AMD’s CUDA-like stack.
QPU: Quantum Processing Unit (gate-model/annealer).
Hybrid: classical control loop + quantum subroutines.
Telemetry: normalized device metrics for scheduling.