GPU Acceleration
High-performance GPU computing features in XDL.
Overview
XDL’s AMP (Accelerated Math Processing) module provides GPU acceleration across 11 backends:
Apple Platforms (macOS/iOS)
- Metal - Native Apple GPU compute (default on macOS)
- MPS - Metal Performance Shaders (optimized operations)
- CoreML - Apple Neural Engine acceleration
NVIDIA Platforms
- CUDA - NVIDIA GPUs (best performance on NVIDIA hardware)
- cuDNN - Deep learning acceleration
AMD Platforms
- ROCm - AMD GPUs (optimized for AMD hardware)
Windows
- DirectML - ML acceleration on DirectX
- DirectX 12 - GPU compute via DirectML delegation
Cross-Platform
- Vulkan - Modern cross-platform GPU compute
- OpenCL - Universal GPU fallback (AMD, Intel, NVIDIA)
- CPU (SIMD) - Fallback for systems without GPU support
Key Features
Automatic Acceleration
GPU acceleration is transparent - existing code runs faster without changes:
```xdl
; These operations are automatically GPU-accelerated
a = findgen(10000000)
b = findgen(10000000)
c = a + b      ; GPU vector addition
d = sin(a)     ; GPU trigonometric functions
e = a * b + c  ; GPU compound expressions
```
Performance Gains
Typical speedups on GPU vs CPU:
| Operation | Array Size | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|
| Vector Add | 10M | 45ms | 2ms | 22.5x |
| Sin | 10M | 120ms | 5ms | 24x |
| Matrix Multiply | 4096x4096 | 850ms | 12ms | 70.8x |
| FFT | 1M | 200ms | 8ms | 25x |
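Each speedup figure is the CPU time divided by the GPU time. As a quick check, the ratios can be recomputed from the timings in the table. This is a minimal sketch; `speedup` is a hypothetical helper name, not part of XDL.

```shell
# speedup CPU_MS GPU_MS: print the CPU/GPU time ratio with one decimal place.
speedup() {
    awk -v cpu="$1" -v gpu="$2" 'BEGIN { printf "%.1fx\n", cpu / gpu }'
}

speedup 45 2      # Vector Add
speedup 120 5     # Sin
speedup 850 12    # Matrix Multiply
speedup 200 8     # FFT
```

The same helper can be applied to timings measured on your own hardware, which may differ considerably from the figures above.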
Documentation
- GPU Compute Implementation - Technical overview
- Performance Impact Analysis - Benchmarks
- AMP Multi-Backend - Backend configuration
- GPU Demo Guide - Examples and tutorials
Supported Operations
Vector Operations
- Addition, subtraction, multiplication, division
- Element-wise operations
- Vector reductions (sum, min, max, mean)
Mathematical Functions
- Trigonometric: sin, cos, tan, asin, acos, atan
- Exponential, logarithmic, and power: exp, log, log10, sqrt, pow
- Hyperbolic: sinh, cosh, tanh
Matrix Operations
- Matrix multiplication
- Matrix transpose
- Matrix inversion
- Eigenvalue decomposition
Advanced Operations
- FFT (Fast Fourier Transform)
- Convolution
- Correlation
- Image processing
Backend Selection
The GPU backend is selected automatically based on the available hardware. To list the available backends or override the choice:

```shell
# Check available GPU backends
xdl --features

# Force a specific backend
XDL_GPU_BACKEND=metal xdl script.xdl     # macOS
XDL_GPU_BACKEND=cuda xdl script.xdl      # NVIDIA
XDL_GPU_BACKEND=rocm xdl script.xdl      # AMD
XDL_GPU_BACKEND=vulkan xdl script.xdl    # Cross-platform
XDL_GPU_BACKEND=opencl xdl script.xdl    # Universal
XDL_GPU_BACKEND=directml xdl script.xdl  # Windows
XDL_GPU_BACKEND=cpu xdl script.xdl       # CPU fallback
```
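When the same script has to run on several machines, a wrapper can set `XDL_GPU_BACKEND` once instead of hard-coding it per invocation. The sketch below suggests a backend from the operating system name. The `pick_backend` helper and its OS-to-backend mapping are illustrative assumptions and do not describe XDL's own auto-selection logic.

```shell
# pick_backend OS_NAME: suggest an XDL_GPU_BACKEND value for an OS name
# as reported by `uname -s`. The mapping is an assumption for illustration.
pick_backend() {
    case "$1" in
        Darwin)                echo metal ;;
        Linux)                 echo vulkan ;;
        MINGW*|MSYS*|CYGWIN*)  echo directml ;;
        *)                     echo cpu ;;
    esac
}

# Keep an explicit user choice; otherwise use the suggestion.
XDL_GPU_BACKEND="${XDL_GPU_BACKEND:-$(pick_backend "$(uname -s)")}"
export XDL_GPU_BACKEND
echo "Using backend: $XDL_GPU_BACKEND"
# xdl script.xdl   # uncomment to run a script with the chosen backend
```

Because an existing `XDL_GPU_BACKEND` value is kept, the per-command overrides shown above continue to work unchanged.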
Build with Specific Features
```shell
# OpenCL support
cargo build --features opencl

# CUDA support
cargo build --features cuda

# DirectML support (Windows)
cargo build --features directml

# All backends
cargo build --features all-backends
```
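Cargo accepts several features at once as a comma-separated list, so a build with a subset of backends does not need `all-backends`. The sketch below only prints the command. `build_cmd` is a hypothetical helper, and whether a given combination of XDL features builds together is an assumption to verify against the project's Cargo.toml.

```shell
# build_cmd FEATURE...: print a cargo command enabling the given features.
build_cmd() {
    if [ "$#" -eq 0 ]; then
        echo "cargo build"
        return
    fi
    # Join the arguments with commas, the list form cargo accepts.
    features=$(IFS=,; echo "$*")
    echo "cargo build --features $features"
}

build_cmd opencl cuda    # prints: cargo build --features opencl,cuda
```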
Profiling
Set the XDL_GPU_PROFILE environment variable to enable GPU profiling:

```shell
# Enable profiling
XDL_GPU_PROFILE=1 xdl script.xdl

# Detailed profiling
XDL_GPU_PROFILE=verbose xdl script.xdl
```
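Profiling is most informative when the same script is run on more than one backend, for example the CPU fallback and a GPU backend. The sketch below runs a script once per backend and saves each run's output to a log file. `profile_backends` and the `XDL_BIN` override are hypothetical names. Since this page does not say where profiling output is written, the sketch captures both stdout and stderr.

```shell
# profile_backends SCRIPT BACKEND...: run SCRIPT once per backend with
# profiling enabled, saving each run's output to profile-<backend>.log.
# XDL_BIN is a hypothetical override for the interpreter path.
profile_backends() {
    script="$1"
    shift
    for backend in "$@"; do
        XDL_GPU_BACKEND="$backend" XDL_GPU_PROFILE=1 \
            "${XDL_BIN:-xdl}" "$script" > "profile-$backend.log" 2>&1
        echo "$backend: exit status $?"
    done
}

# Example (requires xdl on PATH):
# profile_backends script.xdl cpu metal
```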
Memory Management
XDL automatically manages GPU memory:
- Automatic transfer - Data moved to/from GPU as needed
- Memory pooling - Efficient reuse of GPU memory
- Spill to CPU - Graceful handling of large datasets
Limitations
Current limitations:
- Maximum array size: 2 GB per array
- Some operations are not GPU-accelerated and fall back to the CPU
- Multi-GPU support is still in development
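To judge whether an array stays within the per-array size limit, multiply the element count by the element size. The sketch below makes two assumptions: that "2GB" means 2 GiB (2^31 bytes), and that `findgen` produces 4-byte single-precision floats. `fits_on_gpu` and `max_elements` are hypothetical helpers.

```shell
# Assumption: the per-array limit is 2 GiB (2^31 bytes).
MAX_BYTES=2147483648

# fits_on_gpu ELEMENTS BYTES_PER_ELEMENT: succeed if the array is within the limit.
fits_on_gpu() {
    [ $(( $1 * $2 )) -le "$MAX_BYTES" ]
}

# max_elements BYTES_PER_ELEMENT: largest element count within the limit.
max_elements() {
    echo $(( MAX_BYTES / $1 ))
}

fits_on_gpu 10000000 4 && echo "10M floats (40 MB): fits"
fits_on_gpu 1000000000 4 || echo "1G floats (4 GB): exceeds the limit"
max_elements 4    # prints: 536870912
```

Under these assumptions, the 10M-element arrays used in the examples on this page occupy about 40 MB each, well below the limit.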
Next Steps
- Quick Start - Get started with GPU
- Performance Guide - Optimization tips
- Technical Details - Implementation details
- GPU Demo Guide - Examples and tutorials