Focused on the systems that make large-scale AI fast — GPU kernel optimization, LLM inference acceleration, and distributed training infrastructure.
about
I'm an engineer obsessed with making AI models run faster and cheaper at scale, working at the intersection of GPU programming, systems design, and ML infrastructure.
Currently diving deep into CUDA kernel optimization, memory hierarchy tuning, and LLM serving systems like vLLM and TensorRT-LLM. Actively targeting AI Infra / CUDA engineering roles.
When I'm not writing kernels, I'm grinding LeetCode or reading arXiv papers on distributed training and inference acceleration.
skills
projects
Hand-written CUDA kernels for common ML ops: matrix multiply, Flash Attention, softmax. Benchmarked against cuBLAS and profiled with Nsight.
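A rough sketch of the timing harness behind those comparisons. Shapes and iteration counts are illustrative; `torch.matmul` appears here because it dispatches to cuBLAS on GPU tensors, and a custom kernel loaded through a PyTorch extension drops into the same `bench` call.

```python
import torch

def bench(fn, warmup=10, iters=100):
    """Time a CUDA callable with CUDA events to avoid host-side timing skew."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters  # mean milliseconds per call

M = N = K = 4096  # illustrative problem size
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

ms = bench(lambda: torch.matmul(a, b))  # cuBLAS baseline
tflops = 2 * M * N * K / (ms * 1e-3) / 1e12
print(f"cuBLAS: {ms:.3f} ms/iter, {tflops:.1f} TFLOP/s")
```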
Systematic benchmarking of vLLM, TensorRT-LLM, and LMDeploy across batch sizes and architectures. Latency vs. throughput analysis.
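For a feel of the methodology, here is the offline-throughput side sketched with vLLM's `LLM` API; the model name, batch size, and decode length are placeholders, and the same loop sweeps batch sizes.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Explain KV-cache paging."] * 32  # batch size under test

t0 = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - t0

# Throughput = generated tokens / wall-clock time for the whole batch.
gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{gen_tokens / elapsed:.1f} tok/s generated, {elapsed:.2f} s total")
```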
Experiments with tensor / pipeline / data parallelism using PyTorch + DeepSpeed. Scaling-law analysis on small models.
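A minimal sketch of the data-parallel baseline in plain PyTorch DDP; the toy model and dummy loss are placeholders, and the script launches under torchrun.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_min.py
dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = torch.nn.Linear(1024, 1024).to(f"cuda:{rank}")  # toy stand-in model
ddp = DDP(model, device_ids=[rank])
opt = torch.optim.AdamW(ddp.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(32, 1024, device=f"cuda:{rank}")
    loss = ddp(x).square().mean()  # dummy loss
    loss.backward()  # DDP all-reduces gradients across ranks here
    opt.step()
    opt.zero_grad()

dist.destroy_process_group()
```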
mini game
contact
Open to AI Infra / CUDA engineering roles.
Let's talk about making models fast at scale.