Focused on the systems that make large-scale AI fast — GPU kernel optimization, LLM inference acceleration, and distributed training infrastructure.
about
I'm an engineer obsessed with making AI models run faster and cheaper at scale, working at the intersection of GPU programming, systems design, and ML infrastructure.
Currently diving deep into CUDA kernel optimization, memory hierarchy tuning, and LLM serving systems like vLLM and TensorRT-LLM. Actively targeting AI Infra / CUDA engineering roles.
When I'm not writing kernels, I'm grinding LeetCode or reading arXiv papers on distributed training and inference acceleration.
skills
projects
Hand-written CUDA kernels for common ML ops: matrix multiply, Flash Attention, softmax. Benchmarked against cuBLAS and profiled with Nsight.
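A rough sketch of the timing harness behind those comparisons. Shapes and iteration counts are illustrative; `torch.matmul` appears here because it dispatches to cuBLAS on GPU tensors, and a custom kernel loaded through a PyTorch extension drops into the same `bench` call.

```python
import torch

def bench(fn, warmup=10, iters=100):
    """Time a CUDA callable with CUDA events to avoid host-side timing skew."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters  # mean milliseconds per call

M = N = K = 4096  # illustrative problem size
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

ms = bench(lambda: torch.matmul(a, b))  # cuBLAS baseline
tflops = 2 * M * N * K / (ms * 1e-3) / 1e12
print(f"cuBLAS: {ms:.3f} ms/iter, {tflops:.1f} TFLOP/s")
```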
Systematic benchmarking of vLLM, TensorRT-LLM, and LMDeploy across batch sizes and architectures. Latency vs. throughput analysis.
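For a feel of the methodology, here is the offline-throughput side sketched with vLLM's `LLM` API; the model name, batch size, and decode length are placeholders, and the same loop sweeps batch sizes.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Explain KV-cache paging."] * 32  # batch size under test

t0 = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - t0

# Throughput = generated tokens / wall-clock time for the whole batch.
gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{gen_tokens / elapsed:.1f} tok/s generated, {elapsed:.2f} s total")
```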
Experiments with tensor / pipeline / data parallelism using PyTorch + DeepSpeed. Scaling-law analysis on small models.
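A minimal sketch of the data-parallel baseline in plain PyTorch DDP; the toy model and dummy loss are placeholders, and the script launches under torchrun.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_min.py
dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = torch.nn.Linear(1024, 1024).to(f"cuda:{rank}")  # toy stand-in model
ddp = DDP(model, device_ids=[rank])
opt = torch.optim.AdamW(ddp.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(32, 1024, device=f"cuda:{rank}")
    loss = ddp(x).square().mean()  # dummy loss
    loss.backward()  # DDP all-reduces gradients across ranks here
    opt.step()
    opt.zero_grad()

dist.destroy_process_group()
```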
mini game
contact
Open to AI Infra / CUDA engineering roles.
Let's talk about making models fast at scale.