Founding Engineer - ML Performance

Remote, USA · Full-time · Posted 2026-05-31

The problem we saw
Most AI infrastructure is built for batch: send a query, wait, get a response, reset. Powerful, but transactional. AI is becoming interactive — sessions that hold state, models that stay alive between turns, generation that responds as it runs — and the infrastructure to deliver that at scale doesn't really exist yet.
The bottleneck isn't the models anymore. It's the infrastructure underneath them.
What we're building to fix it
uRun is the inference cloud for interactive AI: the compute layer that makes real-time, stateful inference possible at scale. We came out of stealth in April 2026, are backed by top-tier investors, and are founded by Keegan McCallum, who scaled inference infrastructure for some of the most demanding generative AI workloads in production.
We're an infrastructure company. We build the layer that model labs, builders, and research teams ship on top of.
Where you come in
Performance is uRun's core differentiator. We're not chasing incremental gains — we're building infrastructure that runs 10–100x faster than the status quo. As our ML Performance Engineer, you will be the person who makes that true.
This is a founding technical hire. You will write custom CUDA kernels, push GPU utilization to its limits, and own inference latency end-to-end across the stack. You will work directly with the founding team on the hardest performance problems in production AI infrastructure — and your fingerprints will be on everything we ship.
What you'll actually be doing day-to-day
Write custom CUDA kernels that unlock performance headroom unavailable through off-the-shelf frameworks

Optimize model inference end-to-end, targeting sub-50ms latency across our inference platform

Drive 10x performance improvements across the stack: memory bandwidth, kernel fusion, operator scheduling, and beyond

Implement zero-copy distributed memory optimizations across multi-GPU and multi-node environments

Own GPU utilization and memory management, squeezing every available FLOP out of the hardware we run

Profile, benchmark, and instrument the full inference pipeline to find and eliminate bottlenecks systematically

Set the performance engineering bar for the team: define what fast looks like and build the tooling to measure it

What skills you need for the journey
Deep, hands-on CUDA expertise: you have written custom kernels in production, not just called into cuBLAS

Strong background in model inference and post-training optimization at scale

Fluency in GPU memory hierarchy, warp scheduling, kernel fusion, and hardware-aware algorithm design

Experience profiling and benchmarking complex inference pipelines: you know where the time goes and how to get it back

Able to operate at the frontier with minimal guidance — you identify the problem, design the approach, and ship the fix

Things that will give you an edge
Public work in GPU optimization or inference efficiency — open source contributions, a published paper, or a side project that shows your depth (vLLM, Flash-Attention, TensorRT-LLM, PyTorch, or equivalent)

Experience with hardware-aware optimization frameworks: CuTe, Triton, TileLang, or similar

Familiarity with distributed memory, communication libraries, and interconnects: NCCL, InfiniBand, NVLink, RoCE

Contributions to or deep familiarity with PyTorch Distributed, Ray Core, or similar systems

Experience optimizing for video generation or other high-throughput, latency-sensitive generative workloads

Prior work at an inference-focused company or research lab pushing the boundary of what GPU hardware can do

What you'll get in return
Competitive salary and meaningful equity in an early-stage AI infrastructure company. The band above is our target; for an exceptional candidate we'll go higher. Equity is real — you're early, and the grant reflects that.
Health, dental, and vision — full coverage

401(k) — company-supported retirement savings

FSA/HSA — flexible spending and health savings accounts for healthcare costs

Paid time off — we trust you to manage your time

Top-tier tooling — access to the best AI tools available: Claude, Codex, Kimi, and whatever else helps you move faster

MacBook Pro and AirPods — the hardware you need, on us

How we work (and what that feels like day-to-day)
We build the stage, not the show. We're an infrastructure company, a developer-tools company, and a production partner for model labs. That focus is a deliberate choice, and we hold to it.
Day-to-day, that means a small team, a high bar, and real ownership. You won't wait for permission or inherit a backlog of someone else's decisions. In a founding performance role, the function is what you make it.
It also means ambiguity: priorities shift, not everything is documented, and you'll often be the person who decides what "fast enough, for now" means. That suits some people and not others, and we'd rather you know that before you apply.