About Us The last era of AI scaled on a single bet: bigger models, more identical chips, more data. As problems grow more complex and the requirements of intelligence more diverse, that bet is breaking down. Real-world problems are heterogeneous: no single model or chip can solve them alone. The next era of AI requires heterogeneity at the infrastructure level - diverse models on diverse chips, each with distinct strengths, co-evolving into systems of capability that move the Pareto frontier of what is possible. That's what we are building. Callosum is the Intelligent Systems Company. We started from questioning what actually creates intelligence. We believe there is no single answer, but rather a system-level solution. We co-evolve models, workflows, and silicon together to show that intelligence does not come from a single component, but it emerges from the diversity of co-optimised mechanisms working together and aware of each other. Heterogeneity will define the next era of compute, and is a principle that holds in biological, neuronal, and economic systems alike. In early 2026 we launched with results showing orders of magnitude improvements in performance, and this is only the beginning. Agentic AI is the future of how intelligence is deployed: multi-step, long-horizon, and operating in changing environments. These systems are inherently heterogeneous, and can only be as powerful as the infrastructure that runs them. We are engineers and scientists based in London, working together across the full depth of the stack. We are curious, intellectually honest, and building what doesn't exist yet. If you thrive on uncharted territory and are energised by the scale of the challenge, we'd love to hear from you. About the Role Callosum believes that orders of magnitude improvements in AI systems will come through application-aware orchestration across heterogeneous hardware. We are building that vision: infrastructure that treats the full landscape of compute as a unified, co-evolving system, evolved beyond GPUs. Inference engines were designed for single-model inference on homogeneous GPU clusters - this role builds them beyond that. Working directly on systems like vLLM and SGLang, you will adapt and extend them for heterogeneous resources, making them hardware-aware, with deeper optimisation around scheduling, memory, and execution. The execution strategies you design - parallelism, disaggregation, caching - will define what heterogeneous inference looks like at production scale. Your work ensures that the capabilities exposed by the lower layers of the stack translate into real, measurable gains, the new standard for how inference runs on diverse hardware. What You'll Build * Contribute upstream to SGLang and vLLM, and maintain internal forks where our requirements diverge * Improve hardware-awareness within inference engines so that scheduling, memory management, and execution adapt to the capabilities of the underlying accelerator * Design and implement bespoke parallelism and disaggregation strategies that go beyond default configurations to better exploit heterogeneous hardware * Work closely with an Accelerator Systems Software engineer to ensure engine-level abstractions map cleanly onto diverse hardware capabilities What Sets You Apart * Deep familiarity with the internals of SGLang, vLLM, or comparable inference serving frameworks - scheduler design, memory management, and execution pipelines * Strong background in high-performance Python and C++/CUDA systems, particularly in the context of ML inference * Experience designing or implementing parallelism strategies for large model serving * Understanding of disaggregated serving architectures and the tradeoffs involved in separating modules of a workflow * Demonstrable record of working effectively in fast-moving open source codebases with evolving APIs and design conventions What We Offer * Competitive Salary, determined by skills and experience * Equity & Ownership * Private healthcare * We offer Visa sponsorship and relocation benefits to hire the best in the world * We work in person at our London office. You'll have the tools, space and setup to do your best work, and if you have specific needs, just tell us We're committed to building an inclusive workplace where everyone feels welcome, and believe in equal opportunities for all.

Inference Engine Development - Member of Technical Staff

Similar jobs