Mirrai Careers

Member of Technical Staff - Infrastructure

gimlet

San Francisco | Full-time | Posted 10d ago
About Us

Gimlet is building the next generation of AI infrastructure: large-scale AI datacenters and the orchestration platform that coordinates them. The future of AI will require vastly more compute than exists today. But as AI workloads become more complex and new hardware architectures emerge, simply deploying more GPUs isn't enough. The challenge is making increasingly diverse compute work together.

Gimlet's platform intelligently partitions and routes workloads across heterogeneous hardware, enabling step-function improvements in performance and efficiency. Customers deploy through production-grade APIs without needing to think about hardware selection, placement, or optimization. We work with foundation labs, hyperscalers, and AI-native companies to power production workloads at massive scale and help define the infrastructure layer for the future of AI.

ABOUT THIS ROLE

We are looking for an Infrastructure Platform Engineer to design, build, and operate the cluster infrastructure behind Gimlet’s heterogeneous inference cloud. Unlike traditional cloud platforms built around a single hardware ecosystem, Gimlet's infrastructure spans multiple accelerator vendors and architectures. Infrastructure engineers play a key role in bringing new hardware platforms online, building the operational abstractions that make heterogeneous infrastructure manageable at scale, and ensuring new silicon can serve production workloads reliably from day one.

This role is highly hands-on. You will work across bare metal, Linux, Kubernetes or cluster schedulers, high-speed networking, observability, provisioning, and incident response. You will partner closely with distributed systems, runtime, compiler, and hardware teams to ensure Gimlet’s infrastructure can support demanding AI workloads at production scale.

WHAT YOU WILL WORK ON

* Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
* Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
* Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
* Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
* Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
* Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
* Build observability for cluster health, capacity, performance, failures, and workload behavior.
* Improve reliability, availability, and recovery across multi-node production systems.
* Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
* Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
* Create runbooks, operational standards, and incident response practices as the fleet scales.

YOU MAY BE A GOOD FIT IF

* Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems.
* Deep Linux systems experience, including debugging performance, networking, storage, processes, and kernel-level issues.
* Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems.
* Strong automation skills using tools such as Terraform, Ansible, Helm, Python, Go, or equivalent.
* Experience with GPU or accelerator infrastructure, including drivers, firmware, CUDA/ROCm stacks, or hardware validation.
* Familiarity with high-performance networking such as InfiniBand, RoCE, high-speed Ethernet, or datacenter fabrics.
* Strong operational judgment: you know how to build systems that are observable, recoverable, and boring in production.
* Comfort working in a fast-moving startup environment with high ownership and ambiguity.

STRONG CANDIDATES MAY ALSO HAVE

* Experience building or operating AI inference, training, HPC, or neocloud infrastructure.
* Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up.
* Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting.
* Experience debugging distributed workload performance across compute, memory, network, and storage bottlenecks.
* Experience building observability platforms using technologies such as Prometheus, OpenTelemetry, Grafana, or similar tooling.
* Familiarity with heterogeneous hardware environments across NVIDIA, AMD, Intel, ARM, or emerging accelerators.

Company

MIRRAI CHAT LTD (Company No. 16403306)

71-75 Shelton Street, Covent Garden

London, WC2H 9JQ, UNITED KINGDOM

© 2026 Mirrai Careers. All rights reserved.