Lead Platform Engineer (HPC & Stateless Linux)

PFX
Remote United States Full-time 🌐 English
Experience: Senior
Added to JobCollate: March 30, 2026

AI Summary Powered by Gemini

This role is for a Lead Platform Engineer who will design and deploy a new on-premises stateless Linux cluster for rendering and production workloads. Key requirements include expert Linux administration, hands-on experience with stateless deployments, virtualization, and containers, scripting proficiency, and a lead mindset. The role involves building the foundational layer of a modern infrastructure serving the company's European branches.

Job Description

The Project

We are implementing a new on-premises Linux cluster to support rendering and production workloads across our European branches. We are transitioning our infrastructure to a modern, stateless architecture and need a technical lead to design and deploy the foundational layer.

We are looking for an experienced Lead Platform Engineer (contract basis) to join us. We are also open to highly skilled students or recent graduates (e.g., from specialized IT/HPC labs) who can demonstrate the technical mastery required for this build.

Scope of Work

You will work closely with our R&D and IT teams to build the "lowest layer" of the infrastructure. Key responsibilities include:

Stateless Cluster Deployment: Implementing an on-premises cluster using technologies like Warewulf.
Workload Scheduling: Deploying and configuring SLURM as the primary scheduler.
Virtualization & Containers: Managing the environment through Proxmox and designing container images via Singularity / Apptainer.
System Tooling: Implementing Icinga for monitoring and building a custom Conda repository for reproducible deployment.
Network & Automation: Collaborating on network architecture and supporting CI workflows via GitLab CI.

Requirements

Must-Have (The Core):

Linux Mastery: Expert-level Linux system administration (Red Hat/Rocky Linux preferred).
Infrastructure Architecture: Proven experience building or operating large-scale compute environments (HPC, large-scale K8s, or distributed systems).
Stateless & Virtualization: Hands-on experience with stateless deployments, Proxmox/KVM, and container technologies.
Scripting: Proficiency in Python or Bash for complex system automation.
Lead Mindset: The ability to own the "Foundational Layer" of a project and make high-level architectural decisions.

Nice to Have (The "Flavor"):

Specific HPC Tools: Prior experience with SLURM, Warewulf, xCAT, or similar provisioning/scheduling tools. (If you are a Linux expert, we expect you can catch up on these specific tools within your first week.)
Infrastructure-as-Code: Experience with Ansible, Terraform, or Puppet.
Specialized Background: Experience in research computing, AI infrastructure, or advanced university/HPC labs.

Working Setup

Type: Part-time Contractor (B2B/Freelance).
Duration: Initial project phase estimated at 4–8 weeks.
Occupancy: Consistent availability during this project window.
Location: Remote or On-Site at any of our European branches.
Time Zone: Flexible, with coordination during European working hours.

Originally posted on Himalayas

Required Skills

HPC-Engineering Linux-Engineering Platform-Engineering DevOps Site-Reliability-Engineering