(Senior) SRE & Infra Engineer Job at Prime Intellect, Remote

  • Prime Intellect
  • Remote

Job Description

Building the Future of Decentralized AI Development

At Prime Intellect, we're building the foundation for decentralized AI development at scale. Our platform combines powerful distributed training infrastructure with an intuitive developer experience, enabling researchers and engineers to train state-of-the-art models collaboratively.

We recently raised $15 million in funding ($20 million raised in total) led by Founders Fund, with participation from Menlo Ventures and prominent angels including Andrej Karpathy (Eureka AI, Tesla, OpenAI), Tri Dao (Chief Scientific Officer of Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Hugging Face), Emad Mostaque (Stability AI) and many others.

Role Impact

This hybrid role spans platform reliability and infrastructure engineering. You'll be instrumental in:

  • Infrastructure Reliability: Ensuring high availability, fault tolerance, and performance across internal research and external customers’ GPU cluster environments.

  • Cluster Onboarding & Support: Automating GPU cluster onboarding, handling support requests, and troubleshooting operational challenges.

  • Observability, Security & Feature Development: Enhancing monitoring, logging, and security systems, and developing new backend features to boost platform functionality.

Core Technical Responsibilities

Operational Excellence & Support

  • Cluster Onboarding: Develop and automate procedures to integrate internal research clusters and external customer deployments.

  • Incident Management: Lead efforts in incident detection, response, and postmortem analysis to drive continuous improvement.

  • Support Engineering: Address platform support requests by diagnosing and resolving reliability issues promptly.

Infrastructure Automation & Reliability

  • Monitoring & Observability: Design and implement comprehensive observability solutions using tools like Prometheus and Grafana, ensuring proactive detection of issues.

  • Automation & Orchestration: Utilize tools such as Ansible, Terraform, and Kubernetes to streamline infrastructure management and automation.

Backend & Feature Development

  • New Feature Engineering: Collaborate with the engineering team to design and implement backend features.

  • API and Service Development: Enhance our platform’s REST APIs and backend services to support new capabilities and improve overall performance.

  • System Integration: Ensure seamless integration of new features into our existing infrastructure, maintaining high reliability and security standards.

Technical Requirements

Reliability & SRE Skills

  • Incident & Monitoring Expertise: Proven experience with monitoring tools (e.g., Prometheus, Grafana) and incident management practices.

  • Automation Proficiency: Strong skills in infrastructure automation with Ansible, Terraform, or similar.

  • Observability & Logging: Deep understanding of logging frameworks, alerting systems, and proactive monitoring solutions.

Development & Infrastructure Skills

  • Backend Engineering: Proficiency in Python for developing automation scripts, REST APIs, and backend support tools.

  • Container & Cloud Technologies: Hands-on experience with Kubernetes and cloud platforms (GCP preferred).

Nice to Have

  • Familiarity with GPU computing and AI/ML training infrastructure.

  • Experience contributing to open-source infrastructure projects.

  • Knowledge of high-performance networking and real-time systems.

What We Offer

  • Competitive compensation with significant equity and token incentives

  • Flexible work arrangement (remote or San Francisco office)

  • Full visa sponsorship and relocation support

  • Professional development budget for courses and conferences

  • Regular team off-sites and conference attendance

  • Opportunity to shape the future of decentralized AI development

Growth Opportunity

You'll join a team of experienced engineers and researchers working on cutting-edge problems in AI infrastructure. We believe in open development and encourage team members to contribute to the broader AI community through research and open-source contributions.

We value potential over perfection. If you're passionate about democratizing AI development and have experience in either platform or infrastructure development (ideally both), we want to talk to you.

Ready to help shape the future of AI? Apply now and join us in our mission to make powerful AI models accessible to everyone.
