HPC Engineer, Machine Learning Infrastructure – EMEA Remote at Hugging Face #vacancy #remote

About the Company: At Hugging Face, we are dedicated to advancing Machine Learning and making it more accessible. We have developed the fastest-growing open-source library of pre-trained models globally, with over 1 Million+ models and 320K+ stars on GitHub. Our technology is utilized by more than 15,000 companies, including prominent organizations like Google, Elastic, Salesforce, Grammarly, and NASA.

About the Role: We are seeking a HPC Engineer to play a key role in developing and scaling our distributed large cluster. The ideal candidate will have expertise in provisioning large compute clusters for AI workflows, implementing best practices for reliability and scalability, and supporting teams effectively.

Responsibilities:

Design, deploy, and maintain reliable and scalable infrastructure for efficient training workloads.

Manage large compute clusters for AI Training and development.

Create tooling and infrastructure to streamline compute and storage in ML workflows.

Measure and optimize system performance.

Monitor and troubleshoot infrastructure issues to ensure high availability and performance.

Stay updated on the latest AI infrastructure technologies and suggest improvements for enhanced system efficiency.

Collaborate with AI software engineering teams to meet infrastructure requirements.

Provide operational support and engineering for multiple teams.

Qualifications: The ideal candidate should have:

7+ years of experience in DevOps or infrastructure engineering roles.

Experience in building machine learning infrastructure and working with large GPU clusters.

Knowledge of cloud providers such as AWS, GCP, infra-as-code frameworks, and observability tools.

Familiarity with Python Scientific stack, Pytorch.

Experience with data structures, data modeling, and database management.

Strong communication, collaboration, and documentation skills.

Proficiency in Linux, Git, containers, networking, and command line tools.

Strong programming skills in Python, Golang, and/or Rust.

About You: If you are a passionate HPC Engineer with a keen interest in AI, we invite you to join our team of talented professionals. Your contributions will play a vital role in advancing AI technologies within a collaborative and stimulating environment.

More about Hugging Face: At Hugging Face, we are committed to fostering a culture of diversity, equity, and inclusivity. We value each individual’s background and strive to create a workplace where everyone feels respected and supported. We are an equal opportunity employer that promotes continuous growth and development among employees.

We offer various benefits, including flexible working hours, health, dental, and vision benefits, parental leave, and support for employee well-being. Additionally, all employees receive company equity as part of their compensation, aligning everyone’s success with the company’s achievements.

Join our community and be part of the collaboration that drives major scientific advancements in the field of ML/AI.

#J-18808-Ljbffr

DevOps Git Go Google Cloud Platform (GCP) Machine Learning infrastructure Python Amazon Web Services (AWS) Linux PyTorch Rust

Leave a Reply