About the job:

Millions of individuals worldwide turn to our platform every day to discover new ideas and envision new possibilities. Our mission is to help these individuals find inspiration and build a life they cherish. As a Staff Software Engineer on our ML Training Platform team, you'll play a pivotal role in advancing our mission and driving Pinterest forward. You'll have the opportunity to grow both personally and professionally while contributing to a positive online environment.

Our ML Training Infrastructure team develops foundational tools and infrastructure used across Pinterest, supporting ML applications such as recommendations, ads, and visual search. We focus on the robustness and efficiency of ML systems, which are essential for accelerating model development and deployment.

What you’ll do:

  • Design and implement scalable solutions to enhance ML training and inference capabilities using platforms like Kubernetes.
  • Lead critical initiatives such as GPU sharing, intelligent resource management, and fault-tolerant training methods.
  • Define and execute the technical strategy and roadmap for ML Training Infrastructure, encompassing key frameworks like PyTorch, Ray, and Jupyter.
  • Collaborate closely with internal stakeholders, including ML engineers and data scientists, to address development challenges and facilitate customer use cases.
  • Build strong partnerships with leaders across Data and Infrastructure teams to drive comprehensive technical initiatives.
  • Mentor team members and provide technical leadership within the ML Platform group.

What we’re looking for:

  • 7+ years of experience in software engineering with a focus on ML infrastructure or similar batch compute environments.
  • Proven track record of technical leadership, including devising long-term strategies and successfully executing them.
  • Deep understanding of High Performance Computing and parallel computing principles.
  • Ability to manage cross-functional projects and understand the needs of internal customers, particularly ML practitioners and data scientists.
  • Proficiency in Python; experience with languages such as C++ and Java is advantageous.
  • Knowledge of GPU programming, containerization, and orchestration technologies is desirable.
  • Experience with cloud data processing technologies (e.g., Apache Spark, Ray, Dask, Flink) and ML frameworks like PyTorch is a plus.

Note: This position does not offer relocation assistance.
