HPC System Engineer

Last updated 2024-08-31

About the Job

Homebrew is an AI R&D lab. We train our own models and are the creators and maintainers of popular open-source AI tools:

Jan: Desktop Copilot (>1 million downloads)
Cortex: Local, open-source alternative to OpenAI Platform
Menlo: GPU Training Cluster

We are a fully remote company. In the long term, our objective is to train useful, safe AI that helps improve humanity.

Job Description

We are seeking an experienced HPC Engineer to design, deploy, and maintain a high-performance computing (HPC) cluster for our AI training workloads. The successful candidate will be responsible for setting up a GPU-based training cluster together with our Research team and ensuring that it works well with our model training algorithms.

Key Responsibilities:

Design and deploy a GPU-based HPC cluster using industry-standard components (e.g., NVIDIA DGX/HGX or similar), including node-level design (e.g., NVLink, SXM).
Configure and optimize the cluster for high-performance computing, focusing on AI workloads (e.g., PyTorch, Torch, or similar).
Implement and manage cluster management software (e.g., Kubeflow, Slurm, or similar).
Design the cluster network for high bandwidth and low latency (InfiniBand, Ethernet RDMA, and/or RoCE), using scalable and efficient network topologies (Fat Tree, Dragonfly, and/or Torus).
Troubleshoot and resolve issues related to cluster performance, hardware failures, and software glitches.

Requirements

Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
Extensive experience in designing, assembling, and configuring high-performance computing systems
Proficient in selecting and integrating HPC hardware components, including CPUs, GPUs, memory, storage, and interconnects
Strong knowledge of HPC software stacks, including operating systems, drivers, and specialized applications
Experience in designing and operating AI training clusters, including the selection and integration of the necessary hardware and software components
Expertise in conducting comprehensive benchmarking tests and analyzing performance data
[Plus] Strong networking knowledge, including experience with high-speed interconnect technologies such as InfiniBand, RoCE, and RDMA
[Plus] Experience with setting up and managing NVIDIA multi-node training clusters for machine learning applications

Remote Arrangement

Fully remote interviews

4 rounds of fully remote interviews

Fully remote work

Fully remote, anywhere in Taiwan

Employee Benefits

Statutory benefits

Two-day weekend, pregnancy rest leave, marriage leave

Other benefits

We pay an “all-in” salary; you cover your own insurance and medical costs from that amount.
Fully remote working
14 days of leave (and unlimited sick days)
Annual equipment budget (once the 2-month probation has been completed)
Work on state-of-the-art technology in the LLM field

Salary Range

NT$ 1,800,000 - (annual salary)