Homebrew is an AI R&D lab. We train our own models and create and maintain popular open-source AI tools:
- Jan: Desktop Copilot (>1 million downloads)
- Cortex: Local, open-source alternative to OpenAI Platform
- Menlo: GPU Training Cluster
We are a fully remote company. In the long term, our objective is to train useful, safe AI that helps improve humanity.
Job Description
We are seeking an experienced HPC Engineer to design, deploy, and maintain a high-performance computing (HPC) cluster for our AI training workloads. The successful candidate will set up a GPU-based training cluster together with our Research team and ensure that it works well with our model training algorithms.
Key Responsibilities:
- Design and deploy a GPU-based HPC cluster using industry-standard components (e.g., NVIDIA DGX/HGX or similar), including node design (e.g., NVLink, SXM).
- Configure and optimize the cluster for high-performance computing, focusing on AI workloads (e.g., PyTorch, Torch, or similar).
- Deploy and manage cluster management software (e.g., Kubeflow, Slurm, or similar).
- Design the cluster network for high bandwidth and low latency (InfiniBand, Ethernet RDMA, and/or RoCE), using scalable and efficient network topologies (Fat Tree, Dragonfly, and/or Torus).
- Troubleshoot and resolve issues related to cluster performance, hardware failures, and software glitches.