Remote: United States
We are seeking a highly skilled individual to join our team as an Nvidia DGX System Engineer. In this role, you will review the current hardware architecture and ensure that it conforms to Nvidia's best practices. You will also ensure that the current design does not limit future scale-out, and you will recommend architecture or design changes to maximize the system's utilization potential.
Responsibilities:
- Perform high-level architecture reviews to ensure conformity with Nvidia's best practices.
- Evaluate the current hardware architecture and identify any limitations for future scale-out.
- Provide recommendations for architecture or design changes to optimize the system's utilization potential.
- Verify the networking environment, including internalnet, externalnet, ipminet, and ibnet configurations across the networking fabric.
- Validate the storage environment, including NFS server environment configuration and NFS storage availability.
- Prepare DGX Systems for deployment by configuring PXE boot and BMC addresses, and setting up network boot for K8s nodes.
- Configure the BasePOD Cluster by downloading the current Base Command Manager version and creating bootable media or configuring BMC virtual media boot.
- Install the Head Node by booting into Bright Installer, configuring kernels, hardware, cluster settings, network topology, head node settings, compute node settings, BMC settings, networks (externalnet, internalnet, ipminet), head node interfaces, compute node interfaces, and disk layout.
- Configure cluster settings, including licensing, software image configuration and management, backup of software images, and creation of deployable images (default, DGX, K8).
- Add required kernels, create and assign nodes and categories, and validate that the nodes and categories are configured correctly.
- Configure network settings, including ibnet network addition and verification, setting IP for BMC interface, and verifying OS-visible interface.
- Configure physical interfaces if necessary, verify head node connectivity to ipminet, and reboot.
- Configure and assign disk layouts for node categories, configure node network interfaces, and configure Bright MAC address for PXE boot.
- Configure provisioning interfaces on DGX and K8s nodes, configure physical interfaces on all DGX nodes, and identify cluster nodes.
- Set provisioning MAC addresses, update software images for K8s masters and DGX systems, power on and provision cluster nodes, and deploy Docker, K8s, and Network Operator.
- Run a multi-node NCCL test, then configure Head Node HA: verify that cluster nodes are powered off before configuration, and verify the HA setup after configuration.
- Configure users, add them to the system and K8s per customer requirements, and validate user access.
- Perform system health checks on all nodes, including running NVSM Show Health, NVSM Stress Test, GPU-Burn Stress Test, GPU-Burn Tensor Core Stress Test, and Peer-to-Peer Test.
- Perform high-level system validation checks with the customer and ensure project close-out.
- Carry out fundamental testing to verify that the solution as implemented in the environment meets the criteria set by the customer.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field.
- Proven experience in high-level architecture review and hardware design.
- Strong understanding of networking and storage environments.
- Proficiency in configuring and deploying DGX Systems and K8s clusters.
- Solid knowledge of software image configuration and management.
- Experience with system health checks and stress testing.
- Excellent problem-solving and communication skills.
- Ability to work independently and collaborate effectively within a team.
If you are a detail-oriented individual with a strong background in high-level architecture review and a passion for cutting-edge technology, we would love to hear from you!