MIG Wisdom

[Wisdom dates from 2021-06-25]

We have access to at least one A100 in coba2000.chtc.wisc.edu, although that machine is frequently busy.

The machine gpulab2004.chtc.wisc.edu has four A100s.

For now, to get an A100 from AWS, you have to rent eight of them with the p4d.24xlarge instance type. Use the "Deep Learning Base AMI (Amazon Linux 2)" to avoid having to deal with drivers and the like.

Once the instance is up, add the EPEL and HTCondor repositories, then install the HTCondor build dependencies:

sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install https://research.cs.wisc.edu/htcondor/repo/current/htcondor-release-current.amzn2.noarch.rpm
sudo yum-builddep condor

Then set up your HTCondor build tree in the usual way. (Don't forget the -j BIGNUM; this instance type has a lot of cores.)
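One way to pick BIGNUM is to ask the machine how many cores it has instead of guessing. The sketch below assumes the build is driven by make; the variable name njobs is made up for this example, and the script only prints the command it would suggest.

```shell
#!/bin/sh
# Count the online cores. nproc is part of GNU coreutils; fall back to
# getconf if it is not installed.
njobs="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"

# Print the suggested invocation rather than running it, since the build
# directory and targets depend on how your tree is set up.
echo "make -j ${njobs}"
```

If the build competes with other work on the instance, a smaller value than the full core count may be kinder.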

NVIDIA has a MIG user guide. Of particular note:

  • sudo nvidia-smi -i INDEX -mig 1 enables MIG on the GPU at INDEX but does not create any GPU instances. Doing this step but not the next one produces the (mis)configuration relevant to HTCONDOR-476.
  • sudo nvidia-smi mig -i 1 -cgi 19,19,19,19,19,19,19 -C creates a 7-way split of the A100 at index 1. Profile 19 is the smallest GPU instance profile, and -C also creates the matching compute instances.
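The two steps above can be combined into a small script, sketched below. The names GPU_INDEX, DRY_RUN, run and mig_seven_way are invented for this example. By default the script only prints the commands, so it is safe to try on a machine without a GPU. The final nvidia-smi -L lists the GPUs and any MIG devices, which is a quick way to check that the split took effect.

```shell
#!/bin/sh
# Sketch: enable MIG on one GPU and split it seven ways.
# GPU_INDEX and DRY_RUN are hypothetical knobs for this example only.
GPU_INDEX="${GPU_INDEX:-0}"
DRY_RUN="${DRY_RUN:-1}"

# Either print the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

mig_seven_way() {
    # Step 1: enable MIG mode; this alone creates no GPU instances.
    run sudo nvidia-smi -i "$1" -mig 1
    # Step 2: create seven GPU instances of profile 19, plus compute instances.
    run sudo nvidia-smi mig -i "$1" -cgi 19,19,19,19,19,19,19 -C
    # Step 3: list GPUs and MIG devices to confirm the result.
    run nvidia-smi -L
}

mig_seven_way "$GPU_INDEX"
```

To actually apply the changes, run it with DRY_RUN=0 and GPU_INDEX set to the GPU you want to split. Stopping after step 1 reproduces the MIG-enabled-but-empty state described above.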