GPU mode & concurrency
By default, Atlas runs your jobs without GPUs. To enable GPU support, start Atlas by running atlas-server start -g. This allows Atlas to use all CUDA-enabled GPUs on your system.
What does this mean?
Running Atlas with GPU support provides two main benefits:
1: Jobs have access to the GPUs on your machines or instances
2: By default, Atlas starts as many workers as there are GPUs available to it, allowing you to run multiple jobs concurrently
Important Usage Information
This can be very exciting, especially if you have a really powerful machine that you want to use to make the world's next AlexNet, but with great power comes great responsibility! Below are some gotchas to watch for:
- Although Atlas makes sure that a job given 1 GPU can only access that GPU, it does not track RAM or CPU usage, so concurrent jobs can clash on those resources if you are not careful
- This feature doesn't magically make your code use all available GPUs; it only makes those GPUs available for your code to use
How to launch jobs that use GPUs
We have given Atlas access to GPUs, but simply running foundations submit scheduler . main.py still does not give the job access to them. To request GPUs for a job, use the special argument --num-gpus #, where # is the number of GPUs the job needs.
If you are using the Python SDK, you can pass num_gpus=# to the foundations.submit() call.
Queue Priority
The scheduler uses a simple first-in, first-out process for allocating jobs: it only ever considers the next job in the queue.
This means that if Atlas has access to 4 GPUs, with 4 workers, and your queue looks like:
- Job1(num_gpus=2), Job2(num_gpus=2), Job3(num_gpus=3), Job4(num_gpus=3), Job5(num_gpus=1)
Job1 and Job2 will run, each taking 2 GPUs. Job3 will hold up the rest of the queue since there are not 3 GPUs available.
Once Job1 and Job2 stop, Job3 will start running with 3 GPUs. Job4 will then hold up the rest of the queue, since only 1 GPU is free and it needs 3. Job5 has to wait behind Job4 even though it needs only the 1 GPU that is idle.
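The blocking behaviour described above can be sketched as a small simulation. This is not Atlas's scheduler code; it is a simplified model that assumes all jobs running together finish at the same time, and the function name simulate_fifo is hypothetical.

```python
from collections import deque


def simulate_fifo(total_gpus, jobs):
    """Group jobs into rounds under strict first-in, first-out allocation.

    jobs is an ordered list of (name, num_gpus) tuples. Simplifying
    assumption: every job in a round finishes before the next round starts.
    """
    pending = deque(jobs)
    rounds = []
    while pending:
        free = total_gpus
        running = []
        # Only the head of the queue is considered. If it does not fit,
        # nothing behind it is started, even if a later job would fit.
        while pending and pending[0][1] <= free:
            name, needed = pending.popleft()
            free -= needed
            running.append(name)
        if not running:
            raise ValueError(f"{pending[0][0]} requests more GPUs than exist")
        rounds.append(running)
    return rounds


example_queue = [("Job1", 2), ("Job2", 2), ("Job3", 3), ("Job4", 3), ("Job5", 1)]
for number, running in enumerate(simulate_fifo(4, example_queue), start=1):
    print(f"round {number}: {running}")
```

In this model, Job3 runs alone with one GPU idle, and Job5 only starts once Job4 has reached the head of the queue. If throughput matters, ordering submissions so that GPU requests pack evenly into the available GPUs avoids idle devices.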
Limiting GPU access to Atlas
By default, Atlas has access to all GPUs on the host machine. You can restrict which GPUs are available by setting the CUDA_VISIBLE_DEVICES environment variable.
CUDA_VISIBLE_DEVICES takes a comma-separated string of numbers, where each number is the ID of a GPU you wish to make visible to Atlas.
You can export this variable in the same shell right before starting Atlas:
export CUDA_VISIBLE_DEVICES=0,1
atlas-server start -g
Or set the variable in the Atlas start command:
CUDA_VISIBLE_DEVICES=0,2 atlas-server start -g
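To make the value format concrete, here is a minimal sketch of how a comma-separated CUDA_VISIBLE_DEVICES string maps to a list of GPU IDs. The helper name parse_visible_devices is hypothetical and not part of Atlas or CUDA, and the sketch handles only the numeric IDs described above.

```python
def parse_visible_devices(value):
    """Turn a CUDA_VISIBLE_DEVICES string into a list of integer GPU IDs.

    Returns None when the variable is unset (value is None), which
    corresponds to the default of every GPU being visible.
    """
    if value is None:
        return None
    # Skip empty pieces so a trailing comma or an empty string is tolerated.
    return [int(part) for part in value.split(",") if part.strip()]


print(parse_visible_devices("0,1"))  # [0, 1]
print(parse_visible_devices("0,2"))  # [0, 2]
print(parse_visible_devices(None))   # None
```

Passing os.environ.get("CUDA_VISIBLE_DEVICES") to this helper before starting Atlas is one way to check which GPUs you are about to expose.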