Walkthrough¶
Atlas is well suited for running large hyperparameter searches across multiple projects. Atlas' built-in scheduler allows for queuing of multiple jobs for execution inside a Worker. The contents of the user's current working directory are copied to an Atlas working directory and mounted inside the Worker. The results are then stored in an archive location that can be accessed using the CLI.
Note
Atlas comes with the Scheduler (a Docker-based local scheduler) out of the box. The installation also sets up a configuration file called scheduler.config.yaml; this is the scheduler_config that is used for most CLI and SDK commands.
See here for how you can create a new scheduler config to launch to a remote machine.
Creating a project¶
Use the foundations init <project_name> command to create a template project directory. This command will create a project directory in your current directory with the following contents:
foundations init my_project
my_project
|--- README.txt
|--- data
|--- job.config.yaml
|--- main.py
The template provides a sample main.py which logs a few arbitrary metrics and saves an artifact so that results can be viewed in the GUI. We can use this main.py to submit our first job to the Scheduler.
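As a rough idea of what the template does, a minimal script in the same spirit is sketched below. The actual generated main.py will differ, and the metric name, value and artifact file here are purely illustrative (this sketch assumes the SDK's log_metric and save_artifact helpers):
import foundations

# Log an arbitrary metric so that it appears in the GUI
foundations.log_metric("accuracy", 0.99)

# Write a small file and save it as a job artifact
with open("results.txt", "w") as results_file:
    results_file.write("hello from Atlas")
foundations.save_artifact("results.txt", "results")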
Also included in the project directory are an empty data folder, a README and a sample job configuration, job.config.yaml. We will explore the job configuration in more detail in a later section.
Submitting a single job to the Scheduler¶
The foundations submit CLI command is used to submit jobs to the Scheduler. We can submit our main.py to the Scheduler for execution as follows:
foundations submit scheduler . main.py
Once this command is executed, we'll start seeing logs streaming in the console:
Foundations INFO: Job submission started. Ctrl-C to cancel.
Foundations INFO: Preparing to bundle contents of /home/<user>/code/my_project for execution. Estimating bundle size.
Foundations INFO: Bundling job contents.
Foundations INFO: Job submitted with ID '45540b52-ffe2-44ab-9838-db467d3499c0'.
Foundations INFO: Job queued. Ctrl-C to stop streaming - job will not be interrupted or cancelled.
Foundations INFO: Job running, streaming logs.
========================================================================================================================
No user requirements found.
========================================================================================================================
Foundations INFO: Job '45540b52-ffe2-44ab-9838-db467d3499c0' has completed.
We can also view the job status and any captured metadata in the GUI as usual.
This CLI command takes three positional arguments: scheduler_config, job_dir and command.
The scheduler_config is scheduler, which will be the case for all Atlas job submissions. The job_dir refers to the project directory, which is . here, and command in this case is main.py.
command refers to the Docker command to run inside the Worker container. We can pass additional arguments to command in case our script accepts command-line arguments. Please refer to the CLI documentation for additional details on the foundations submit command.
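For example, if main.py accepted command-line flags (the --epochs flag below is hypothetical), the extra arguments are simply appended after the script name:
foundations submit scheduler . main.py --epochs 10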
Adding project requirements¶
The Atlas Worker is based on the official TensorFlow image and comes pre-configured with some common dependencies like scikit-learn and xgboost.
However, if certain project-specific Python packages are required, they can be added by including a requirements.txt in the project directory. The Atlas Worker is configured to install the requirements.txt at start-up.
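For example, a project that needs a couple of extra packages might ship a requirements.txt like the following (the packages listed are purely illustrative):
requirements.txt
torchvision
opencv-python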
Note
The requirements.txt is installed every time a job is launched. It is recommended to use a custom Worker with project requirements pre-installed to avoid the start-up delay.
Hyperparameter searches¶
Atlas makes it really easy to optimize your model by supporting multi-job execution, as well as allowing you to load parameter values at runtime. During a hyperparameter search, your jobs are queued in the Scheduler. Atlas also exposes various ways of interacting with the job queue (via the CLI or GUI). Furthermore, Atlas automatically tracks the parameter values across different jobs.
The recommended way to launch a hyperparameter search is to make use of the foundations.submit() function.
This function is the SDK counterpart of the foundations submit CLI command and allows you to programmatically submit a large number of jobs for execution without tying up your console.
There are two important steps when launching a hyperparameter search:
1. Pass in a dict of parameters to foundations.submit() using the params argument
2. Load these parameters in the script passed to the command argument using foundations.load_parameters()
Let's go through an example below:
Launching a hyper-parameter search¶
In this random search example, we specify a few hyper-parameter ranges as a dict. These are then passed into foundations.submit() as shown below:
Random Search Example:
import os
os.environ["FOUNDATIONS_COMMAND_LINE"] = "True"  # Required so that an extra job is not created when launching the hyperparameter search

import foundations

param_ranges = {
    "epochs": {
        "min": 5,
        "max": 40,
    },
    "layer_shapes": {
        "min": 256,
        "max": 768,
        "count_min": 1,
        "count_max": 3
    },
    "early_stopping_tolerance": 0.01,
    "batch_size": [64, 128, 1024]
}

for _ in range(5):
    params = some_random_parameter_generator(param_ranges)
    foundations.submit(scheduler_config="scheduler", command=["main.py"], params=params)
some_random_parameter_generator is an implementation of your choice.
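One possible sketch of such a generator, written against the param_ranges structure above (this helper is not part of the Atlas SDK):
import random

def some_random_parameter_generator(param_ranges):
    """Draw one random set of parameters from the ranges defined above."""
    params = {}

    # Sample an integer number of epochs from the given range
    params["epochs"] = random.randint(param_ranges["epochs"]["min"],
                                      param_ranges["epochs"]["max"])

    # Sample between count_min and count_max layer sizes, each within [min, max]
    shapes = param_ranges["layer_shapes"]
    num_layers = random.randint(shapes["count_min"], shapes["count_max"])
    params["layer_shapes"] = [random.randint(shapes["min"], shapes["max"])
                              for _ in range(num_layers)]

    # Fixed values are passed through unchanged
    params["early_stopping_tolerance"] = param_ranges["early_stopping_tolerance"]

    # Choose one of the listed batch sizes
    params["batch_size"] = random.choice(param_ranges["batch_size"])

    return params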
This will queue up multiple jobs with the Scheduler that will be executed sequentially.
Each of the job directories for these experiments will contain a foundations_job_parameters.json file that is generated by foundations.submit().
This parameters file now needs to be loaded at runtime so that the hyperparameters are available at training time.
Loading in parameter values¶
The block below shows a sample foundations_job_parameters.json file that is generated for one of the queued jobs.
foundations_job_parameters.json file:
{ "epochs": 6, "batch_size": 64, "layer_shapes": [128, 256, 192, 128], "early_stopping_tolerance": 0.01, }
These parameters can then be loaded inside the script passed to command using the foundations.load_parameters() function, as shown below:
# Example main.py
import foundations

# Load the parameters generated by foundations.submit() for this job
params = foundations.load_parameters()

# ... build the model, load data, etc. ...

train_model(params["batch_size"], params["epochs"])
This not only makes it easy to specify a lot of different parameter values in one centralized location, but also makes tracking easier when running multiple jobs.
In addition, by using this function, Atlas will automatically track the parameter values for the job in the GUI and SDK, so manual parameter logging using foundations.log_param() is not required.
See the log_param docs for more info.
Retrieving logs and job archives¶
Viewing logs¶
Once we submit a hyperparameter search, the console will only stream logs related to the launch of the search itself. To view logs associated with the actual job execution, copy the Job UUID from the GUI and use the following CLI command:
foundations get logs scheduler <job_id>
This command will retrieve logs for the given job from the scheduler execution environment, which is the default environment for Atlas.
Retrieving archives¶
Now that we've run a few jobs, let's retrieve the archive for one of the jobs and see how Atlas provides us with experiment version control.
Copy the UUID for one of the jobs from the GUI and execute the following command in the console:
foundations get job scheduler <job_id>
This command will retrieve the job bundle from the scheduler execution environment, which is the default environment for Atlas.
The job bundle contains the state of the directory at the time the job was executed.
This creates an audit trail for all experiments and their associated artifacts in case a specific model and its code need to be retrieved.
Please refer to the CLI reference for additional information.
Interacting with Jobs¶
Stopping a running job¶
A running job can be stopped using:
foundations stop job scheduler <job_id>
Clearing a job queue¶
Jobs queued for execution in the Scheduler can be cleared using:
foundations clear-queue scheduler
Note:
- This command currently clears the entire scheduler queue and affects all projects
- Clearing a job queue does not currently delete the associated job archives. These will need to be cleared manually for now and can be found under ~/.foundations/job_data/
Deleting jobs¶
Jobs can be deleted using the following command:
foundations delete job scheduler <job_id>
Only completed or failed jobs can be deleted. Deleting a job removes it from the GUI and deletes the associated job archive. Note: sudo access is required for deleting jobs, which means you will be prompted for a password.
Job configuration¶
The job configuration file allows for configuration of job related metadata, execution environments & resources.
A template job.config.yaml is generated when creating a project using the foundations init <project_name> command.
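The sketch below gives a rough idea of what a job.config.yaml can contain; the project name, GPU count, volume paths and field layout shown here are illustrative, so consult the generated template for the authoritative schema.
job.config.yaml
project_name: my_project
num_gpus: 0
worker:
  volumes:
    /home/<user>/data:
      bind: /data
      mode: rw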
Tip
Additional volumes mounted into the docker container within the job.config.yaml (as shown in the example above) need to exist on the server-side machine.
Custom workers¶
Using the standard Worker as a base¶
Custom Workers allow end-users to create or use customized execution environments.
The simplest use case for custom Workers is to create a Worker pre-populated with a requirements.txt to avoid the installation of requirements at job launch.
There are two steps required to create a custom Worker:
- Worker image creation
- Worker specification in job.config.yaml
The code block below shows a sample Dockerfile to create a custom Worker.
This image will install project-specific requirements after inheriting from the base Atlas Worker and will set the entrypoint to Python.
The entrypoint override is recommended since the standard Worker image's entrypoint installs any requirements.txt present in the project directory, which would otherwise be re-installed on every job launch.
Dockerfile
FROM atlas-ce/worker:latest
COPY ./requirements.txt /tmp/requirements.txt
RUN pip install --requirement /tmp/requirements.txt
RUN rm /tmp/requirements.txt
ENTRYPOINT ["python"]
This image can then be built and tagged using docker build . -t my-custom-worker:latest (note that Docker image names must be lowercase). This makes the image available to your local Docker daemon.
To use this image, update job.config.yaml as below:
worker:
  image: my-custom-worker:latest
Using a different image as a base¶
Atlas also supports using a different image as a base; however, in this case, some SDK-specific dependencies must be installed. To create a custom Worker using a different base, install the following requirements into the image using the steps above:
wheel
request
jsonschema
dill==0.2.8.2
redis==2.10.6
pandas==0.23.3
google-api-python-client==1.7.3
google-auth-httplib2==0.0.3
google-cloud-storage==1.10.0
PyYAML==3.13
pysftp==0.2.8
paramiko==2.4.1
mock==2.0.0
freezegun==0.3.8
boto3==1.9.86
boto==2.49.0
flask-restful==0.3.6
Flask==1.1.0
Werkzeug==0.15.4
Flask-Cors==3.0.6
mkdocs==1.0.4
promise==2.2.1
pyarmor==5.5.6
slackclient==1.3.0
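As a sketch (assuming the packages above have been saved into a file named atlas_worker_requirements.txt, and using python:3.6 purely as an illustrative base image), such a Worker could be built in the same way as before:
Dockerfile
FROM python:3.6
COPY ./atlas_worker_requirements.txt /tmp/atlas_worker_requirements.txt
RUN pip install --requirement /tmp/atlas_worker_requirements.txt
RUN rm /tmp/atlas_worker_requirements.txt
ENTRYPOINT ["python"]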