Core concepts
Programs that train a model typically follow a similar pattern:
- Loading data samples & instantiating the model,
- Feeding the model batches of sample-label pairs, which go through the model's forward pass,
- Computing the loss as a difference between the predicted labels and the ground truth labels,
- Propagating this error backwards using backpropagation,
- Updating the model parameters using an optimizer.
During each iteration, the program can also collect statistics (such as the training / validation loss & accuracy) and optionally save the weights of the resulting model to a file.
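The pattern above can be sketched without any framework. The example below is only an illustration and not MI-Prometheus code: a two-parameter linear model stands in for a neural network, the gradients are written out by hand instead of coming from automatic backpropagation, and all names (`make_dataset`, `batches`, `train`) are hypothetical.

```python
import random


def make_dataset(n=64, seed=0):
    """Load data: synthetic, noise-free samples of y = 3x + 1."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + 1.0) for x in xs]


def batches(data, batch_size):
    """Yield consecutive batches of (sample, label) pairs."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]


def train(data, epochs=200, batch_size=16, lr=0.1):
    w, b = 0.0, 0.0  # instantiate the "model": y_hat = w * x + b
    history = []     # statistics: mean training loss per epoch
    for _ in range(epochs):
        epoch_loss = 0.0
        for batch in batches(data, batch_size):
            # Forward pass.
            preds = [w * x + b for x, _ in batch]
            # Loss: mean squared error against the ground truth labels.
            errors = [p - y for p, (_, y) in zip(preds, batch)]
            loss = sum(e * e for e in errors) / len(batch)
            # Backward pass: gradients of the loss w.r.t. w and b.
            grad_w = 2.0 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            # Optimizer step: plain stochastic gradient descent.
            w -= lr * grad_w
            b -= lr * grad_b
            epoch_loss += loss * len(batch)
        history.append(epoch_loss / len(data))
    return w, b, history


if __name__ == "__main__":
    w, b, history = train(make_dataset())
    print(f"w={w:.3f} b={b:.3f} final loss={history[-1]:.2e}")
```

In a real program the hand-written gradient and update lines would usually be replaced by the deep learning library's autograd and optimizer, and `history` would be extended with validation statistics.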
This typical workflow led us to the formalization of the core concepts of the framework:
- Problem: A dataset or a data generator, returning batches of inputs and ground truth labels used for training, validating or testing a model,
- Model: A trainable model (e.g. a neural network),
- Worker: A specialized application that instantiates the Problem & Model objects and controls the interactions between them, e.g. during training or inference,
- Configuration file(s): YAML file(s) containing the parameters of the Problem, Model and training procedure,
- Experiment: A single run (training & validation or test) of a given Model on a given Problem, using a specific Worker and Configuration file(s).
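To make the relationship between these concepts concrete, here is a minimal sketch of how they could fit together. It does not reproduce the MI-Prometheus classes or configuration schema: the class names, methods and configuration keys below are hypothetical, and a plain dictionary stands in for the contents of a parsed YAML file.

```python
# Stand-in for a parsed YAML configuration file (hypothetical keys).
CONFIG = {
    "problem": {"size": 32, "batch_size": 8},
    "model": {"initial_weight": 0.0},
    "training": {"epochs": 50, "learning_rate": 0.1},
}


class Problem:
    """Data generator returning batches of (input, ground truth) pairs."""

    def __init__(self, params):
        size = params["size"]
        self.batch_size = params["batch_size"]
        # Toy task: learn y = 2x on inputs in [0, 1).
        self.samples = [(i / size, 2.0 * i / size) for i in range(size)]

    def batches(self):
        for i in range(0, len(self.samples), self.batch_size):
            yield self.samples[i:i + self.batch_size]


class Model:
    """Trainable model with a single weight: y_hat = weight * x."""

    def __init__(self, params):
        self.weight = params["initial_weight"]

    def forward(self, x):
        return self.weight * x


class Trainer:
    """Worker: instantiates the Problem and the Model and drives training."""

    def __init__(self, config):
        self.problem = Problem(config["problem"])
        self.model = Model(config["model"])
        self.epochs = config["training"]["epochs"]
        self.lr = config["training"]["learning_rate"]

    def run(self):
        for _ in range(self.epochs):
            for batch in self.problem.batches():
                # Gradient of the mean squared error w.r.t. the weight.
                grad = sum(
                    2.0 * (self.model.forward(x) - y) * x for x, y in batch
                ) / len(batch)
                self.model.weight -= self.lr * grad
        return self.model.weight


if __name__ == "__main__":
    # An Experiment: one run of a Worker on a given configuration.
    print(Trainer(CONFIG).run())
```

Keeping every parameter in the configuration means the same Worker can run a different experiment by swapping the configuration alone.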
Aside from the Workers, MI-Prometheus currently offers two other types of specialized applications, namely:
- Grid Worker: A specialized application which automates the handling of a number of experiments in parallel.
- Helper: An application useful from the point of view of a running experiment, but which is independent and external to the Workers.
The general idea here is that the Grid Workers are useful for reproducing research, e.g. when one trains a set of independent models on a set of problems and compares the results. In such a situation, the user can use a Helper to download the required datasets (before training) and/or preprocess them in a specific way (e.g. extract representations), which will reduce the overall time of all experiments.
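The idea of running a grid of independent experiments in parallel and comparing the results can be sketched as follows. This is an assumption-laden illustration rather than the Grid Workers' actual implementation: `PROBLEMS`, `MODELS`, `run_experiment` and `run_grid` are hypothetical names, each "experiment" is a tiny one-weight fit, and a thread pool is used only to keep the example short (CPU-bound training would more likely use separate processes or devices).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical grid: each problem is a target slope, each model
# configuration is a learning rate for a one-weight linear model.
PROBLEMS = {"double": 2.0, "triple": 3.0}
MODELS = {"slow": 0.05, "fast": 0.5}


def run_experiment(problem, model, steps=100):
    """Run one independent experiment and return a result record."""
    slope, lr = PROBLEMS[problem], MODELS[model]
    xs = [i / 10.0 for i in range(1, 11)]
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error w.r.t. the weight.
        grad = sum(2.0 * (w * x - slope * x) * x for x in xs) / len(xs)
        w -= lr * grad
    loss = sum((w * x - slope * x) ** 2 for x in xs) / len(xs)
    return {"problem": problem, "model": model, "loss": loss}


def run_grid(max_workers=4):
    """Run every (problem, model) combination in parallel, best loss first."""
    grid = list(product(PROBLEMS, MODELS))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda pair: run_experiment(*pair), grid))
    return sorted(results, key=lambda r: r["loss"])


if __name__ == "__main__":
    for record in run_grid():
        print(record)
```

Because the experiments share no state, any shared preparation (such as downloading or preprocessing a dataset) is best done once beforehand, which is the role the text assigns to a Helper.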
You can read more about MI-Prometheus here.