Tracking separate train/test processes with Trains


In my setup, I run a script that trains a model and starts generating checkpoints. Another script watches for new checkpoints and evaluates them. The scripts run in parallel, so evaluation is just a step behind training.

What's the right Trains configuration to support this scenario?

There are 2 answers

Martin.B (best answer)

Disclaimer: I'm part of the allegro.ai Trains team.

Do you have two experiments, one for training and one for testing?

If you do have two experiments, I would make sure the models are logged in both of them (this happens automatically if they are stored in the same shared folder, S3 bucket, etc.). Then you can quickly see the performance of each one.

Another option is to share a single experiment: the second process adds its reports to the original experiment. That means you have to pass the experiment ID to it somehow. Then you can do:

from trains import Task

task = Task.get_task(task_id='training_task_id')
task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)

EDIT: Are the two processes always launched together, or is the checkpoint test general-purpose code?

EDIT2:

Let's assume you have a main script training a model. This experiment has a unique task ID:

my_uid = Task.current_task().id

Let's also assume you have a way to pass it to your second process (if this is an actual sub-process, it inherits the OS environment variables, so you could do os.environ['MY_TASK_ID'] = my_uid).
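A minimal sketch of that hand-off, assuming the evaluator is started from the training script. The helper name launch_evaluator and the stand-in child command are hypothetical and only illustrate the environment inheritance; in a real setup the task ID would come from Task.current_task().id and the command would be your evaluation script.

```python
import os
import subprocess
import sys


def launch_evaluator(task_id, command):
    """Start the evaluation process with the training task ID in its environment."""
    env = dict(os.environ)          # copy the parent's environment
    env["MY_TASK_ID"] = task_id     # variable name taken from the answer above
    return subprocess.Popen(command, env=env, stdout=subprocess.PIPE, text=True)


# Hypothetical stand-in for the evaluation script: it only echoes the ID it received.
CHILD_COMMAND = [sys.executable, "-c", "import os; print(os.environ['MY_TASK_ID'])"]
```

If the two scripts are launched independently rather than as parent and child, the ID has to travel some other way (a command-line argument or a small file next to the checkpoints, for example).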

Then in the evaluation script you could report directly into the main training Task like so:

import os
from trains import Task
train_task = Task.get_task(task_id=os.environ['MY_TASK_ID'])
train_task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)
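
For the watcher side, here is a rough sketch of how the evaluation script might pick up new checkpoints and report each result at the iteration the checkpoint was saved. The file naming pattern ckpt_<iteration>.pt, the helper names, and the evaluate callback are all assumptions for illustration; logger is expected to be the object returned by train_task.get_logger().

```python
import os
import re


def new_checkpoints(ckpt_dir, seen):
    """Return (iteration, path) pairs for checkpoints not seen before, lowest iteration first."""
    found = []
    for name in os.listdir(ckpt_dir):
        match = re.fullmatch(r"ckpt_(\d+)\.pt", name)  # assumed naming scheme
        if match and name not in seen:
            seen.add(name)
            found.append((int(match.group(1)), os.path.join(ckpt_dir, name)))
    return sorted(found)


def report_new(logger, ckpt_dir, seen, evaluate):
    """Evaluate each new checkpoint and report its loss into the training task."""
    for iteration, path in new_checkpoints(ckpt_dir, seen):
        loss = evaluate(path)  # user-supplied evaluation function
        logger.report_scalar("evaluation", "loss", value=loss, iteration=iteration)
```

Calling report_new in a loop with a short sleep keeps evaluation one step behind training, and using a distinct title such as "evaluation" keeps these curves apart from the training ones in the same experiment.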
Dagan

@MichaelLitvin, we had the same issue, and we also had the same names for everything we logged in train and test, since it all comes from the same code. To avoid a train/test mess in Trains' plots, we modified tensorflow_bind.py to add a different prefix for the "train" and "validation" streams. Trains' own fix was to add the logdir name, which was not that clear to us.

*This was done 1-2 years ago, so it might be redundant now.

Cheers, Dagan