Tracking separate train/test processes with Trains

Question

Asked by Michael Litvin on 11 June 2020 at 20:05

In my setup, I run a script that trains a model and starts generating checkpoints. Another script watches for new checkpoints and evaluates them. The scripts run in parallel, so evaluation is just a step behind training.

What's the right Trains configuration to support this scenario?

There are 2 answers

Dagan On 09 July 2020 at 08:30

@MichaelLitvin, we had the same issue. Because training and testing run the same code, everything we logged had the same names in both. To keep the train and test plots from getting mixed up in Trains, we modified tensorflow_bind.py to add a different prefix for the "train" and "validation" streams. Trains' own bugfix was to add the logdir name (which was not that clear for us).

*This was done 1-2 years ago, so it might be redundant now.

Cheers, Dagan

**Martin.B** · Accepted Answer · 2020-06-11T20:16:15+00:00

Disclaimer: I'm part of the allegro.ai Trains team.

Do you have two experiments, one for testing and one for training?

If you do have two experiments, I would make sure the models are logged in both of them (this is automatic if they are stored on the same shared folder, S3 bucket, etc.). Then you can quickly see the performance of each one.

Another option is to share a single experiment: the second process adds its reports to the original experiment. That means you have to pass it the experiment ID somehow. Then you can do:

from trains import Task

task = Task.get_task(task_id='training_task_id')
task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)
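
As a minimal sketch of the shared-experiment option, the helper below reports a dictionary of evaluation metrics under its own plot title, so they stay apart from the training curves. It relies only on the `report_scalar(title, series, value, iteration)` call used in this answer. The helper name `report_eval_metrics`, the default title `evaluation`, and the `RecordingLogger` stand-in (which lets the sketch run without a Trains server) are assumptions made for illustration, not part of Trains.

```python
from typing import Dict, List, Tuple


class RecordingLogger:
    """Hypothetical stand-in for a Trains logger; it only records the calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, float, int]] = []

    def report_scalar(self, title: str, series: str, value: float, iteration: int) -> None:
        self.calls.append((title, series, value, iteration))


def report_eval_metrics(logger, metrics: Dict[str, float], iteration: int,
                        title: str = "evaluation") -> None:
    """Report each metric as its own series under a single evaluation title."""
    for series, value in sorted(metrics.items()):
        logger.report_scalar(title, series, value=value, iteration=iteration)


# With Trains installed, the logger would come from the training task instead:
#   logger = Task.get_task(task_id='training_task_id').get_logger()
logger = RecordingLogger()
report_eval_metrics(logger, {"loss": 0.4, "accuracy": 0.9}, iteration=1)
```

Passing the checkpoint's training step as `iteration` should keep the evaluation points aligned with the training curve on the same x-axis.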

EDIT: Are the two processes always launched together, or is the checkpoint test general-purpose code?

EDIT2:

Let's assume you have a main script training a model. This experiment has a unique task ID:

my_uid = Task.current_task().id

Let's also assume you have a way to pass it to your second process. (If it is an actual sub-process, it inherits the OS environment variables, so you could set os.environ['MY_TASK_ID'] = my_uid before launching it.)

Then, in the evaluation script, you can report directly into the main training Task like so:

import os
from trains import Task

train_task = Task.get_task(task_id=os.environ['MY_TASK_ID'])
train_task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)
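
The environment-variable handoff can be checked on its own, without Trains. The sketch below starts a child Python process with MY_TASK_ID set and reads back what the child sees. The function name `launch_with_task_id`, the inline child script, and the placeholder ID are hypothetical; in a real setup the ID would be Task.current_task().id and the child would be the evaluation script.

```python
import os
import subprocess
import sys


def launch_with_task_id(task_id: str, child_code: str) -> str:
    """Run a child Python process with MY_TASK_ID set and return its stdout."""
    # Copy the parent's environment and add the task ID for the child.
    env = dict(os.environ, MY_TASK_ID=task_id)
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Hypothetical child: a real evaluation script would call
# Task.get_task(task_id=os.environ['MY_TASK_ID']) instead of printing the ID.
CHILD_CODE = "import os; print(os.environ['MY_TASK_ID'])"

# 'placeholder-task-id' stands in for Task.current_task().id.
seen_by_child = launch_with_task_id("placeholder-task-id", CHILD_CODE)
```

If the evaluator is started independently rather than as a sub-process, it will not inherit the variable, so the ID would have to be passed another way, such as a command-line argument or a file next to the checkpoints.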
