TensorFlow + LSF: distributed TensorFlow on an LSF cluster

How do I set up TensorFlow to work with the LSF job scheduler? I have almost no experience with LSF. tf.train.ClusterSpec needs the IP addresses of the workers and parameter servers. Is it possible to obtain them from the LSF environment? Are there any success stories of making the two work together?

EDIT:

I found some explanations of how to achieve a similar goal on a Slurm cluster in "Running TensorFlow on a Slurm Cluster?". Basically, I'm looking for something like that, but for the LSF job scheduler.

There are 2 answers

Michael Closson (best answer)

There's a blog post and sample launch script for TensorFlow on LSF here.
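
The linked script is not reproduced here. As a rough illustration of the general idea only, here is a minimal sketch that builds the dictionary tf.train.ClusterSpec expects from the host list LSF exports to a job. It assumes LSF sets LSB_HOSTS (one hostname per slot) or the compact LSB_MCPU_HOSTS ("host slots host slots ..."), that one TensorFlow process runs per host, and that port 2222 is free. The helper names hosts_from_lsf and build_cluster, the port, and the choice of the first host as parameter server are all hypothetical.

```python
import os


def hosts_from_lsf(env=None):
    """Return the unique hosts allocated to the job, in allocation order."""
    env = os.environ if env is None else env
    # LSB_HOSTS lists one hostname per allocated slot, e.g. "hostA hostA hostB".
    hosts = env.get("LSB_HOSTS", "").split()
    if not hosts:
        # LSB_MCPU_HOSTS is the compact form, e.g. "hostA 2 hostB 1";
        # every second token (starting at 0) is a hostname.
        hosts = env.get("LSB_MCPU_HOSTS", "").split()[0::2]
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(hosts))


def build_cluster(hosts, num_ps=1, base_port=2222):
    """Split hosts into parameter servers and workers (assumed layout)."""
    if len(hosts) <= num_ps:
        raise ValueError("need at least one host left over for a worker")
    addresses = [f"{host}:{base_port}" for host in hosts]
    return {"ps": addresses[:num_ps], "worker": addresses[num_ps:]}


if __name__ == "__main__":
    # Fake environment for demonstration; inside a real job, call
    # hosts_from_lsf() with no argument to read os.environ instead.
    demo_env = {"LSB_HOSTS": "node1 node1 node2 node3"}
    cluster = build_cluster(hosts_from_lsf(demo_env))
    print(cluster)
    # The dict can then be passed on: tf.train.ClusterSpec(cluster)
```

ClusterSpec accepts "hostname:port" strings, so resolving the names to IP addresses is usually unnecessary as long as the nodes can resolve each other. Each process still has to work out its own job name and task index, for example by looking up its own hostname in the returned lists.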

bR3nD4n

You could do this on LSF, but I don't recommend it. If you can use Docker, I would go that route instead. LSF brings a pile of other complications that can go wrong, and TensorFlow wasn't exactly designed to run on a system like LSF.

Docker Swarm and Compose have worked well for me in the past with this particular problem.