I've been running some JAX code successfully on a TPU v4-64 slice. However, my slice was preempted, and after recreating a slice of the same size I now get the following error:
"RuntimeError: Unable to initialize backend 'tpu': INTERNAL: SliceBuilder detects hardware error and is stopping TPU slice. (set JAX_PLATFORMS='' to automatically choose an available backend)".
The JAX code has not changed between the old and new slices. I recreated a v4-64 slice a second time but hit the same error. The error always occurs on worker 0.
Any help would be greatly appreciated!
So far I've tried:
- recreating the slice
- launching the command on each worker separately rather than with --all-workers
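
To separate backend initialization from the rest of the program, a minimal probe along these lines could be run on each worker. This is only a sketch: probe_tpu_backend is a hypothetical helper name, and it assumes jax is installed on every worker. It forces backend initialization with jax.devices() and reports either the device counts or the exception text for that host.

```python
import socket


def probe_tpu_backend():
    """Try to initialize the JAX backend and report the outcome for this host.

    Hypothetical helper for illustration; not part of JAX.
    """
    report = {"host": socket.gethostname(), "ok": False, "detail": ""}
    try:
        # Imported inside the try so that an import failure is reported too.
        import jax

        # jax.devices() triggers backend initialization, which is where the
        # "Unable to initialize backend 'tpu'" RuntimeError would surface.
        devices = jax.devices()
        report["ok"] = True
        report["detail"] = (
            f"platform={devices[0].platform}, "
            f"process {jax.process_index()} of {jax.process_count()}, "
            f"{jax.local_device_count()} local / {len(devices)} total devices"
        )
    except Exception as exc:
        report["detail"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    print(probe_tpu_backend())
```

If only worker 0 reports a failure while the other workers either succeed or wait, that would point at the host or hardware rather than at the JAX program. Note that on a multi-host slice, initialization may block until every worker has started, so the probe likely needs to be launched on all workers at roughly the same time.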