I'm trying to train a DreamBooth model on Kaggle using the following command:
!accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--train_data_dir="dataset/img" \
--reg_data_dir="dataset/reg" \
--output_dir="output" \
--output_name="SDXLDreambooth" \
--save_model_as="safetensors" \
--train_batch_size=1 \
--max_train_steps=8000 \
--save_every_n_steps=4001 \
--optimizer_type="adafactor" \
--optimizer_args scale_parameter=False relative_step=False warmup_init=False \
--xformers \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--learning_rate=2.5e-6 \
--max_grad_norm=0.0 \
--resolution="1024,1024" \
--save_precision="fp16" \
--save_n_epoch_ratio=1 \
--max_data_loader_n_workers=1 \
--persistent_data_loader_workers \
--mixed_precision="fp16" \
--full_fp16 \
--logging_dir="logs" \
--log_prefix="last" \
--gradient_checkpointing \
--caption_extension=".txt" \
--no_half_vae \
--cache_latents
Adding --train_text_encoder to this command gives an out-of-memory error on a P100 GPU with 16 GB of VRAM, even with all of the memory optimisations above enabled. I've tested the same command on an L4 GPU with 24 GB of VRAM on Modal, and it runs successfully with a peak VRAM utilisation of around 18 GB.
However, I've noticed that Kaggle also offers an instance with two T4 GPUs (16 GB of VRAM each), which made me wonder whether it's possible to alter the configuration of that environment (e.g. by changing the script, the DeepSpeed configuration, the Accelerate configuration, etc.) so that memory is shared across the two GPUs. Their combined 32 GB should be enough for a run that needs around 18 GB.
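My rough idea is that something like DeepSpeed ZeRO would be needed, so that optimizer state, gradients and (with stage 3) the model parameters are sharded across the two GPUs instead of being replicated. Below is a sketch of the kind of Accelerate config file I have in mind; the stage and offload choices are assumptions on my part, and I haven't confirmed that sdxl_train.py works with DeepSpeed at all:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  # stage 3 shards optimizer state, gradients and parameters across the GPUs
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: none
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  zero3_init_flag: true
  zero3_save_16bit_model: true
mixed_precision: fp16
machine_rank: 0
num_machines: 1
num_processes: 2   # one process per T4
use_cpu: false

I would then save this as something like accelerate_deepspeed.yaml (placeholder name) and launch with accelerate launch --config_file accelerate_deepspeed.yaml "./sdxl_train.py" followed by the same arguments as above, but I don't know whether that is sufficient or whether the script itself needs changes.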
The default behaviour on Kaggle seems to be to replicate the whole setup on both GPUs and run the training in parallel, so with --train_text_encoder the script would need around 18 GB on each GPU, leading to the same out-of-memory error.
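For reference, my understanding is that this default corresponds to a plain multi-GPU (data-parallel) Accelerate config along the lines of the following, where each process keeps a full copy of the model; this is my assumption about the default, not a config I've dumped from the Kaggle environment:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU   # each GPU holds a full replica of the model
mixed_precision: fp16
machine_rank: 0
num_machines: 1
num_processes: 2
use_cpu: false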
How should I configure the environment so that memory is shared across both GPUs and the out-of-memory error is avoided?
Edit: Here are some links to a sample notebook and some runs:
Neither of these has --train_text_encoder enabled, as it leads to an OOM error. Here's the memory utilisation for the T4 GPUs in the first run: