I want to continue pre-training Llama-2-7b-hf on a T4 GPU in Colab.
Here is my training script:
output_model=./output
if [ ! -d "${output_model}" ]; then
    mkdir -p "${output_model}"
fi
cp ./pretrain.sh "${output_model}"
cp ./ds_config_zero*.json "${output_model}"
deepspeed --num_gpus 1 pretrain_clm.py \
--model_name_or_path ../../../Llama-2-7b-hf \
--train_files ../../data/train_sft.csv \
--validation_files ../../data/dev_sft.csv \
../../data/dev_sft_sharegpt.csv \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--output_dir ${output_model} \
--evaluation_strategy steps \
--use_fast_tokenizer false \
--max_eval_samples 500 \
--learning_rate 3e-5 \
--gradient_accumulation_steps 4 \
--num_train_epochs 1 \
--warmup_steps 2 \
--logging_dir ${output_model}/logs \
--logging_strategy steps \
--logging_steps 2 \
--save_strategy steps \
--preprocessing_num_workers 10 \
--save_steps 500 \
--eval_steps 500 \
--save_total_limit 2000 \
--seed 42 \
--disable_tqdm false \
--ddp_find_unused_parameters false \
--block_size 4096 \
--overwrite_output_dir \
--report_to tensorboard \
--run_name ${output_model} \
--fp16 \
--fp16_full_eval \
--gradient_checkpointing \
--deepspeed ./ds_config_zero2.json \
--ignore_data_skip true \
--ddp_timeout 18000000 \
| tee -a ${output_model}/train.log
# --resume_from_checkpoint ${output_model}/checkpoint-20400 \
Running this script fails with the following error:
[2023-12-19 11:48:57,840] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 16247
[2023-12-19 11:48:57,841] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python3', '-u', 'pretrain_clm.py', '--local_rank=0', '--model_name_or_path', '../../../Llama-2-7b-hf', '--train_files', '../../data/train_sft.csv', '--validation_files', '../../data/dev_sft.csv', '../../data/dev_sft_sharegpt.csv', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--do_train', '--output_dir', './output', '--evaluation_strategy', 'steps', '--use_fast_tokenizer', 'false', '--max_eval_samples', '500', '--learning_rate', '3e-5', '--gradient_accumulation_steps', '4', '--num_train_epochs', '1', '--warmup_steps', '2', '--logging_dir', './output/logs', '--logging_strategy', 'steps', '--logging_steps', '2', '--save_strategy', 'steps', '--preprocessing_num_workers', '10', '--save_steps', '500', '--eval_steps', '500', '--save_total_limit', '2000', '--seed', '42', '--disable_tqdm', 'false', '--ddp_find_unused_parameters', 'false', '--block_size', '4096', '--overwrite_output_dir', '--report_to', 'tensorboard', '--run_name', './output', '--fp16', '--fp16_full_eval', '--gradient_checkpointing', '--deepspeed', './ds_config_zero2.json', '--ignore_data_skip', 'true', '--ddp_timeout', '18000000'] exits with return code = -9
I'm wondering whether this happens because I don't have enough VRAM. If so, how much VRAM would continued pre-training of Llama-2-7b with this setup require? Or should I switch to quantization, PEFT, LoRA, or similar techniques to make pre-training fit on a T4?
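For reference, if quantized LoRA training is the right direction, this is roughly the setup I would try in place of the full fine-tune above. It is only a sketch under my assumptions (transformers, peft, and bitsandbytes installed; the model path is taken from my script, and the LoRA hyperparameters are placeholders, not values I have validated):

# Sketch of a 4-bit (QLoRA-style) setup; paths and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_path = "../../../Llama-2-7b-hf"  # same checkpoint as in the script above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize base weights to 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bfloat16 support
)

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)  # re-enables gradient checkpointing, casts norms
lora_config = LoraConfig(
    r=8,                                   # placeholder rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only, to keep memory low
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the LoRA adapters would be trained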