Exits with return code = -9 when pre-training Llama-2

I want to continue pre-training Llama-2-7b-hf on a T4 GPU on Colab.

Here is my training script:

output_model=./output
if [ ! -d ${output_model} ]; then
    mkdir ${output_model}
fi
cp ./pretrain.sh ${output_model}
cp ./ds_config_zero*.json ${output_model}

deepspeed --num_gpus 1 pretrain_clm.py \
    --model_name_or_path ../../../Llama-2-7b-hf \
    --train_files ../../data/train_sft.csv \
    --validation_files  ../../data/dev_sft.csv \
                         ../../data/dev_sft_sharegpt.csv \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --output_dir ${output_model} \
    --evaluation_strategy  steps \
    --use_fast_tokenizer false \
    --max_eval_samples 500 \
    --learning_rate 3e-5 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --warmup_steps 2 \
    --logging_dir ${output_model}/logs \
    --logging_strategy steps \
    --logging_steps 2 \
    --save_strategy steps \
    --preprocessing_num_workers 10 \
    --save_steps 500 \
    --eval_steps 500 \
    --save_total_limit 2000 \
    --seed 42 \
    --disable_tqdm false \
    --ddp_find_unused_parameters false \
    --block_size 4096 \
    --overwrite_output_dir \
    --report_to tensorboard \
    --run_name ${output_model} \
    --fp16 \
    --fp16_full_eval \
    --gradient_checkpointing \
    --deepspeed ./ds_config_zero2.json \
    --ignore_data_skip true \
    --ddp_timeout 18000000 \
    | tee -a ${output_model}/train.log
    
    # --resume_from_checkpoint ${output_model}/checkpoint-20400 \
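For context, ds_config_zero2.json is a ZeRO stage-2 DeepSpeed config. I'm not certain the exact contents matter for this crash, but a typical file for this kind of run (with optimizer state offloaded to CPU; the "auto" values are filled in from the Trainer arguments) looks roughly like this sketch:

{
  "fp16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}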

After executing the above script, the following error occurs:

[2023-12-19 11:48:57,840] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 16247
[2023-12-19 11:48:57,841] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python3', '-u', 'pretrain_clm.py', '--local_rank=0', '--model_name_or_path', '../../../Llama-2-7b-hf', '--train_files', '../../data/train_sft.csv', '--validation_files', '../../data/dev_sft.csv', '../../data/dev_sft_sharegpt.csv', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--do_train', '--output_dir', './output', '--evaluation_strategy', 'steps', '--use_fast_tokenizer', 'false', '--max_eval_samples', '500', '--learning_rate', '3e-5', '--gradient_accumulation_steps', '4', '--num_train_epochs', '1', '--warmup_steps', '2', '--logging_dir', './output/logs', '--logging_strategy', 'steps', '--logging_steps', '2', '--save_strategy', 'steps', '--preprocessing_num_workers', '10', '--save_steps', '500', '--eval_steps', '500', '--save_total_limit', '2000', '--seed', '42', '--disable_tqdm', 'false', '--ddp_find_unused_parameters', 'false', '--block_size', '4096', '--overwrite_output_dir', '--report_to', 'tensorboard', '--run_name', './output', '--fp16', '--fp16_full_eval', '--gradient_checkpointing', '--deepspeed', './ds_config_zero2.json', '--ignore_data_skip', 'true', '--ddp_timeout', '18000000'] exits with return code = -9

I'm wondering whether this is because I don't have enough VRAM. If so, how much VRAM would be needed? Or should I use quantization, PEFT, LoRA, or similar techniques to continue pre-training?
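If memory is indeed the problem, would something like the following QLoRA-style setup be the right direction instead of full-parameter training? This is just a sketch based on the transformers/peft/bitsandbytes docs; the model path matches my layout above, but the LoRA hyperparameters are placeholders I have not tested.

# Sketch: load Llama-2-7b in 4-bit and add LoRA adapters (QLoRA-style).
# Assumes transformers, peft, and bitsandbytes are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_path = "../../../Llama-2-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights to fit in T4 VRAM
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepares the quantized model for training (enables gradient checkpointing,
# casts norms/embeddings appropriately).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                                    # placeholder rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # placeholder choice of target modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable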
