When using accelerate FSDP, from_pretrained loads wrong CLIPVisionModel weights on one process; training goes to NaN
I am trying to use Accelerate FSDP on two A40 GPUs, but training produces NaN loss and -inf weights.
By debugging the code, I found that the weights of the CLIPVisionModel (which is loaded at the start of `__init__` and is not the model being trained) are wrong.
On one process, the CLIPVisionModel weights are correct:

tensor([0.3311, 0.0032, 0.1610, ..., 2.1922, 0.0050, 0.0039],)

On the other process, the CLIPVisionModel weights are wrong (they look like uninitialized memory):

tensor([-1.9921e-04, 4.5673e-41, -1.9921e-04, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],)

The wrong CLIPVisionModel weights make the hidden states inf or NaN, so the training loss is NaN.
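To confirm that the divergence really originates at load time, one way is to broadcast rank 0's copy of each parameter and compare it locally on every other rank, before anything is FSDP-wrapped. This is a minimal sketch of my own (the function `check_param_sync` is a hypothetical helper, not from the code above); it assumes `torch.distributed` is already initialized, as accelerate does when launched with multiple processes:

```python
import torch
import torch.distributed as dist


def check_param_sync(model, device="cuda"):
    """Broadcast rank 0's copy of each parameter and compare it locally.

    Assumes torch.distributed is already initialized and uses the NCCL
    backend, hence the move to GPU before broadcasting. Call this BEFORE
    accelerator.prepare(), while the weights are still unsharded.
    """
    for name, param in model.named_parameters():
        local = param.detach().to(device)
        reference = local.clone()
        dist.broadcast(reference, src=0)  # every rank receives rank 0's values
        if not torch.allclose(local, reference):
            print(f"[rank {dist.get_rank()}] mismatch in {name}")
```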
Here is my loading code:
```python
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaConfig, LlamaModel


class SLlamaModel(LlamaModel):
    config_class = SConfig  # SConfig is my custom config class

    def __init__(self, config: LlamaConfig, mm_vision_tower=None, mm_hidden_size=None):
        super(SLlamaModel, self).__init__(config)

        if hasattr(config, "mm_vision_tower"):
            # HACK: for FSDP
            self.vision_tower = CLIPVisionModel.from_pretrained(config.mm_vision_tower)

        if hasattr(config, "use_mm_proj"):
            self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)
```

Here `config.mm_vision_tower` is set to the local path of `clip-vit-large-patch14`, so the failing call is:

```python
self.vision_tower = CLIPVisionModel.from_pretrained(config.mm_vision_tower)
```
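Since rank 0 ends up with correct weights, one possible workaround is to broadcast the vision tower's tensors from rank 0 right after loading, so all processes start identical. This is a hypothetical sketch I have not verified against this exact bug (`load_vision_tower_synced` is my own name, and the device handling may need adjusting for `fsdp_offload_params`):

```python
import torch.distributed as dist
from transformers import CLIPVisionModel


def load_vision_tower_synced(path, device="cuda"):
    # Every rank calls from_pretrained; rank 0's (correct) weights then
    # overwrite whatever the other ranks loaded.
    vision_tower = CLIPVisionModel.from_pretrained(path).to(device)
    if dist.is_available() and dist.is_initialized():
        for param in vision_tower.parameters():
            dist.broadcast(param.data, src=0)
        for buffer in vision_tower.buffers():
            dist.broadcast(buffer, src=0)
    return vision_tower
```

In `__init__` this would replace the direct call, e.g. `self.vision_tower = load_vision_tower_synced(config.mm_vision_tower)`.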
Here is my FSDP config:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
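For reference (a side note, not part of the original report): `fsdp_sharding_strategy: 1` maps to PyTorch's `ShardingStrategy.FULL_SHARD`, which can be confirmed directly:

```python
from torch.distributed.fsdp import ShardingStrategy

# accelerate stores the strategy as the integer value of this enum.
print(ShardingStrategy(1))  # ShardingStrategy.FULL_SHARD
```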
- Two A40 GPUs (48 GB VRAM each)
- Accelerate version: 0.21.0
- Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.31
- Python version: 3.10.12
- Numpy version: 1.25.1
- PyTorch version (GPU?): 2.0.1+cu117 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 755.74 GB
I debugged the whole code and found that the wrong weights appear right at the `from_pretrained` loading call, not from any later code. While stopped at the loading line in the debugger, I also ran `test = CLIPVisionModel.from_pretrained("model path")` and still got wrong weights.
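A minimal standalone reproduction sketch (file name and model path are placeholders), to be run with `accelerate launch --num_processes 2 repro.py`, which prints the same weight slice on each rank:

```python
from accelerate import Accelerator
from transformers import CLIPVisionModel

accelerator = Accelerator()
model = CLIPVisionModel.from_pretrained("path/to/clip-vit-large-patch14")

# Print the same slice of the patch-embedding weight on every rank; with the
# bug, rank 0 shows real values while another rank shows garbage or zeros.
w = model.vision_model.embeddings.patch_embedding.weight
print(f"rank {accelerator.process_index}: {w.flatten()[:6].tolist()}")
```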
I think I ran into the same issue. Although I still don't understand how this happens, upgrading transformers to version 4.37.2 solved it for me.
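If that is the culprit, a simple startup guard can catch the regression early. This is only a sketch based on the report above; the version bound is just what worked here:

```python
import transformers
from packaging import version  # packaging ships as a transformers dependency

if version.parse(transformers.__version__) < version.parse("4.37.2"):
    raise RuntimeError(
        f"transformers {transformers.__version__} reportedly loads wrong "
        "weights under accelerate FSDP; upgrade to >= 4.37.2."
    )
```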