When using accelerate FSDP, from_pretrained loads wrong CLIPVisionModel weights on one process; training goes to NaN
I am trying to use Accelerate FSDP on two A40 GPUs, but training produces NaN loss and -inf weights.
By debugging the code, I found that the weights of the CLIPVisionModel (which is loaded at the start of `__init__` and is not the model being trained) are wrong.
On one process, the CLIPVisionModel weights are correct:

tensor([0.3311, 0.0032, 0.1610, ..., 2.1922, 0.0050, 0.0039],)

On the other process, the CLIPVisionModel weights are wrong (they look like uninitialized memory):

tensor([-1.9921e-04, 4.5673e-41, -1.9921e-04, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],)

The wrong CLIPVisionModel weights make the hidden states inf or NaN, so the training loss is NaN.
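To confirm that the divergence really originates at load time, one way is to broadcast rank 0's copy of each parameter and compare it locally on every other rank, before anything is FSDP-wrapped. This is a minimal sketch of my own (the function `check_param_sync` is a hypothetical helper, not from the code above); it assumes `torch.distributed` is already initialized, as accelerate does when launched with multiple processes:

```python
import torch
import torch.distributed as dist


def check_param_sync(model, device="cuda"):
    """Broadcast rank 0's copy of each parameter and compare it locally.

    Assumes torch.distributed is already initialized and uses the NCCL
    backend, hence the move to GPU before broadcasting. Call this BEFORE
    accelerator.prepare(), while the weights are still unsharded.
    """
    for name, param in model.named_parameters():
        local = param.detach().to(device)
        reference = local.clone()
        dist.broadcast(reference, src=0)  # every rank receives rank 0's values
        if not torch.allclose(local, reference):
            print(f"[rank {dist.get_rank()}] mismatch in {name}")
```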
Here is my loading code:
```python
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaConfig, LlamaModel


class SLlamaModel(LlamaModel):
    config_class = SConfig  # SConfig is my custom config class

    def __init__(self, config: LlamaConfig, mm_vision_tower=None, mm_hidden_size=None):
        super(SLlamaModel, self).__init__(config)

        if hasattr(config, "mm_vision_tower"):
            # HACK: for FSDP
            self.vision_tower = CLIPVisionModel.from_pretrained(config.mm_vision_tower)

        if hasattr(config, "use_mm_proj"):
            self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)
```

Here `config.mm_vision_tower` is set to the local path of `clip-vit-large-patch14`, so the failing call is:

```python
self.vision_tower = CLIPVisionModel.from_pretrained(config.mm_vision_tower)
```
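Since rank 0 ends up with correct weights, one possible workaround is to broadcast the vision tower's tensors from rank 0 right after loading, so all processes start identical. This is a hypothetical sketch I have not verified against this exact bug (`load_vision_tower_synced` is my own name, and the device handling may need adjusting for `fsdp_offload_params`):

```python
import torch.distributed as dist
from transformers import CLIPVisionModel


def load_vision_tower_synced(path, device="cuda"):
    # Every rank calls from_pretrained; rank 0's (correct) weights then
    # overwrite whatever the other ranks loaded.
    vision_tower = CLIPVisionModel.from_pretrained(path).to(device)
    if dist.is_available() and dist.is_initialized():
        for param in vision_tower.parameters():
            dist.broadcast(param.data, src=0)
        for buffer in vision_tower.buffers():
            dist.broadcast(buffer, src=0)
    return vision_tower
```

In `__init__` this would replace the direct call, e.g. `self.vision_tower = load_vision_tower_synced(config.mm_vision_tower)`.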
Here is my FSDP config:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
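For reference (a side note, not part of the original report): `fsdp_sharding_strategy: 1` maps to PyTorch's `ShardingStrategy.FULL_SHARD`, which can be confirmed directly:

```python
from torch.distributed.fsdp import ShardingStrategy

# accelerate stores the strategy as the integer value of this enum.
print(ShardingStrategy(1))  # ShardingStrategy.FULL_SHARD
```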
- Two A40 GPUs (48 GB VRAM each)
- Accelerate version: 0.21.0
- Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.31
- Python version: 3.10.12
- Numpy version: 1.25.1
- PyTorch version (GPU?): 2.0.1+cu117 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 755.74 GB
I debugged the whole code and found that the wrong weights appear right at the `from_pretrained` loading call, not from any later code. While stopped at the loading line in the debugger, I also ran `test = CLIPVisionModel.from_pretrained("model path")` and still got wrong weights.
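A minimal standalone reproduction sketch (file name and model path are placeholders), to be run with `accelerate launch --num_processes 2 repro.py`, which prints the same weight slice on each rank:

```python
from accelerate import Accelerator
from transformers import CLIPVisionModel

accelerator = Accelerator()
model = CLIPVisionModel.from_pretrained("path/to/clip-vit-large-patch14")

# Print the same slice of the patch-embedding weight on every rank; with the
# bug, rank 0 shows real values while another rank shows garbage or zeros.
w = model.vision_model.embeddings.patch_embedding.weight
print(f"rank {accelerator.process_index}: {w.flatten()[:6].tolist()}")
```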
I think I ran into the same issue. Although I still don't understand how this happens, upgrading transformers to version 4.37.2 solved it for me.
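If that is the culprit, a simple startup guard can catch the regression early. This is only a sketch based on the report above; the version bound is just what worked here:

```python
import transformers
from packaging import version  # packaging ships as a transformers dependency

if version.parse(transformers.__version__) < version.parse("4.37.2"):
    raise RuntimeError(
        f"transformers {transformers.__version__} reportedly loads wrong "
        "weights under accelerate FSDP; upgrade to >= 4.37.2."
    )
```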