I'm following this tutorial on training a causal language model from scratch.
In the tutorial they load the standard GPT2 as follows:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)
How can I load the same model but use my own custom fully connected network instead of the standard one? I mainly want to experiment with variations such as more or fewer layers, different activation functions, etc.
I found the source code here, but it's very convoluted and I can't figure out how to replace the fully connected parts with custom ones, or what structure the custom one should have in the first place (e.g., input/output size).
Update: for example, using an FC network such as:
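import torch
from torch import nn

class FC_model(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(768, 256)    # 768 = GPT-2 hidden size
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 50000)  # output size, one logit per token
    def forward(self, x):
        # sine activations; torch.rand(1) adds a random scalar offset on every call
        x = torch.sin(self.fc1(x)) + torch.rand(1)
        x = torch.sin(self.fc2(x))
        x = self.fc3(x)
        return x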
 
                        
I'm assuming that by "fully connected network" you mean the fully connected (FC) / linear layers.
Printing the model, e.g. with print(model), shows you the modules inside it.
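A minimal sketch of that inspection step. It assumes a small, randomly initialised GPT2Config (the sizes are illustrative, not the tutorial's) so that nothing has to be downloaded; with the tutorial's config the same attribute paths apply.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative sizes only (assumption), chosen to keep the model tiny.
config = GPT2Config(vocab_size=1000, n_positions=128, n_embd=64, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config)

print(model)  # prints the full module tree

# The fully connected parts are the per-block MLP and the output head.
mlp = model.transformer.h[0].mlp   # feed-forward sub-layer of the first block
head = model.lm_head               # linear layer mapping hidden states to vocabulary logits
print(type(mlp).__name__, type(head).__name__)
print(head.in_features, head.out_features)
```

The head's in_features and out_features answer the input/output size question: it maps n_embd (768 for the standard GPT-2) to vocab_size.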
You can then access and update an FC layer by assigning a new module to the corresponding attribute, for example model.lm_head.
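As a hedged sketch of replacing the output head with a custom FC network: FCHead and its hidden size are hypothetical names and values, and the config sizes are again illustrative. The sketch also assumes tie_word_embeddings=False, since a multi-layer head can no longer share weights with the input embedding.

```python
import torch
from torch import nn
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative sizes (assumption); weight tying disabled because the
# custom head below has no single weight matrix to share with the embeddings.
config = GPT2Config(vocab_size=1000, n_positions=128, n_embd=64, n_layer=2,
                    n_head=2, tie_word_embeddings=False)
model = GPT2LMHeadModel(config)

class FCHead(nn.Module):
    """Hypothetical custom head: n_embd -> hidden -> hidden -> vocab_size."""
    def __init__(self, n_embd, hidden, vocab_size):
        super().__init__()
        self.fc1 = nn.Linear(n_embd, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        x = torch.sin(self.fc1(x))
        x = torch.sin(self.fc2(x))
        return self.fc3(x)

# Input size must equal n_embd and output size must equal vocab_size.
model.lm_head = FCHead(config.n_embd, 32, config.vocab_size)

input_ids = torch.randint(0, config.vocab_size, (2, 16))
out = model(input_ids=input_ids, labels=input_ids)
print(out.logits.shape, out.loss.item())
```

The only constraint the rest of the model places on the replacement is its interface: it receives hidden states of shape (batch, sequence, n_embd) and must return logits of shape (batch, sequence, vocab_size).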
This is just a sample; you can experiment with different combinations.
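One such combination, sketched under the same assumptions (tiny illustrative config, hypothetical CustomMLP class): depth and the built-in activation can be changed through the n_layer and activation_function config fields, and the feed-forward sub-layer inside each transformer block can be swapped for a custom module as long as it maps n_embd back to n_embd.

```python
import torch
from torch import nn
from transformers import GPT2Config, GPT2LMHeadModel

# n_layer and activation_function are regular GPT2Config fields;
# the values here are illustrative assumptions.
config = GPT2Config(vocab_size=1000, n_positions=128, n_embd=64, n_layer=3,
                    n_head=2, activation_function="relu")
model = GPT2LMHeadModel(config)

class CustomMLP(nn.Module):
    """Hypothetical per-block feed-forward network: n_embd -> hidden -> n_embd."""
    def __init__(self, n_embd, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_embd),  # must return to n_embd for the residual connection
        )

    def forward(self, x):
        return self.net(x)

# Replace the MLP in every transformer block.
for block in model.transformer.h:
    block.mlp = CustomMLP(config.n_embd, 128)

input_ids = torch.randint(0, config.vocab_size, (2, 16))
logits = model(input_ids=input_ids).logits
print(logits.shape)
```

Replacing the block MLPs overrides whatever activation_function was set in the config for those sub-layers, so in practice you would pick one of the two approaches per experiment.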