Fine-tuning a model on sequences longer than the max sequence input length


For a research project I am doing, I am trying to fine-tune BioGPT-Large on data from WikiPathways, which are essentially pathways together with all the genes that belong to them. The max sequence length for BioGPT-Large is 1024, and most of my sequences are longer than that, some even up to 30k tokens. I could of course truncate all the sequences to 1024 or split them into smaller chunks, but data would get lost that way. I am pretty stuck right now; any solutions?

I tried both truncating and splitting into smaller chunks. Truncating results in a loss of information, and splitting into chunks means that information which is supposed to be in one sequence gets spread across smaller chunks, so information is lost there as well.
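For reference, a minimal sketch of the two approaches described above, assuming the Hugging Face tokenizer and the `microsoft/BioGPT-Large` model id (both assumptions, not necessarily how the original setup works):

```python
# Sketch of the truncation vs. chunking approaches, assuming the
# Hugging Face tokenizer for the (assumed) microsoft/BioGPT-Large checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")
text = "GENE_A GENE_B GENE_C ..."  # placeholder pathway text

# Option 1: hard truncation at the 1024-token limit (drops the tail)
truncated = tokenizer(text, truncation=True, max_length=1024)

# Option 2: split the full token sequence into 1024-token chunks
ids = tokenizer(text)["input_ids"]
chunks = [ids[i:i + 1024] for i in range(0, len(ids), 1024)]
```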


1 Answer

Answered by Peter

I'm not sure how your data is encoded. Each element of a genetic code is just a few base pairs, isn't it? A byte has 256 possible values, though once it has to be represented as text you have fewer available; still, you can probably find something denser than base64 encoding. The point is that you keep the pattern you feed the model while spending far fewer tokens on it.
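A minimal sketch of what a denser-than-base64 encoding could look like, assuming the raw data were plain nucleotide strings (A/C/G/T); WikiPathways gene lists would need a different mapping, e.g. a dictionary of gene symbols to short ids:

```python
# Pack 4 bases into each byte (2 bits per base) instead of 1 character per base.
# Hypothetical encoding for illustration; the original data may look different.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_bases(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    # Note: the original length must be stored separately if len(seq) % 4 != 0.
    return bytes(out)

packed = pack_bases("ACGTACGTACGT")  # 12 bases -> 3 bytes
```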

Another option, though this one is highly theoretical, almost philosophical: compress it, e.g. zip it. That sounds like it would destroy the sequences, but a compressed file is still a unique, near-minimal representation of the data. Since LLMs are behemoths at pattern detection (I have never tested this), it might still work, because you end up with a very condensed representation of the pattern. The compressed bytes would then need to be re-encoded again to keep them within the allowed characters.
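A minimal sketch of the compress-then-re-encode idea, assuming zlib for compression and ASCII85 to map the compressed bytes back to printable text the tokenizer can handle; whether the model can actually learn from such input is untested:

```python
import base64
import zlib

pathway_text = "WNT signalling: CTNNB1 APC AXIN1 GSK3B ..."  # placeholder

# Compress, then re-encode the bytes as printable characters
compressed = zlib.compress(pathway_text.encode("utf-8"), 9)
printable = base64.b85encode(compressed).decode("ascii")

# Round-trip to confirm the representation is lossless
restored = zlib.decompress(base64.b85decode(printable)).decode("utf-8")
assert restored == pathway_text
```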

Another option, which you partly tried already, is the chunking route, but make sure the chunks are taken from random positions. It might take a lot longer to train, but eventually the model might learn where the chunks belong. You could cut randomly or at fixed positions, and additionally include the cut positions themselves, telling the model where the chunks were made, which gives it some extra information to do its job. A sketch of this is below.
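A minimal sketch of chunking with position information, under the assumption that the offset is later prepended to each chunk as a small text marker; the marker format, chunk size, and function name are illustrative, not anything BioGPT itself expects:

```python
import random

def chunk_with_positions(token_ids, chunk_len=1000, seed=0):
    """Cut a long token sequence into chunks, starting the first cut at a
    random offset, and return (start_offset, chunk) pairs so the offset can
    be prepended as extra context. chunk_len=1000 leaves headroom under the
    1024-token limit for such a marker."""
    rng = random.Random(seed)
    chunks = []
    start = rng.randrange(1, chunk_len)  # random position of the first cut
    chunks.append((0, token_ids[:start]))
    while start < len(token_ids):
        chunks.append((start, token_ids[start:start + chunk_len]))
        start += chunk_len
    return chunks

parts = chunk_with_positions(list(range(3000)), chunk_len=1000)
```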