I'm working on multilingual word embedding code where I need to train on English data and test on Spanish data. I'll be using the MUSE library by Facebook for the word embeddings. I'm looking for a way to pre-process both datasets the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a careful way to remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list that I can later use with the MUSE library?
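
To make the question concrete, here is a minimal sketch of the kind of shared pipeline I have in mind, using only the Python standard library. Everything in it is an assumption for illustration: the `preprocess` and `build_vocab` helpers are hypothetical names, the stopword sets are tiny placeholders rather than real lists, and it assumes that lowercased tokens with accents kept intact are what should be looked up in the embeddings. It does not lemmatize, since that is one of the things I'm unsure about.

```python
import re
import unicodedata
from collections import Counter

# Tiny placeholder stopword sets (assumption); real lists would be much larger.
STOPWORDS = {
    "en": {"the", "a", "an", "and", "of", "to", "in", "is"},
    "es": {"el", "la", "los", "las", "un", "una", "y", "de", "en", "es"},
}

# Runs of Unicode letters only, so punctuation and digits are dropped
# while accented letters such as "ñ" or "á" are kept.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def preprocess(text, lang):
    """Apply the same steps to either language: normalize, lowercase, tokenize, filter."""
    # NFC makes "n" + combining tilde and precomposed "ñ" compare equal.
    text = unicodedata.normalize("NFC", text).lower()
    tokens = TOKEN_RE.findall(text)
    return [tok for tok in tokens if tok not in STOPWORDS[lang]]


def build_vocab(texts, lang, min_count=1):
    """Return the vocabulary sorted by descending frequency."""
    counts = Counter(tok for text in texts for tok in preprocess(text, lang))
    return [word for word, count in counts.most_common() if count >= min_count]


if __name__ == "__main__":
    print(preprocess("The dog and the cat!", "en"))
    print(preprocess("El niño está en la casa.", "es"))
    print(build_vocab(["El gato y el perro", "El gato"], "es"))
```

The only language-specific part here is the stopword set; the normalization, lowercasing, and tokenization are identical for English and Spanish, which is the uniformity I'm after.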