I have a corpus of text and I would like to find embeddings for words starting from a characters. So I have a sequence of characters as input and I want to project it into a multidimensional space.
As an initialization, I would like to fit already learned word embeddings (for example, the Google ones).
I have some doubts:
- Do I need use a character embedding vector for each input character in the input sequence? would it be a problem if I use simply the ascii or utf-8 encoding ?
- despite of what be the input vector definition (embedding vec, ascii,..)it's really confusing to select a proper model there are several options but im not sure which one is the better choice :seq2seq, auto-encoder, lstm, multi-regressor+lstm ?
- Could you give me any sample code by keras or tensorflow?
I answer each question:
If you want to exploit characters similarities (that area far relatives of phonetic similarities too), you need an embedding layer. Encodings are symbolic inputs while embeddings are continuous inputs. With symbolic knowledge any kind of generalization is impossible because you have no concept of distance (or similarity), while with embeddings you can behave similarly with similar inputs (and so generalizing). However, since the input space is very small, short embeddings are sufficient.
The model highly depends on the kind of phenomena you want to capture. A model that I see often in literature and seems working well in different task is a multilayer bidirectional-lstm on the characters with a linear layer in the top.
The code is similar to all the RNN implementation of Tensorflow. A good way to start is the Tensorflow tutorial https://www.tensorflow.org/tutorials/recurrent. The function for creating the bidirectional is https://www.tensorflow.org/api_docs/python/tf/nn/static_bidirectional_rnn
From experience, I had problems to fit word-based word embeddings using a character model. The reason is that a word-based model will put morphologically similar words very far if there is no semantic similarities. A character-based model can't do it because morphologically similar input cannot be distinguished very well (are very close in the embedded space).
This is one of the reason why, in literature, people often use characters-models as a plus to word models and not as "per se" models. It is an open research area if a character model can be enough to capture both semantic and morphological similarities.