In machine translation, sequence-to-sequence models have become very popular. They often use a few tricks to improve performance, such as ensembling or averaging a set of models. The logic here is that the errors will then "average out".
As I understand it, averaging means simply taking the mean of the parameters of X models and creating a single model that can then be used to decode the test data. Ensembling, however, averages the outputs of the X models at each decoding step. This is much more resource intensive, since all X models have to produce an output, whereas an averaged model is only run once on the test data. A toy sketch of the two is below.
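To make the contrast concrete, here is a minimal PyTorch sketch (not real seq2seq decoders; the `nn.Linear` layers are hypothetical stand-ins for three trained models with identical architecture, and `x` stands in for the decoder input at one step):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for X=3 trained models with the same architecture,
# each producing logits over a 5-word "vocabulary".
torch.manual_seed(0)
models = [nn.Linear(8, 5) for _ in range(3)]

x = torch.randn(1, 8)  # dummy input for one decoding step

# --- Parameter averaging: average the weights, run ONE model at test time ---
avg_state = {}
for key in models[0].state_dict():
    avg_state[key] = torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)

averaged_model = nn.Linear(8, 5)
averaged_model.load_state_dict(avg_state)
avg_probs = torch.softmax(averaged_model(x), dim=-1)

# --- Ensembling: run EVERY model, then average their output distributions ---
ens_probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models]).mean(dim=0)

print(avg_probs)  # softmax of the averaged parameters
print(ens_probs)  # average of the softmaxed outputs -- generally not the same thing
```

Note that the softmax of averaged parameters is in general not equal to the average of the softmaxed outputs, which is exactly where the two methods can diverge.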
What exactly is the difference here? How do the outputs differ? In my tests, both methods gave a small and similar improvement over a baseline score, which makes me wonder why people bother with ensembles if they could just average instead. However, in all the Neural Machine Translation papers I come across, people talk about ensembling rather than averaging. Why is this? Are there any papers on averaging (specifically for seq2seq and machine translation)?
Any help is greatly appreciated!
Ensembling is a more general term. Bagging and boosting are examples of ensemble methods.
For example, a random forest doesn't just average decision trees; it uses bagging: it first randomly samples data points and features and then trains each tree on that subsample (training every tree on all data and all features wouldn't make much sense, since the trees would end up very similar).
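A rough sketch of that bagging idea, if it helps (hypothetical toy data from scikit-learn; a real random forest also resamples features at each split rather than once per tree, so this is the general idea, not scikit-learn's internals):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification data: 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rng = np.random.default_rng(0)
trees, feature_subsets = [], []

for _ in range(25):
    # Bootstrap sample of the rows (sampling with replacement)...
    rows = rng.integers(0, len(X), size=len(X))
    # ...and a random subset of the features for this tree.
    cols = rng.choice(X.shape[1], size=5, replace=False)

    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    feature_subsets.append(cols)

# The ensemble prediction averages the per-tree probabilities.
probs = np.mean([t.predict_proba(X[:, cols])[:, 1]
                 for t, cols in zip(trees, feature_subsets)], axis=0)
preds = (probs > 0.5).astype(int)
print("training accuracy of the bagged trees:", (preds == y).mean())
```

The point is that the ensemble combines the *predictions* of genuinely different models; averaging their parameters instead would only make sense if the models were near-identical to begin with, which is why parameter averaging in NMT is usually done over checkpoints of the same training run rather than over independently trained models.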