Why are special characters like () "" : [] often removed from data before training a machine translation model?

I see that people often remove special characters like () "" : [] from data before training a machine translation model. Could you explain the benefits of doing so?
798 views · Asked by phan-anh.tuan

There is 1 answer
Data clean-up or pre-processing is performed so that algorithms can focus on important, linguistically meaningful "words" instead of "noise". See "Removing Special Characters":
Whenever this noise finds its way into a model, it can produce output at inference time that contains these unexpected characters (or sequences of them), and it can even degrade the overall translation quality. This is a frequent problem with brackets in Japanese translations.
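As a minimal sketch of such pre-processing (a regex-based approach; the exact character set is an assumption for illustration and should be tuned to your corpus and tokenizer):

```python
import re

# Characters commonly stripped before training an NMT model.
# This set is an assumption for illustration; adjust it to your data.
SPECIAL_CHARS = re.compile(r'[()\[\]{}"“”:;]')

def clean_line(line: str) -> str:
    """Replace special characters with spaces, then collapse whitespace."""
    line = SPECIAL_CHARS.sub(' ', line)
    return re.sub(r'\s+', ' ', line).strip()

print(clean_line('He said: "wait [here]" (quietly).'))
# → 'He said wait here quietly .'
```

Note that replacing with a space (rather than the empty string) avoids accidentally gluing adjacent tokens together, at the cost of stray spaces before punctuation, which most subword tokenizers handle gracefully.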