I need to learn how to convert a text from one transliteration into another writing system. Apparently the best way involves regular expressions and Perl, probably run from the command line. I have used regular expressions before in Notepad++ and TextWrangler, so I already know some basics. If there is a really good (and relatively easy and customizable) way to do this in Ruby or some other language, I can start learning that as well. In my field, Uralic linguistics, there is a constant need to transliterate linguistic sample texts, and many different variants of transliteration systems are in use, so it is worth investing some time.
The material I have now consists of one sentence per line. Some lines contain other data, such as numbers, and those should stay as they are. I also want to keep the punctuation marks as they are; this is only about converting one set of Unicode letter characters to another. I searched the site, but most of what I found was about converting from ASCII to Unicode and so on, which is not the problem here.
So the original text is like this (in broad Finno-Ugric Transcription):
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.
And I would need it in a form like this:
мӧдiс иван велӧччыны печораӧ щӧтӧвӧднэй курс вылӧ.
This continues for some thousand lines.
There is a clear correspondence between the characters used, but it is sometimes complex: digraphs and certain consonant + vowel combinations have to be handled before single letters. As the example shows, Latin i corresponds to Cyrillic и in some positions but remains i in others. Different texts use different conventions, so I would need to adjust the rules in each case. I understand that I would need to run a long series of regular expressions in a very specific order to make this work. I can figure out the order myself, but I need to know what kind of tool I should feed these rules into, and how to do it.
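To make the idea of an ordered rule list concrete, here is a minimal sketch in Python (picked only as an example language; the same structure should carry over to Perl or Ruby). The rule table is a made-up fragment that covers just the sample sentence, not a complete mapping. In particular, the rule that i stays i after d, t, z, s, n, l, and the uniform e → е mapping, are assumptions for illustration only.

```python
import re
import unicodedata


def nfc(s):
    """Normalize to NFC so 'ö' is one code point, not 'o' + combining mark."""
    return unicodedata.normalize("NFC", s)


# Ordered (pattern, replacement) pairs: earlier rules run first.
# Hypothetical fragment covering only the sample sentence.
RAW_RULES = [
    # 1. Context-sensitive rule first, while the neighbours are still Latin:
    #    i becomes и unless it follows d, t, z, s, n or l (assumed rule).
    (r"(?<![dtzsnl])i", "и"),
    # 2. Digraphs before their component letters.
    (r"šč", "щ"),
    # 3. Plain one-to-one letters (e -> е everywhere is a simplification).
    ("ö", "ӧ"), ("ć", "ч"),
    ("a", "а"), ("d", "д"), ("e", "е"), ("j", "й"), ("k", "к"),
    ("l", "л"), ("m", "м"), ("n", "н"), ("o", "о"), ("p", "п"),
    ("r", "р"), ("s", "с"), ("t", "т"), ("u", "у"), ("v", "в"),
    ("y", "ы"),
]

# Compile once; normalize patterns too so they match NFC input.
RULES = [(re.compile(nfc(pat)), nfc(rep)) for pat, rep in RAW_RULES]


def transliterate(text):
    """Apply every rule in order to one line of text."""
    text = nfc(text)
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return nfc(text)


sample = "mödis ivan velöććyny pećoraö ščötövödnej kurs vylö."
print(transliterate(sample))
# prints: мӧдiс иван велӧччыны печораӧ щӧтӧвӧдней курс вылӧ.
```

Digits and punctuation pass through untouched because no rule matches them. Adapting the sketch to another text would mean editing or reordering the entries in `RAW_RULES`; moving the i rule below the consonant rules, for example, would break it, because the lookbehind expects Latin consonants.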
I also often have situations where I would like the original sentence and the transliterated one on the same line, separated by a tab, so that the lines look like this:
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.	мӧдiс иван велӧччыны печораӧ щӧтӧвӧдней курс вылӧ.
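The tab-separated layout could be produced by a small wrapper around whatever conversion function is used. The following Python sketch assumes a function `convert` that transliterates one line; `demo_convert` and `DEMO_TABLE` are hypothetical stand-ins with only four letters, there just so the example runs on its own. Passing lines without any letters (such as bare numbers) through unchanged is also an assumption.

```python
def with_transliteration(lines, convert):
    """Yield 'original<TAB>converted' for each line.

    Lines that contain no letters at all are yielded unchanged.
    """
    for line in lines:
        text = line.rstrip("\n")
        if any(ch.isalpha() for ch in text):
            yield f"{text}\t{convert(text)}"
        else:
            yield text


# Hypothetical stand-in converter covering only the letters of "kurs".
DEMO_TABLE = str.maketrans({"k": "к", "u": "у", "r": "р", "s": "с"})


def demo_convert(text):
    """Toy transliteration: replace the four letters in DEMO_TABLE."""
    return text.translate(DEMO_TABLE)


result = list(with_transliteration(["kurs\n", "42\n"], demo_convert))
for row in result:
    print(row)
```

For a real file, `lines` could be an open file object, since iterating over it yields one line at a time, and `convert` would be the full rule-based function instead of the toy table.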
Of course there are many more questions, but after learning these basics I think I can move forward independently. Learning this would help me a lot. Thanks in advance!
Niko