Need to split Unicode string

I am using the Moses toolkit for my translation system. I trained it on an Assamese-English parallel corpus, but some proper nouns are not translated because the corpus (parallel data set) is very small. So I want to add a transliteration step to my translation system.

I am using this command for my translation: echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini

This gave me the output "কানাদা is a vast country".

This is because the word "কানাদা" is not in my parallel corpus.

So I took a parallel list of words in Assamese and English and split each word character-wise, so that each line of the two files holds a single word with a space between each character (or each syllable). I used these two files to train a transliteration system as a normal translation task.

Then I used the following command: echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl

This gave me the output "ক া ন া দ া is a vast country"

I had to split the word because I trained the transliteration system character-wise.

Then I ran the transliteration system that I trained, using the command:

echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini

This gave me the output "c a n a d a is a vast country"

The characters are transliterated, but the only problem is the spaces inside the word. So I want to use a Perl script that joins the characters back into a word. My final command will be:

echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl

Please help me write this "join.pl" file.

4

There are 4 answers

11
Toto (best answer)

How about:

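use utf8;
use feature 'say';                    # needed for say
binmode(STDOUT, ':encoding(UTF-8)');  # encode the output as UTF-8
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
# add a space after each character in U+0980..U+09FF that is followed by another one
$str =~ s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
say $str;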

output:

ভ া ৰ ত is a famous country. দ ি ল ্ ল ী is the capital of ভ া ৰ ত

You can use it in your program, just change the while loop to:

while(<>) {
    s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
    print $_;
}

But I think you wish to do:

my %corresp = (
    'ভ' => 'Bh',
    'া' => 'a',
    'ৰ' => 'ra',
    'ত' => 't',
);
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
$str =~ s/([\x{0980}-\x{09FF}])/exists($corresp{$1}) ? $corresp{$1} : $1/eg;
say $str;

Output:

Bharat is a famous country. দিল্লী is the capital of Bharat

NB: It's up to you to build the real correspondence hash. I don't know anything about Assamese characters.

0
David W.

You can use \p{...} and \P{...} which will allow you to match or not match particular character classes as specified in perluniprops.

I'm using the negated class [^\p{Latin}\s], which matches any character that is neither Latin nor whitespace, so spaces are not matched:

#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);

use utf8;
binmode(STDOUT, ':utf8');  # "use utf8" only covers literals in the source; output needs its own encoding layer

my $string = "ভাৰত is a famous country";
$string =~ s/([^\p{Latin}\s])/$1 /g;  # Put a space after all non-latin chars
say $string;

This will print out:

ভ া ৰ ত  is a famous country

The only problem is the double space after the last non-Latin character (ত).
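
One possible way to avoid that double space is to add the space only when the next character is not whitespace. The following is a minimal sketch of that idea, not a tested drop-in; the helper name space_non_latin is hypothetical and the sample string is the one used above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8
binmode(STDOUT, ':encoding(UTF-8)');   # encode the output as UTF-8

# Hypothetical helper: space out non-Latin characters without
# leaving a doubled space at the end of a word.
sub space_non_latin {
    my ($text) = @_;
    # Insert a space only when the next character is not whitespace.
    $text =~ s/([^\p{Latin}\s])(?=\S)/$1 /g;
    return $text;
}

print space_non_latin("ভাৰত is a famous country"), "\n";
```

This should print "ভ া ৰ ত is a famous country" with single spaces. Note that the class also matches punctuation and digits, so those get a trailing space too whenever they are directly followed by another character.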

6
terdon

It's doing exactly what you tell it to: @a=split('') splits the entire line, because you are not telling it to split only the first word. You first need to identify the substring you want to split, and then split it:

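#!/usr/bin/perl
use strict;
use warnings;
use utf8;

binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

while (my $line = <>)
{
    chomp $line;
    ## find the first word, capture it as $1 and delete it from the line
    if ($line =~ s/^\s*(\S+)\s*//) {
        ## split the captured word into single characters
        my @a = split(//, $1);
        ## print the spaced-out word followed by the rest of the line
        print join(" ", @a, length($line) ? $line : ()), "\n";
    }
    else {
        ## nothing to split on this line
        print "$line\n";
    }
}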
0
Joop Eggen

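Add something like

$str =~ s/(\w) (?=[\w.,;:!?])/$1/g;

which is intended to remove the space between Latin word characters, using a look-ahead. It is not 100% reliable: it also joins ordinary neighbouring words, because it cannot tell them apart from spaced-out characters.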
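
A narrower variant that might suit join.pl is to remove a space only when it sits between two single-character tokens, so that multi-character words such as "is" and "vast" are left alone. The following is a sketch under that assumption; the helper name join_single_chars is hypothetical, and it will still wrongly merge a genuine one-letter word (such as "a" or "I") that happens to stand next to a transliterated word.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: drop the space between two neighbouring
# single-character tokens and keep every other space.
sub join_single_chars {
    my ($line) = @_;
    # (?<!\S)   the character before the space starts a token
    # (\w)      and is a lone word character
    # (?=\w(?!\S))  the token after the space is also a lone word character
    $line =~ s/(?<!\S)(\w) (?=\w(?!\S))/$1/g;
    return $line;
}

# Filter mode: read the Moses output on STDIN, print the joined lines.
while (my $line = <STDIN>) {
    print join_single_chars($line);
}
```

With the output from the question, echo 'c a n a d a is a vast country' | ./join.pl should print "canada is a vast country".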