How to check if words lie across sentences?


I have code that reads certain tagged words from text files and displays them in pairs, depending on their occurrence in a paragraph. For example:

Hi I am <PER>Rita</PER>.I live in <LOC>Canada</LOC>
Hi I am <PER>Jane</PER> and I do not live in <LOC>Canada</LOC>

Output

Rita Canada
Jane Canada

(Note: This is not an XML file.)
I want to output the pair (Rita Canada) = 1 [as there is a full stop between their occurrences] and (Jane Canada) = 0 [as no full stop occurs between them].
Here is my code to output the names paragraph by paragraph. Can you help me identify the full stops?

private static final Pattern personPattern = Pattern.compile("<PER>(.+?)</PER>");
private static final Pattern locationPattern = Pattern.compile("<LOC>(.+?)</LOC>");

// The loop below runs inside a method; listOfFiles is populated elsewhere.
for (File file : listOfFiles)
{
    BufferedReader input = new BufferedReader(new FileReader(file));

    String line = "";
    while ((line = input.readLine()) != null)
    {
        // Collect every tagged person on this line.
        ArrayList<String> persons = new ArrayList<String>();
        Matcher m_person = personPattern.matcher(line);
        while (m_person.find())
        {
            persons.add(m_person.group(1));
        }

        // Collect every tagged location on this line.
        ArrayList<String> locations = new ArrayList<String>();
        Matcher m_location = locationPattern.matcher(line);
        while (m_location.find())
        {
            locations.add(m_location.group(1));
        }

        // Print every person/location combination found on the line.
        for (int i = 0; i < persons.size(); i++)
        {
            for (int j = 0; j < locations.size(); j++)
            {
                System.out.println(persons.get(i) + "\t" + locations.get(j));
            }
        }
    }
    input.close();
}

There is 1 answer

Vasili Syrakis

Does the PER tag always come before the LOC tag? Are they sometimes in different places?
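If the order can vary, one way to avoid depending on it is to compare the match offsets of the two tags and look for a full stop in the text between them. The following is only a minimal sketch under some assumptions: each line holds one paragraph (as in the question's loop), any '.' between the tags counts as a full stop, and the class and method names (SentencePairs, labelPairs) are hypothetical, not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and method are illustrative only.
final class SentencePairs {
    private static final Pattern PERSON = Pattern.compile("<PER>(.+?)</PER>");
    private static final Pattern LOCATION = Pattern.compile("<LOC>(.+?)</LOC>");

    // Returns one "person<TAB>location<TAB>flag" entry per pair found in the line.
    // flag is 1 when a '.' occurs in the text between the two tags, otherwise 0.
    static List<String> labelPairs(String line) {
        List<String> result = new ArrayList<>();
        Matcher person = PERSON.matcher(line);
        while (person.find()) {
            Matcher location = LOCATION.matcher(line);
            while (location.find()) {
                // Text between the two tagged spans, whichever one comes first.
                int from = Math.min(person.end(), location.end());
                int to = Math.max(person.start(), location.start());
                String between = from < to ? line.substring(from, to) : "";
                int flag = between.indexOf('.') >= 0 ? 1 : 0;
                result.add(person.group(1) + "\t" + location.group(1) + "\t" + flag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(labelPairs("Hi I am <PER>Rita</PER>.I live in <LOC>Canada</LOC>"));
        System.out.println(labelPairs("Hi I am <PER>Jane</PER> and I do not live in <LOC>Canada</LOC>"));
    }
}
```

For the two sample lines this yields Rita/Canada with flag 1 and Jane/Canada with flag 0. Note that a plain '.' test also fires on abbreviations such as "Dr." or on decimal numbers, so real text may need a stricter sentence-boundary rule.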

In the regex below, the first alternative uses a positive lookahead (?=...) containing an atomic group (?>\.). It succeeds only if a literal full stop (\.) immediately follows </PER>; otherwise that alternative fails.

It is followed by an alternation (|) with a second capture group, so the pattern still matches when no \. follows the tag.

<PER>(.+?)</PER>(?=(?>\.))|<PER>(.+?)</PER>

Capture group 1: Rita

Capture group 2: Jane
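In Java, which capture group is non-null tells you which alternative matched. The following is a rough sketch of how that could be read off; only the pattern comes from this answer, while the class name FullStopAfterPerson, the method firstPerson, and the "name=flag" output format are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer.
final class FullStopAfterPerson {
    private static final Pattern PERSON =
        Pattern.compile("<PER>(.+?)</PER>(?=(?>\\.))|<PER>(.+?)</PER>");

    // Returns "name=1" if a full stop directly follows </PER>, "name=0" otherwise,
    // or null when the line has no <PER> tag. Only the first tag is inspected.
    static String firstPerson(String line) {
        Matcher m = PERSON.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Group 1 is set by the lookahead alternative, group 2 by the fallback.
        return m.group(1) != null ? m.group(1) + "=1" : m.group(2) + "=0";
    }

    public static void main(String[] args) {
        System.out.println(firstPerson("Hi I am <PER>Rita</PER>.I live in <LOC>Canada</LOC>"));
        System.out.println(firstPerson("Hi I am <PER>Jane</PER> and I do not live in <LOC>Canada</LOC>"));
    }
}
```

Two limits are worth keeping in mind. The lookahead only sees the character right after </PER>, so a full stop that appears later, but still before the <LOC> tag, is reported as 0. Also, if a line holds several <PER> tags, the lazy (.+?) in the first alternative can stretch across tags until it finds a </PER> followed by a full stop, so group 1 may then contain more than one name.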