I am trying to capture multiple groups repeatedly in a string, using a backreference to a group within the regex. Even though I am using a Pattern, a Matcher, and a "while(matcher.find())" loop, it captures only the last instance instead of all of them. In my case the only possible tags are <sm>, <po>, <pof>, <pos>, <poi>, <pol>, <poif>, <poil>. Since these are formatting tags, I need to capture:
- any text outside of a tag, so that I can format it as "normal" text. I am going about this by capturing any text before a tag in one group and the tag itself in another group; as I iterate through the occurrences I remove everything that has been captured from the original String, and any text left over at the end is also formatted as "normal" text
- the "name" of the tag so that I know how I will have to format the text inside the tag
- the text contents of the tag, which will be formatted according to the tag name and its associated rules
Here is my sample code:
String currentText = "the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po><poil>for out of man this one has been taken.”</poil>";
String remainingText = currentText;
//first check if our string even has any kind of xml tag, because if not we will just format the whole string as "normal" text
if(currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*"))
{
    //an opening or closing tag has been found, so let us start our pattern captures
    //I am using a backreference \\2 to make sure the closing tag is the same as the opening tag
    Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
    Matcher matcher1 = pattern1.matcher(currentText);
    int iteration = 0;
    while(matcher1.find()){
        System.out.print("Iteration ");
        System.out.println(++iteration);
        System.out.println("group1:"+matcher1.group(1));
        System.out.println("group2:"+matcher1.group(2));
        System.out.println("group3:"+matcher1.group(3));
        System.out.println("group4:"+matcher1.group(4));
        if(matcher1.group(1) != null && matcher1.group(1).isEmpty() == false)
        {
            m_xText.insertString(xTextRange, matcher1.group(1), false);
            remainingText = remainingText.replaceFirst(matcher1.group(1), "");
        }
        if(matcher1.group(4) != null && matcher1.group(4).isEmpty() == false)
        {
            switch (matcher1.group(2)) {
                case "pof": [...]
                case "pos": [...]
                case "poif": [...]
                case "po": [...]
                case "poi": [...]
                case "pol": [...]
                case "poil": [...]
                case "sm": [...]
            }
            remainingText = remainingText.replaceFirst("<"+matcher1.group(2)+">"+matcher1.group(4)+"</"+matcher1.group(2)+">", "");
        }
    }
}
The loop body runs only once, and the console shows these results:
Iteration 1:
group1:the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po>
group2:poil
group3:po
group4:for out of man this one has been taken.”
Group 3 can be ignored; the only useful groups are 1, 2 and 4 (group 3 is part of group 2). Why does this capture only the last tag instance, "poil", and not the preceding "pof", "poi", and "po" tags?
The output I would like to see is:
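To narrow the behaviour down, here is a stripped-down reproduction. It is only a sketch: the class name GreedyTagDemo, the helper tagNames, and the shortened input string are assumptions made for illustration and are not part of my real code. The pattern itself is copied unchanged from the sample above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: isolates the pattern from the formatting code.
class GreedyTagDemo {
    // Same pattern as in the sample code, including the greedy (.*) in group 1.
    static final Pattern TAG_PATTERN = Pattern.compile(
            "(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Collects the tag name (group 2) of every match that find() reports.
    static List<String> tagNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG_PATTERN.matcher(text);
        while (m.find()) {
            names.add(m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the real text (assumption for illustration).
        String text = "a<pof>b</pof><poi>c</poi><po>d</po><poil>e</poil>";
        Matcher m = TAG_PATTERN.matcher(text);
        while (m.find()) {
            // Print where each match starts and ends relative to the input.
            System.out.println("match " + m.start() + ".." + m.end()
                    + " of " + text.length() + ", tag=" + m.group(2));
        }
        System.out.println("tags found: " + tagNames(text));
    }
}
```

On the shortened input this reports a single match whose tag is poil. That match starts at index 0 and ends at the end of the string, so nothing is left for a second find() to match.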
Iteration 1:
group1:the man said:
group2:pof
group3:po
group4:“This one, at last, is bone of my bones
Iteration 2:
group1:
group2:poi
group3:po
group4:and flesh of my flesh;
Iteration 3:
group1:
group2:po
group3:po
group4:This one shall be called ‘woman,’
Iteration 4:
group1:
group2:poil
group3:po
group4:for out of man this one has been taken.”
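I just found the answer to this problem: it simply needed a non-greedy quantifier in the first capture group, just like the one I already had in the fourth capture group. The greedy (.*) was swallowing the whole string and backtracking only as far as the last tag, so the first match consumed everything and left nothing for the next find(). This works exactly as needed: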