I have a string that contains the HTML code for a table, for example:
z <- "<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">
<CAPTION><B>MESA HIGH VICTORIES</B></CAPTION>
<TH>Team</TH>
<TH>Score</TH>
<TR ALIGN=\"CENTER\">
<TD><B>Parkfield High Demons</B></TD>
<TD><B>28 to 21</B></TD>
</TR>
<TR ALIGN=\"CENTER\">
<TD><B>Burns High Badgers</B></TD>
<TD><B>14 to 13</B></TD>
</TR>
</TABLE>"
I want to extract this part of it:
<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">
<CAPTION><B>MESA HIGH VICTORIES</B></CAPTION>
<TH>Team</TH>
<TH>Score</TH>
<TR
So I want to extract the part of the string that starts with <TABLE
and ends with the first "<TR".
The best I could do was to use a function from the stringi
package:
stri_extract_all_regex(z, "(?i)\\<table.*?\\>(\\s+)?(\\<caption,*? \\>)?")
The output is:
[[1]]
[1] "<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">\n "
but this is still not what I meant. The only obligatory part of the string before the first "<TR"
is "<TABLE"
with some settings; the caption and headers are optional. Any ideas how to create a proper regex for it?
Try this. See the demo:
http://regex101.com/r/vF0kU2/7
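For reference, here is a minimal sketch of one pattern that returns the requested part. It is not necessarily the pattern from the demo link, and the names `pattern` and `res` are only illustrative. The idea is to let `.` match newlines (the `s` flag) and to stop lazily at the first `<tr`, so the caption and headers are picked up when present but are not required. It uses base R with `perl = TRUE` so that it runs without extra packages.

```r
z <- "<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">
<CAPTION><B>MESA HIGH VICTORIES</B></CAPTION>
<TH>Team</TH>
<TH>Score</TH>
<TR ALIGN=\"CENTER\">
<TD><B>Parkfield High Demons</B></TD>
<TD><B>28 to 21</B></TD>
</TR>
<TR ALIGN=\"CENTER\">
<TD><B>Burns High Badgers</B></TD>
<TD><B>14 to 13</B></TD>
</TR>
</TABLE>"

# (?i) ignores case, (?s) lets "." match newlines,
# ".*?" is lazy, so the match ends at the FIRST "<tr".
pattern <- "(?is)<table\\b.*?<tr\\b"

# regexpr() finds the first match; regmatches() extracts it.
res <- regmatches(z, regexpr(pattern, z, perl = TRUE))
cat(res, "\n")

# With stringi loaded, stri_extract_first_regex(z, pattern)
# should give the same result (assumption: not run here).
```

The original attempt stops after the opening tag because `.` does not match a newline by default, and the `caption,*? ` group requires a literal space that the input does not contain.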