How to find all expressions ending with "<tr" (stringi package)


I have a string that is actually the HTML code for a table, for example:

z <- "<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">
   <CAPTION><B>MESA HIGH VICTORIES</B></CAPTION>
   <TH>Team</TH>
   <TH>Score</TH>
   <TR ALIGN=\"CENTER\">
   <TD><B>Parkfield High Demons</B></TD>
   <TD><B>28 to 21</B></TD>
   </TR>
   <TR ALIGN=\"CENTER\">
   <TD><B>Burns High Badgers</B></TD>
   <TD><B>14 to 13</B></TD>
   </TR>
   </TABLE>"

I want to extract this part:

<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\"> <CAPTION><B>MESA HIGH VICTORIES</B></CAPTION> <TH>Team</TH> <TH>Score</TH> <TR

So I want to extract the part of the string that starts with <TABLE and ends with the first "<TR".

The best I could do was to use a function from the stringi package:

stri_extract_all_regex(z, "(?i)\\<table.*?\\>(\\s+)?(\\<caption,*? \\>)?")

The output:

[[1]] [1] "<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">\n "

but it is still not what I meant. The only obligatory part of the string before the first "<TR" is "<TABLE" with some settings; the caption and headers are optional. Any ideas how to create a proper regex for this?


There are 2 answers

1
vks On BEST ANSWER
<TABLE\b[^>]+>[\s\S]+?<TR

Try this. See the demo:

http://regex101.com/r/vF0kU2/7
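For reference, here is a minimal sketch of how the pattern above might be used from stringi. The sample string is a shortened stand-in for the question's `z`, and the `(?i)` flag is an assumption carried over from the question's own attempt so that lower-case tags also match; neither is part of the answer as posted. Backslashes are doubled because the pattern is written as an R string literal.

```r
library(stringi)

# Shortened stand-in for the question's string `z` (illustrative only)
z <- "<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">
   <CAPTION><B>MESA HIGH VICTORIES</B></CAPTION>
   <TH>Team</TH>
   <TH>Score</TH>
   <TR ALIGN=\"CENTER\">
   <TD><B>Parkfield High Demons</B></TD>
   </TR>
   <TR ALIGN=\"CENTER\">
   <TD><B>Burns High Badgers</B></TD>
   </TR>
   </TABLE>"

# <TABLE plus its attributes, then anything (including newlines),
# matched lazily up to the first <TR. (?i) is an added assumption.
pattern <- "(?i)<TABLE\\b[^>]+>[\\s\\S]+?<TR"

# One character vector of matches per input string; take the first element
res <- stri_extract_all_regex(z, pattern)[[1]]
cat(res, "\n")
```

Because `[\s\S]+?` is lazy, the match stops at the first "<TR" rather than the last one. Note that `[^>]+` requires at least one character after "<TABLE", so a bare "<TABLE>" with no attributes would not match this pattern.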

0
Jim On

Using rex may make this type of task a little simpler.

z <- "<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">
   <CAPTION><B>MESA HIGH VICTORIES</B></CAPTION>
   <TH>Team</TH>
   <TH>Score</TH>
   <TR ALIGN=\"CENTER\">
   <TD><B>Parkfield High Demons</B></TD>
   <TD><B>28 to 21</B></TD>
   </TR>
   <TR ALIGN=\"CENTER\">
   <TD><B>Burns High Badgers</B></TD>
   <TD><B>14 to 13</B></TD>
   </TR>
   </TABLE>"

library(rex)
re_matches(z,
  rex(
    capture(name='table',
      "<TABLE", zero_or_more(any, type = 'lazy'), "<TR"
    )
  ), options='single-line')

However, I would not suggest parsing HTML with regular expressions at all. You may want to look into using the XML package or rvest instead.
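As a rough sketch of the parser-based route, the following uses xml2, the package rvest builds on. Choosing xml2 here is an assumption made for illustration, not something the answer prescribes, and the sample string is again a shortened stand-in for `z`. The HTML parser is lenient with markup like this and lower-cases tag names, which is why the XPath queries use `th` and `td`.

```r
library(xml2)

# Shortened stand-in for the question's string `z` (illustrative only)
z <- "<TABLE ALIGN=\"RIGHT\" BORDER CELLSPACING=\"0\" CELLPADDING=\"0\">
   <CAPTION><B>MESA HIGH VICTORIES</B></CAPTION>
   <TH>Team</TH>
   <TH>Score</TH>
   <TR ALIGN=\"CENTER\">
   <TD><B>Parkfield High Demons</B></TD>
   <TD><B>28 to 21</B></TD>
   </TR>
   </TABLE>"

# Parse the string as HTML instead of matching it with a regex
doc <- read_html(z)

# Pull out the header cells and the data cells by tag name
headers <- xml_text(xml_find_all(doc, "//th"))
cells   <- xml_text(xml_find_all(doc, "//td"))

print(headers)
print(cells)
```

Once the document is parsed, the optional caption and headers no longer need special handling in a pattern; each piece can be queried by tag name.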