R - checking HTML for formatting tags (bold, italics etc.)


I am using edgarWebR to parse 10-K (SEC EDGAR) filings. I am trying to write an algorithm that deduces whether each HTML element is normal text, a subheading, or a heading by checking how the document is formatted (e.g. some 10-Ks might have all headings in bold italics and subheadings in just italics).

edgarWebR returns a data frame in which each row corresponds to one element and contains its text and HTML. An example of some HTML:

<p style="margin-top:18px;margin-bottom:0px"><font style="font-family:ARIAL" size="2"><b><i>Our quarterly operating results have fluctuated in the past and might continue to fluctuate, causing the value of our common stock to decline substantially. </i></b></font></p>

As we can see, the above should be flagged as bold and italic. However, this is represented differently in different filings. For example, this filing uses <b> to denote bold, whereas others use an inline style such as font-weight:bold.

What is the best way to deal with this? Is there an R package that will parse the HTML and either tell me that it is bold and italic, or return a list of the tags that are specifically formatting tags (not span, p, etc.)?

Alternatively, how can I check each row against a manually compiled list of indicators of bold and italic ("bold", <b>, strong) and have it return the elements of the list that are matched for each row?

At the end, I plan to tabulate the values to determine heading levels. E.g. if I count 100 elements with neither bold nor italic, 20 elements with just <b>, and 10 elements containing both <b> and "italic", I can deduce that bold plus italic represents headings for this particular filing, and bold alone denotes subheadings.

Answer by r2evans

I think all you're looking for is whether a particular string contains HTML markup indicating that something in that string should be bold and/or italic.

S <- '<p style="margin-top:18px;margin-bottom:0px"><font style="font-family:ARIAL" size="2"><b><i>Our quarterly operating results have fluctuated in the past and might continue to fluctuate, causing the value of our common stock to decline substantially. </i></b></font></p>'
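# Bold: a <b> or <strong> tag, or a CSS font-weight of bold/bolder/600-900.
# font-weight is a CSS property (name:value inside style="..."), not a tag.
grepl("<(b|strong)\\b|font-weight\\s*:\\s*(bold|[6-9]00)", S, ignore.case = TRUE, perl = TRUE)
# [1] TRUE
# Italic: an <i> or <em> tag, or a CSS font-style of italic/oblique.
grepl("<(i|em)\\b|font-style\\s*:\\s*(italic|oblique)", S, ignore.case = TRUE, perl = TRUE)
# [1] TRUE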
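
To apply the same idea to every row and then tabulate the results, as the question describes, something like the following might work. It is only a sketch in base R: the data frame `filing` and its column name `raw` are hypothetical stand-ins for what edgarWebR returns, and the two patterns are my own guesses at common bold/italic indicators, so extend them for the filings you actually see.

```r
# Hypothetical stand-in for the parsed filing; the column name `raw`
# is an assumption, so adjust it to match your data frame.
filing <- data.frame(
  raw = c(
    '<p><font size="2"><b><i>Risk factor heading</i></b></font></p>',
    '<p style="font-weight:bold">Subheading</p>',
    '<p><em>Another subheading style</em></p>',
    '<p>Plain body text.<br>Second line.</p>'
  ),
  stringsAsFactors = FALSE
)

# Manually compiled indicators: one regular expression per format.
# The \\b keeps <b from matching <br> or <body>, and <i from matching <img>.
indicators <- c(
  bold   = "<(b|strong)\\b|font-weight\\s*:\\s*(bold|[6-9]00)",
  italic = "<(i|em)\\b|font-style\\s*:\\s*(italic|oblique)"
)

# Logical matrix: one row per HTML element, one column per indicator.
flags <- do.call(cbind, lapply(indicators, function(p) {
  grepl(p, filing$raw, ignore.case = TRUE, perl = TRUE)
}))

# Collapse each row's matches into a label such as "bold+italic".
filing$format <- apply(flags, 1, function(r) {
  if (any(r)) paste(names(r)[r], collapse = "+") else "none"
})

# Count how often each combination occurs in this filing.
format_counts <- table(filing$format)
print(format_counts)
```

The rarest combinations in `format_counts` would then be your heading candidates. Note that this only checks whether an indicator appears somewhere in the element's HTML; it does not confirm that the formatting wraps the whole text, so an element with a single bold word inside a paragraph would also be flagged.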