I am attempting to scrape a rather difficult PDF in R using both pdftools::pdf_text
and tabulizer::extract_tables
. However, in my situation, neither of these seems to be too helpful based on the nature of the PDF. The PDF contains "nested" information, as shown in the picture.
What is the best way to approach this? Splitting by white space using stringr::str_split_fixed
with n=3
gave me matrix, but it seems too difficult to create a regular expression to detect the information I want (only after the Description, and Incident Date/Time) within each column.
I think a regular expressions approach isn't that complicated: