Reading PDF portfolio in R


Is it possible to read/convert PDF portfolios in R?

I usually use pdftools, but with this file I get an error:

library(pdftools)
#> Using poppler version 0.73.0

link <- c("http://www.accessdata.fda.gov/cdrh_docs/pdf19/K190072.pdf")

pdftools::pdf_convert(link, dpi = 600)
#> Converting page 1 to K190072_1.png...
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#>  done!
#> [1] "K190072_1.png"

Created on 2021-05-06 by the reprex package (v1.0.0)

The resulting K190072_1.png is only an image of the portfolio's front page.

I am interested in the document K190072.510kSummary.Final_Sent001.pdf contained in this PDF portfolio.

I found a way to do this in Python (Reading a PDF Portfolio in Python?), but I would really like to do it in R.

Thank you for your help.

There is 1 answer

Answer by user12728748 (best answer)

There seems to be an issue with pdf_convert when it is given one-page raw PDF data: under these conditions it tries to build the output file name from basename(pdf), which does not work on raw data. I have therefore edited the function so that it also works with the second attached PDF file.

If you only need the first file then you could run this with the original pdf_convert function, but it will give an error with the second file.
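To illustrate why the guard matters, here is a minimal sketch in base R. It assumes nothing about pdftools itself; raw_pdf is just a stand-in for attachment data and is not a real PDF.

```r
# Stand-in for attachment data: the raw bytes of "%PDF" (not a real PDF).
raw_pdf <- charToRaw("%PDF")

# basename() expects a character path, so calling it on raw bytes signals an error.
res <- tryCatch(basename(raw_pdf), error = function(e) e)
basename_failed <- inherits(res, "error")

# With the added !is.raw() check, raw input never reaches the basename() branch,
# even when only a single file name is supplied.
filenames <- "page-1.png"
enters_branch <- length(filenames) < 2 && !is.raw(raw_pdf)
```

A multi-page attachment passes several file names and so skips that branch anyway, which is why only the one-page attachment fails with the original function.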

If you are interested in rendering raster graphics from the attached files, this worked for me:

library(pdftools)
#> Using poppler version 21.02.0
link <- c("http://www.accessdata.fda.gov/cdrh_docs/pdf19/K190072.pdf")

pdf_convert <- function (pdf, format = "png", pages = NULL, filenames = NULL, 
          dpi = 72, antialias = TRUE, opw = "", upw = "", verbose = TRUE) {
    config <- poppler_config()
    if (!config$can_render || !length(config$supported_image_formats)) 
        stop("Your version of libpoppler does not support rendering")
    format <- match.arg(format, poppler_config()$supported_image_formats)
    if (is.null(pages)) 
        pages <- seq_len(pdf_info(pdf, opw = opw, upw = upw)$pages)
    if (!is.numeric(pages) || !length(pages)) 
        stop("Argument 'pages' must be a one-indexed vector of page numbers")
    if (length(filenames) < 2 && !is.raw(pdf)) {   # added !is.raw(pdf): raw data has no file name for basename()
        input <- sub(".pdf", "", basename(pdf), fixed = TRUE)
        filenames <- if (length(filenames)) {
            sprintf(filenames, pages, format)
        }
        else {
            sprintf("%s_%d.%s", input, pages, format)
        }
    }
    if (length(filenames) != length(pages)) 
        stop("Length of 'filenames' must be one or equal to 'pages'")
    antialiasing <- isTRUE(antialias) || isTRUE(antialias == 
                                                    "draw")
    text_antialiasing <- isTRUE(antialias) || isTRUE(antialias == 
                                                         "text")
    pdftools:::poppler_convert(pdftools:::loadfile(pdf), format, pages, filenames, 
                    dpi, opw, upw, antialiasing, text_antialiasing, verbose)
}

lapply(pdf_attachments(link), function(x) pdf_convert(x$data, 
    filenames=paste0(tools::file_path_sans_ext(x$name), "-", 
                     seq_along(pdf_data(x$data)), ".png")))
#> Converting page 1 to K190072.510kSummary.Final_Sent001-1.png... done!
#> Converting page 2 to K190072.510kSummary.Final_Sent001-2.png... done!
#> Converting page 3 to K190072.510kSummary.Final_Sent001-3.png... done!
#> Converting page 4 to K190072.510kSummary.Final_Sent001-4.png... done!
#> Converting page 5 to K190072.510kSummary.Final_Sent001-5.png... done!
#> Converting page 1 to K190072.IFU.FINAL_Sent001-1.png... done!
#> Converting page 1 to K190072.Letter.SE.FINAL_Sent001-1.png... done!
#> Converting page 2 to K190072.Letter.SE.FINAL_Sent001-2.png... done!
#> [[1]]
#> [1] "K190072.510kSummary.Final_Sent001-1.png"
#> [2] "K190072.510kSummary.Final_Sent001-2.png"
#> [3] "K190072.510kSummary.Final_Sent001-3.png"
#> [4] "K190072.510kSummary.Final_Sent001-4.png"
#> [5] "K190072.510kSummary.Final_Sent001-5.png"
#> 
#> [[2]]
#> [1] "K190072.IFU.FINAL_Sent001-1.png"
#> 
#> [[3]]
#> [1] "K190072.Letter.SE.FINAL_Sent001-1.png"
#> [2] "K190072.Letter.SE.FINAL_Sent001-2.png"

Created on 2021-05-05 by the reprex package (v2.0.0)
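
If you want the embedded PDF files themselves rather than PNG renderings, the raw data of each attachment can probably be written straight to disk. The following is only a sketch: save_attachments is a hypothetical helper (not part of pdftools), and it assumes each attachment is a list with name and data elements, as used in the code above.

```r
# Hypothetical helper (not part of pdftools): write each attachment's raw
# bytes to `dir` under its own name and return the paths that were written.
save_attachments <- function(attachments, dir = tempdir()) {
  vapply(attachments, function(x) {
    # basename() keeps the output inside `dir` even if the name contains a path
    path <- file.path(dir, basename(x$name))
    writeBin(x$data, path)
    path
  }, character(1))
}

# Intended use with the portfolio from the question
# (needs pdftools and network access, so it is left commented out):
# files <- save_attachments(pdftools::pdf_attachments(link))
# text  <- pdftools::pdf_text(files[1])
```

The saved files are ordinary PDFs, so they should then be readable with the usual pdftools functions such as pdf_text or pdf_convert.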