tabulizer package in R


I want to extract the table on page 112 of this PDF document:

http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040

# report 2017 
url_location <-"http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040"
out <- extract_tables(url_location, pages = 112)

I have tried using these tutorials (link1, link2) about the 'tabulizer' package, but I largely failed. There are some difficult aspects that I am not experienced enough to handle in R.

Can someone suggest something and help me with that?

Installation

devtools::install_github("ropensci/tabulizer")

# load package
library(tabulizer)
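For reference, a minimal `tabulizer` call might look like the sketch below. This assumes a working Java/rJava setup, and the `guess` and `output` arguments are assumptions based on recent versions of the package, so they may need adjusting:

```r
# a sketch only -- tabulizer needs a working Java installation (rJava),
# and this fetches a file from the network
library(tabulizer)

url_location <- "http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040"

# download once so repeated attempts don't re-fetch the file
local_pdf <- file.path(tempdir(), "global-wealth-databook.pdf")
download.file(url_location, local_pdf, mode = "wb")

# guess = TRUE lets Tabula try to locate the table region itself;
# output = "data.frame" (assumed available in your version) returns
# data frames instead of character matrices
out <- extract_tables(local_pdf, pages = 112, guess = TRUE, output = "data.frame")
```

If the auto-detected table boundaries are off, `extract_tables()` also accepts an `area` argument to pin down the region manually.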

1 Answer

hrbrmstr (accepted answer, 4 votes):

Java deps — while getting easier to deal with — aren't necessary when the tables are this clean. Just a bit of string wrangling will get you what you need:

library(pdftools)
library(stringi)
library(tidyverse)

# read it with pdftools
book <- pdf_text("global-wealth-databook.pdf")

# go to the page (printed page 112 is the 113th element of the pdf_text output)
lines <- stri_split_lines(book[[113]])[[1]] 

# remove footer
lines <- discard(lines, stri_detect_fixed, "Credit Suisse")

# find line before start of table
start <- last(which(stri_detect_regex(lines, "^[[:space:]]+")))+1

# find line after table
end <- last(which(lines == ""))-1

# smush into something read.[table|csv] can read
tab <- paste0(stri_replace_all_regex(lines[start:end], "[[:space:]][[:space:]]+", "\t"), collapse="\n")

#read it
read.csv(text=tab, header=FALSE, sep="\t", stringsAsFactors = FALSE)
##                   V1      V2      V3      V4    V5   V6    V7    V8    V9  V10
## 1              Egypt  56,036   3,168     324  98.1  1.7   0.2   0.0 100.0 91.7
## 2        El Salvador   3,957  14,443   6,906  66.0 32.8   1.2   0.0 100.0 65.7
## 3  Equatorial Guinea     670   8,044   2,616  87.0 12.2   0.7   0.1 100.0 77.3
## 4            Eritrea   2,401   3,607   2,036  94.5  5.4   0.1 100.0  57.1   NA
## 5            Estonia   1,040  43,158  27,522  22.5 72.2   5.1   0.2 100.0 56.4
## 6           Ethiopia  49,168     153     103 100.0  0.0 100.0  43.4    NA   NA
## 7               Fiji     568   6,309   3,059  85.0 14.6   0.4   0.0 100.0 68.2
## 8            Finland   4,312 159,098  57,850  30.8 33.8  33.5   1.9 100.0 76.7
## 9             France  49,239 263,399 119,720  25.3 21.4  49.3   4.0 100.0 70.2
## 10             Gabon   1,098  15,168   7,367  62.0 36.5   1.5   0.0 100.0 68.4
## 11            Gambia     904     898     347  99.2  0.7   0.0 100.0  72.4   NA
## 12           Georgia   2,950  19,430   9,874  50.7 47.6   1.6   0.1 100.0 66.8
## 13           Germany  67,244 203,946  47,091  29.5 33.7  33.9   2.9 100.0 79.1
## 14             Ghana  14,574     809     411  99.5  0.5   0.0 100.0  66.1   NA
## 15            Greece   9,020 111,684  54,665  20.7 52.9  25.4   1.0 100.0 67.7
## 16           Grenada      70  17,523   4,625  74.0 24.3   1.5   0.2 100.0 81.5
## 17            Guinea   5,896     814     374  99.4  0.6   0.0 100.0  69.7   NA
## 18     Guinea-Bissau     884     477     243  99.8  0.2 100.0  65.6    NA   NA
## 19            Guyana     467   5,345   2,510  89.0 10.7   0.3   0.0 100.0 67.2
## 20             Haiti   6,172   2,879     894  96.2  3.6   0.2   0.0 100.0 76.9
## 21         Hong Kong   6,172 193,248  46,079  26.3 50.9  20.9   1.9 100.0 85.1
## 22           Hungary   7,846  39,813  30,111  11.8 83.4   4.8   0.0 100.0 45.3
## 23           Iceland     245 587,649 444,999  13.0 72.0  15.0 100.0  46.7   NA
## 24             India 834,608   5,976   1,295  92.3  7.2   0.5   0.0 100.0 83.0
## 25         Indonesia 167,559  11,001   1,914  81.9 17.0   1.1   0.1 100.0 83.7
## 26              Iran  56,306   3,831   1,856  94.1  5.7   0.2   0.0 100.0 67.3
## 27           Ireland   3,434 248,466  84,592  31.2 22.7  42.3   3.6 100.0 81.3
## 28            Israel   5,315 198,406  78,244  22.3 38.7  36.7   2.3 100.0 74.2
## 29             Italy  48,544 223,572 124,636  21.3 22.0  54.1   2.7 100.0 66.0
## 30           Jamaica   1,962   9,485   3,717  79.0 20.2   0.8   0.0 100.0 74.3
## 31             Japan 105,228 225,057 123,724   7.9 35.7  53.9   2.6 100.0 60.9
## 32            Jordan   5,212  13,099   6,014  65.7 33.1   1.2   0.0 100.0 76.1
## 33        Kazakhstan  12,011   4,441     334  97.6  2.1   0.3   0.0 100.0 92.6
## 34             Kenya  23,732   1,809     662  97.4  2.5   0.1   0.0 100.0 77.2
## 35             Korea  41,007 160,609  67,934  20.0 40.5  37.8   1.7 100.0 70.0
## 36            Kuwait   2,996  97,304  37,788  30.3 48.3  20.4   1.0 100.0 76.9
## 37        Kyrgyzstan   3,611   4,689   2,472  92.7  7.0   0.2   0.0 100.0 62.9
## 38              Laos   3,849   5,662   1,382  94.6  4.7   0.7   0.0 100.0 84.9
## 39            Latvia   1,577  27,631  17,828  29.0 68.6   2.2   0.1 100.0 53.6
## 40           Lebanon   4,085  24,161   6,452  69.0 28.5   2.3   0.2 100.0 82.0
## 41           Lesotho   1,184   3,163     945  95.9  3.8   0.3   0.0 100.0 79.8
## 42           Liberia   2,211   2,193     959  97.3  2.6   0.1   0.0 100.0 71.6
## 43             Libya   4,007  45,103  24,510  29.6 61.1   9.2   0.2 100.0 59.9
## 44         Lithuania   2,316  27,507  17,931  27.3 70.4   2.1   0.1 100.0 51.6
## 45        Luxembourg     450 313,687 167,664  17.0 20.0  58.8   4.2 100.0 68.1
## 46         Macedonia   1,607   9,044   5,698  77.0 22.5   0.5   0.0 100.0 56.4
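The squash-runs-of-spaces-into-tabs trick generalizes beyond this page. Here is a tiny self-contained illustration of the same idea, using base R `gsub()` in place of `stri_replace_all_regex()` and a couple of made-up rows:

```r
# toy stand-ins for lines pulled off a fixed-width PDF page
rows <- c(
  "Egypt         56,036    3,168     324",
  "El Salvador    3,957   14,443   6,906"
)

# collapse runs of 2+ whitespace characters to a single tab;
# single spaces (inside "El Salvador") survive intact
tab <- paste0(gsub("[[:space:]]{2,}", "\t", rows), collapse = "\n")

# now it's a TSV that read.csv can parse
df <- read.csv(text = tab, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
df
```

Note the values keep their thousands separators (`"56,036"`), so they land as character columns; a follow-up pass with `gsub(",", "", ...)` and `as.numeric()` would finish the job.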

UPDATE

This is more generic but you'll still have to do some manual cleanup. I think you would even if you used Tabula.

library(pdftools)
library(stringi)
library(tidyverse)

# read it with pdftools
book <- pdf_text("~/Downloads/global-wealth-databook.pdf")

transcribe_page <- function(book, pg) {

  # go to the page
  lines <- stri_split_lines(book[[pg]])[[1]] 

  # remove footer
  lines <- discard(lines, stri_detect_fixed, "Credit Suisse")

  # find line before start of table
  start <- last(which(stri_detect_regex(lines, "^[[:space:]]+")))+1

  # find line after table
  end <- last(which(lines == ""))-1

  # get the target rows
  rows <- lines[start:end]

  # map out where data values are
  stri_replace_first_regex(rows, "([[:alpha:]]) ([[:alpha:]])", "$1_$2") %>% 
    stri_replace_all_regex("[^[:blank:]]", "X") %>% 
    map(~rle(strsplit(.x, "")[[1]])) -> pos

  # compute the number of data fields
  nfields <- ceiling(max(map_int(pos, ~length(.x$lengths))) / 2)

  # do our best to get them into columns (tibble() replaces the deprecated data_frame())
  tibble(rec = rows) %>% 
    separate(rec, into=sprintf("X%s", 1:nfields), sep="[[:space:]]{2,}", fill="left") %>% 
    print(n=length(rows))

}

transcribe_page(book, 112)
transcribe_page(book, 113)
transcribe_page(book, 114)
transcribe_page(book, 115)

Take a look at the outputs from the calls above. They aren't in terrible shape, and some of the cleanup can be programmatic.
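The `rle()` step inside `transcribe_page()` is the clever bit: masking every non-blank character as `X` and run-length encoding the result maps out where the data columns sit, which is how `nfields` gets computed. A stripped-down base-R illustration on a made-up row (omitting the `stri_replace_first_regex()` word-joining step, since the toy row has no multi-word names):

```r
row <- "Egypt    56,036   3,168"

# mask every non-blank character, then run-length encode the
# alternating blocks of "X" (data) and " " (gaps)
masked <- gsub("[^[:blank:]]", "X", row)
runs <- rle(strsplit(masked, "")[[1]])

runs$values                          # alternates "X", " ", "X", ...
nfields <- sum(runs$values == "X")   # one "X" run per data field
nfields
```

Counting the `"X"` runs directly is equivalent to the answer's `ceiling(length / 2)` over the alternating run sequence; either way you get the field count that `separate()` needs for its `into` columns.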