How do I extract certain words in my document into a dataframe in R?

639 views Asked by At

EDIT: Reproducible example (I hope I'm doing it right):

I went ahead and used as.character(docs[1]) to create a string, for reproducibility:

"list(list(content = c(\"Name: Birthdate (MM/DD): Print Date: Student ID: Institution ID: Page:\", \"\", \"MyName MyBirthday 06/16/2015 N1111111 002785 1 of 1\", \"\", \"A string I don't want\", \"\", \"More stuff I don't want\", \"Don't want\", \"\", \"Names of Classes\", \n\"\", \"Class numbers and sections that I don't want\", \"\", \"Current Cumulative\", \"\", \"AHRS (don't want)\", \"12.0 12.0 (no)\", \"\", \"EHRS (WANT THIS)\", \"12.0 12.0\", \"\", \"QHRS (no)\", \"12.0 12.0\", \"\", \"QPTS (no) \", \" (no) 45.900 45.900\", \"\", \"GPA\", \"3.825 3.825\", \"\", \"Spring 2015\", "etc", \"\", \"End of Graduate Record\", \"\", \"\\f\"), meta = list(author = NULL, datetimestamp = NULL, description = NULL, heading = NULL, id = \"Unofficial June 2015 copy 2.pdf\", language = \"en\", origin = NULL)))"

All I want out of this mess is the ID Number (which is N1111111 in this example), the semester (Fall 2014 and Spring 2015), the numbers following EHRS (12.0 12.0, each in its own column), and the numbers following GPA (3.825 3.825, each in its own column).


I have text data from academic transcripts that needs to put into a dataframe for analysis. I have converted the transcript pdf into text, but now I need certain information in a dataframe. Specifically, I need data in the following columns:

Student ID, Fall 1 Current Hours, Fall 1 Cumulative Hours, Fall 1 Current GPA, Spring 1 Current Hours, Spring 1 Cumulative Hours, Spring 1 Current GPA, Spring 1 Cumulative GPA, Summer 1 Current Hours, Summer 1 Cumulative Hours, Summer 1 Current GPA, Summer 1 Cumulative GPA

etc, for every semester the student remains at the university.

The number of hours comes from EHRS, and cases where no summer courses are listed are treated as 0 current hours, 0 current gpa, and cumulative hours and gpa are the same as for the spring immediately preceding it.

So far, I've converted the pdf to text using the tm library and have the following example transcript:

docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF()))

inspect(docs[1])

Student Name MM/YY 06/16/2015 N11111111 002785 1 of 1

Name of University Beginning of Graduate Record

Fall 2014 Name of School Master of Science Major: Major

Name of Class 1 Name of Class 2 Name of Class 3 Name of Class 4

COURSE+SECTION 3.0 B+ COURSE+SECTION 3.0 A COURSE+SECTION 3.0 A COURSE+SECTION 3.0 A

Current Cumulative

AHRS 12.0 12.0

EHRS 12.0 12.0

QHRS 12.0 12.0

QPTS 45.900 45.900

GPA 3.825 3.825

Spring 2015

Name of School Master of Science Major: Major

Name of Class 1 Name of Class 2 Name of Class 3

COURSE+SECTION 2.0 A COURSE+SECTION 2.0 A COURSE+SECTION 2.0 A-

Name of Class 4 COURSE+SECTION 2.0 A

Name of Class 5

COURSE+SECTION 2.0 A-

Name of Class 6 COURSE+SECTION 4.0 A

Name of Class 7

COURSE+SECTION 3.0 B+

Name of Class 8

COURSE+SECTION

3.0 A

Current Cumulative

AHRS 20.0 32.0

EHRS 20.0 32.0

QHRS 20.0 32.0

QPTS 76.700 122.600

GPA 3.835 3.831

End of Graduate Record

1

There are 1 answers

3
Brandon Bertelsen On BEST ANSWER

This is a strategy I use when documents look alike. If documents are EXACTLY the same. You can skip most of the grep() and use direct references (ie, txt[1]) to the location of the information you want to parse.

Extraction Tactics:

  • Use grep to identify target row. Using anchors ^ or $ work well.
  • Once target row is identified, use strsplit to break into elements needed. Repeat last step.
  • Use direct references (txt[1]) or regex (txt[grep("GPA",txt)]) where possible.
  • Parse and reformat in whichever manner you please.

readLines

txt <- readLines(con=textConnection(
'Student Name MM/YY 06/16/2015 N11111111 002785 1 of 1

Name of University Beginning of Graduate Record

Fall 2014 Name of School Master of Science Major: Major

Name of Class 1 Name of Class 2 Name of Class 3 Name of Class 4

COURSE+SECTION 3.0 B+ COURSE+SECTION 3.0 A COURSE+SECTION 3.0 A COURSE+SECTION 3.0 A

Current Cumulative

AHRS 12.0 12.0

EHRS 12.0 12.0

QHRS 12.0 12.0

QPTS 45.900 45.900

GPA 3.825 3.825

Spring 2015

Name of School Master of Science Major: Major

Name of Class 1 Name of Class 2 Name of Class 3

COURSE+SECTION 2.0 A COURSE+SECTION 2.0 A COURSE+SECTION 2.0 A-

Name of Class 4 COURSE+SECTION 2.0 A

Name of Class 5

COURSE+SECTION 2.0 A-

Name of Class 6 COURSE+SECTION 4.0 A

Name of Class 7

COURSE+SECTION 3.0 B+

Name of Class 8

COURSE+SECTION

3.0 A

Current Cumulative

AHRS 20.0 32.0

EHRS 20.0 32.0

QHRS 20.0 32.0

QPTS 76.700 122.600

GPA 3.835 3.831

End of Graduate Record'))

Cleaning

# trim of http://stackoverflow.com/questions/2261079/whitespace-in-r
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
txt <- trim(txt)
# Drop empties
txt <- txt[txt != ""]

Search and Parse: ID Number

id <- strsplit(txt[1], " ")
id <- id[grep("^[N][0-9]",id)] # Starts with N followed by 0-9

Search and Parse: GPA

gpa <- txt[grep("GPA",txt)]
gpa <- strsplit(gpa, " ")
gpa <- matrix(
  as.numeric(
    t(
      as.data.frame(gpa)
      )[1:2, 2:3]
    ),ncol = 2)

... and so on.