EDIT: Reproducible example (I hope I'm doing it right):
I went ahead and used as.character(docs[1])
to create a string, for reproducibility:
"list(list(content = c(\"Name: Birthdate (MM/DD): Print Date: Student ID: Institution ID: Page:\", \"\", \"MyName MyBirthday 06/16/2015 N1111111 002785 1 of 1\", \"\", \"A string I don't want\", \"\", \"More stuff I don't want\", \"Don't want\", \"\", \"Names of Classes\", \n\"\", \"Class numbers and sections that I don't want\", \"\", \"Current Cumulative\", \"\", \"AHRS (don't want)\", \"12.0 12.0 (no)\", \"\", \"EHRS (WANT THIS)\", \"12.0 12.0\", \"\", \"QHRS (no)\", \"12.0 12.0\", \"\", \"QPTS (no) \", \" (no) 45.900 45.900\", \"\", \"GPA\", \"3.825 3.825\", \"\", \"Spring 2015\", "etc", \"\", \"End of Graduate Record\", \"\", \"\\f\"), meta = list(author = NULL, datetimestamp = NULL, description = NULL, heading = NULL, id = \"Unofficial June 2015 copy 2.pdf\", language = \"en\", origin = NULL)))"
All I want out of this mess is the ID Number (which is N1111111 in this example), the semester (Fall 2014 and Spring 2015), the numbers following EHRS (12.0 12.0, each in its own column), and the numbers following GPA (3.825 3.825, each in its own column).
I have text data from academic transcripts that needs to put into a dataframe for analysis. I have converted the transcript pdf into text, but now I need certain information in a dataframe. Specifically, I need data in the following columns:
Student ID, Fall 1 Current Hours, Fall 1 Cumulative Hours, Fall 1 Current GPA, Spring 1 Current Hours, Spring 1 Cumulative Hours, Spring 1 Current GPA, Spring 1 Cumulative GPA, Summer 1 Current Hours, Summer 1 Cumulative Hours, Summer 1 Current GPA, Summer 1 Cumulative GPA
etc, for every semester the student remains at the university.
The number of hours comes from EHRS, and cases where no summer courses are listed are treated as 0 current hours, 0 current gpa, and cumulative hours and gpa are the same as for the spring immediately preceding it.
So far, I've converted the pdf to text using the tm library and have the following example transcript:
docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF()))
inspect(docs[1])
Student Name MM/YY 06/16/2015 N11111111 002785 1 of 1
Name of University Beginning of Graduate Record
Fall 2014 Name of School Master of Science Major: Major
Name of Class 1 Name of Class 2 Name of Class 3 Name of Class 4
COURSE+SECTION 3.0 B+ COURSE+SECTION 3.0 A COURSE+SECTION 3.0 A COURSE+SECTION 3.0 A
Current Cumulative
AHRS 12.0 12.0
EHRS 12.0 12.0
QHRS 12.0 12.0
QPTS 45.900 45.900
GPA 3.825 3.825
Spring 2015
Name of School Master of Science Major: Major
Name of Class 1 Name of Class 2 Name of Class 3
COURSE+SECTION 2.0 A COURSE+SECTION 2.0 A COURSE+SECTION 2.0 A-
Name of Class 4 COURSE+SECTION 2.0 A
Name of Class 5
COURSE+SECTION 2.0 A-
Name of Class 6 COURSE+SECTION 4.0 A
Name of Class 7
COURSE+SECTION 3.0 B+
Name of Class 8
COURSE+SECTION
3.0 A
Current Cumulative
AHRS 20.0 32.0
EHRS 20.0 32.0
QHRS 20.0 32.0
QPTS 76.700 122.600
GPA 3.835 3.831
End of Graduate Record
This is a strategy I use when documents look alike. If documents are EXACTLY the same. You can skip most of the grep() and use direct references (ie, txt[1]) to the location of the information you want to parse.
Extraction Tactics:
grep
to identify target row. Using anchors^
or$
work well.strsplit
to break into elements needed. Repeat last step.txt[1]
) or regex (txt[grep("GPA",txt)]
) where possible.readLines
Cleaning
Search and Parse: ID Number
Search and Parse: GPA
... and so on.