I need to work with the cran.all.1400 text file.
It's a collection of abstracts from articles with some additional data about each article. It's in the form:
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream .
.A
brenckman,m.
.B
j. ae. scs. 25, 1958, 324.
.W
//a lot of text
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
department of aeronautical engineering, rensselaer polytechnic
institute
troy, n.y.
.W
//lots of text
and so on.
What I need is the data organized like so:
article 1: .T="whatever the title of article 1 is", .A="w/e the author is", .B="w/e", .W="all the text"
article 2: .T="whatever the title is", .A="w/e the author is", .B="w/e", .W="all the text"
How would I go about doing this in Python? Thank you for your time.
Your idea from the comments of splitting on `.I` seems like a good start. The following seems to work:
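A minimal sketch of that split, assuming the whole file fits in memory. The function names `split_articles` and `read_articles`, and the tiny `sample` string, are hypothetical and only for illustration:

```python
def split_articles(text):
    """Return one string per article, splitting where a line starts with '.I'."""
    # No check for an empty first chunk: the text must start with ".I"
    # directly, with no newline before it.
    # Later chunks lose the ".I" that split() consumed and begin with " 2", " 3", ...
    return text.split("\n.I")


def read_articles(path):
    """Read the whole file at `path` and split it into article strings."""
    with open(path) as f:
        return split_articles(f.read())


# Made-up two-article sample in the same layout as the real file.
sample = (
    ".I 1\n.T\ntitle one\n.A\nauthor one\n.B\nref one\n.W\ntext one\n"
    ".I 2\n.T\ntitle two\n.A\nauthor two\n.B\nref two\n.W\ntext two\n"
)
articles = split_articles(sample)
print(len(articles))
```

For the real data you would call something like `read_articles("cran.all.1400")` after trimming anything that precedes the first `.I` line.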
I created a test file consisting of only the first 10 articles (discarding a small header and all of the file starting with `.I 11`). When I run the above code I get a list of length 10. It is important that the very first line begins with `.I` (with no prior newline), since I make no effort to test if the first entry of the split is empty. The first entry in the list is a string that begins with the first article's `.I 1` line.

On Edit: Here is a dictionary version which uses `partition` to successively pull off the relevant chunks. It returns a dictionary of dictionaries rather than a list of strings.
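One way such a dictionary version could look is sketched below. It assumes every article has its markers in the order `.T`, `.A`, `.B`, `.W`, each at the start of a line; the function name `parse_articles`, the key names, and the placeholder abstract text are hypothetical:

```python
def parse_articles(text):
    """Map article number -> {'T': title, 'A': author, 'B': biblio, 'W': body}."""
    articles = {}
    for chunk in text.split("\n.I"):
        # Only the first chunk still carries its ".I" marker; drop it.
        if chunk.startswith(".I"):
            chunk = chunk[2:]
        # Peel off one field at a time; "_" discards the delimiter itself.
        number, _, rest = chunk.partition("\n.T")
        title, _, rest = rest.partition("\n.A")
        author, _, rest = rest.partition("\n.B")
        biblio, _, body = rest.partition("\n.W")
        articles[int(number)] = {
            "T": title.strip(),
            "A": author.strip(),
            "B": biblio.strip(),
            "W": body.strip(),
        }
    return articles


# Two records from the question, with placeholder abstract text.
sample = (
    ".I 1\n.T\nexperimental investigation of the aerodynamics of a\n"
    "wing in a slipstream .\n.A\nbrenckman,m.\n.B\nj. ae. scs. 25, 1958, 324.\n"
    ".W\nplaceholder abstract text\n"
    ".I 2\n.T\nsimple shear flow past a flat plate\n.A\nting-yili\n"
    ".B\ndepartment of aeronautical engineering\n.W\nmore placeholder text\n"
)
articles = parse_articles(sample)
print(articles[1]["A"])  # brenckman,m.
print(articles[2]["T"])  # simple shear flow past a flat plate
```

Splitting on the marker without its trailing newline means an empty field (a marker line followed directly by the next marker) still parses, at the cost of misfiring if a line inside a title or author field happens to start with one of the markers.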
`s.partition()` returns a triple consisting of the part of the string `s` before the first occurrence of the delimiter, the delimiter itself, and the part of the string after that occurrence of the delimiter. The underscore (`_`) in the code is a Python idiom which emphasizes that the intent is to discard that part of the return value.
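A small illustration of that behaviour, using a fragment of the first record from the question (the variable names are just for the example):

```python
record = "brenckman,m.\n.B\nj. ae. scs. 25, 1958, 324."

# Unpack the triple; "_" receives the delimiter, which we do not need.
author, _, rest = record.partition("\n.B\n")
print(author)  # brenckman,m.
print(rest)    # j. ae. scs. 25, 1958, 324.

# If the delimiter is absent, the whole string comes back first,
# followed by two empty strings.
missing = "no marker here".partition("\n.B\n")
print(missing)  # ('no marker here', '', '')
```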