python - how to parse semi structured text (cran.all.1400)


I need to work with the cran.all.1400 text file.

It's a collection of abstracts from articles, with some additional data about each article. It's in the form:

.I 1
.T
experimental investigation of the aerodynamics of a wing in a slipstream .
.A
brenckman,m.
.B
j. ae. scs. 25, 1958, 324.
.W
//a lot of text
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small viscosity .
.A
ting-yili
.B
department of aeronautical engineering, rensselaer polytechnic institute troy, n.y.
.W
//lots of text


and so on.

What I need is the data organized like so:

article 1: .T="whatever the title of article 1 is", .A="w/e the author is", .B="w/e", .W="all the text"
article 2: .T="whatever the title is", .A="w/e the author is", .B="w/e", .W="all the text"

How would I go about doing this in Python? Thank you for your time.

Answer by John Coleman (best answer)

Your idea from the comments of splitting on .I seems like a good start.

The following seems to work:

with open('crantest.txt') as f:
    articles = f.read().split('\n.I')

def process(i, article):
    article = article.replace('\n.T\n','.T=')
    article = '.T=' + article.split('.T=')[1] #strips off the article number, restored below
    article = article.replace('\n.A\n',',.A=')
    article = article.replace('\n.B\n',',.B=')
    article = article.replace('\n.W\n',',.W=')
    return 'article ' + str(i) + ':' + article

data = [process(i+1, article) for i,article in enumerate(articles)]

I created a test file consisting of only the first 10 articles (discarding a small header and everything in the file from .I 11 onward). When I run the above code I get a list of length 10. It is important that the very first line begins with .I (with no preceding newline), since I make no effort to test whether the first entry of the split is empty. The first entry in the list is a string that begins:

article 1:.T=experimental investigation of the aerodynamics of a\nwing in a slipstream .,.A=brenckman,m.,.B=j. ae. scs. 25, 1958, 324.,.W=experimental investigation of the aerodynamics of a\nwing in a slipstream
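The caveat above, that the file must begin with .I, can be relaxed. The following is only a minimal sketch, assuming the same record layout as in the question. The names split_articles and sample are hypothetical, and the sketch swaps the plain str.split for re.split with a multiline anchor, so that any header before the first .I line is set aside instead of being treated as an article:

```python
import re

def split_articles(text):
    """Return a list of (number, body) pairs, one per '.I n' record."""
    # Because the pattern has a capture group, re.split returns
    # [preamble, number1, body1, number2, body2, ...].
    parts = re.split(r'^\.I[ \t]+(\d+)[ \t]*\n', text, flags=re.MULTILINE)
    numbers = parts[1::2]   # the captured article numbers
    bodies = parts[2::2]    # the text following each '.I n' line
    return [(int(n), body) for n, body in zip(numbers, bodies)]

# Hypothetical sample data: a header line followed by two short records.
sample = (
    "some header line\n"
    ".I 1\n.T\nfirst title .\n.A\nauthor one\n.B\nsource one\n.W\nfirst text\n"
    ".I 2\n.T\nsecond title .\n.A\nauthor two\n.B\nsource two\n.W\nsecond text\n"
)
articles = split_articles(sample)
```

Note that each body here starts with .T on its own line rather than with a newline followed by .T, so any field extraction applied afterwards would need to account for the missing leading newline.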

On edit: here is a dictionary version which uses partition to successively pull off the relevant chunks. It returns a dictionary of dictionaries rather than a list of strings:

with open('crantest.txt') as f:
    articles = f.read().split('\n.I')

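def process(article):
    # Drop everything up to and including the .T marker (i.e. the article number).
    article = article.split('\n.T\n')[1]
    # Peel off one field at a time; partition splits at the first match only.
    T, _, article = article.partition('\n.A\n')
    A, _, article = article.partition('\n.B\n')
    B, _, W = article.partition('\n.W\n')
    return {'T':T, 'A':A, 'B':B, 'W':W}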

data = {(i+1):process(article) for i,article in enumerate(articles)}

For example:

>>> data[1]
{'A': 'brenckman,m.', 'T': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .', 'B': 'j. ae. scs. 25, 1958, 324.', 'W': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .\n  an experimental study of a wing in a propeller slipstream was\nmade in order to determine the spanwise distribution of the lift\nincrease due to slipstream at different angles of attack of the wing\nand at different free stream to slipstream velocity ratios .  the\nresults were intended in part as an evaluation basis for different\ntheoretical treatments of this problem .\n  the comparative span loading curves, together with\nsupporting evidence, showed that a substantial part of the lift increment\nproduced by the slipstream was due to a /destalling/ or\nboundary-layer-control effect .  the integrated remaining lift\nincrement, after subtracting this destalling lift, was found to agree\nwell with a potential flow theory .\n  an empirical evaluation of the destalling effects was made for\nthe specific configuration of the experiment .'}

s.partition() returns a triple consisting of the part of the string s before the first occurrence of the delimiter, the delimiter itself, and the part of the string after that occurrence of the delimiter. The underscore (_) in the code is a Python idiom which emphasizes that the intent is to discard that part of the return value.
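To make the behaviour of partition concrete, here is a small illustration on a made-up string (the variable names and the sample text are hypothetical, not taken from the cran file):

```python
record = "title line\n.A\nauthor line\n.B\nsource line"

# partition splits at the FIRST occurrence of the delimiter only.
title, sep, rest = record.partition("\n.A\n")
# title is the text before the delimiter, sep is the delimiter itself,
# and rest is everything after it (still containing the .B marker).

# If the delimiter is absent, the whole string comes back as the first
# item and the other two items are empty strings.
whole, missing_sep, tail = record.partition("\n.W\n")
```

One consequence for the dictionary version: a record that lacks one of the later markers would not raise an error in the partition calls. Instead, the preceding field would silently absorb the remaining text and the later fields would be empty strings.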