How can I use the complete Penn Treebank dataset in Python/NLTK?


I'm learning to use the NLTK package in Python. In particular, I need the Penn Treebank dataset. As far as I know, calling nltk.download('treebank') gives me only a 5% sample of the dataset. However, I have the complete dataset in a tar.gz file and want to use that instead. The documentation says:

If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:

So I opened Python from the terminal, imported nltk, and ran nltk.download('ptb'). This created a ptb directory under ~/nltk_data/corpora, so I now have ~/nltk_data/corpora/ptb. As the documentation quoted above suggests, I put my dataset folder inside it. This is the resulting directory hierarchy:

    $ pwd
    ~/nltk_data/corpora/ptb/WSJ
    $ ls
    00  02  04  06  08  10  12  14  16  18  20  22  24
    01  03  05  07  09  11  13  15  17  19  21  23  merge.log

Each of the folders from 00 to 24 contains many .mrg files, such as wsj_0001.mrg, wsj_0002.mrg, and so on.

Now, back to my question. According to the same documentation, I should be able to get the file IDs by writing the following:

>>> from nltk.corpus import ptb
>>> print(ptb.fileids()) # doctest: +SKIP
['BROWN/CF/CF01.MRG', 'BROWN/CF/CF02.MRG', 'BROWN/CF/CF03.MRG', 'BROWN/CF/CF04.MRG', ...]

Unfortunately, when I type print(ptb.fileids()) I get an empty list:

>>> print(ptb.fileids())
[]

Can anyone help me?

EDIT: Here are the contents of my ptb directory and part of the allcats.txt file:

    $ pwd
    ~/nltk_data/corpora/ptb
    $ ls
    allcats.txt  WSJ
    $ cat allcats.txt
    WSJ/00/WSJ_0001.MRG news
    WSJ/00/WSJ_0002.MRG news
    WSJ/00/WSJ_0003.MRG news
    WSJ/00/WSJ_0004.MRG news
    WSJ/00/WSJ_0005.MRG news

and so on.

There is 1 answer

Best answer, by freieschaf

The PTB corpus reader needs uppercase directory and file names (as hinted by the contents of allcats.txt that you included in your question). This clashes with many distributions of Penn Treebank out there, which use lowercase.
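One way to confirm the mismatch before renaming anything is to compare the entries in allcats.txt with the files actually on disk. The following is only a minimal sketch: the function name `listed_but_missing` is hypothetical, and it assumes each line of allcats.txt starts with a relative path followed by a category, as in the excerpt from the question. The comparison is done on exact strings, so it stays case-sensitive even on a case-insensitive filesystem.

```python
import os


def listed_but_missing(ptb_root):
    """Return allcats.txt entries that have no exact-case match under ptb_root."""
    # Collect every file below ptb_root as a forward-slash relative path.
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(ptb_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), ptb_root)
            on_disk.add(rel.replace(os.sep, "/"))

    # Assumed line format: "<relative path> <category>", e.g.
    # "WSJ/00/WSJ_0001.MRG news".
    missing = []
    with open(os.path.join(ptb_root, "allcats.txt"), encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            if parts and parts[0] not in on_disk:
                missing.append(parts[0])
    return missing


# Example call (the path is an assumption based on the question):
# print(listed_but_missing(os.path.expanduser("~/nltk_data/corpora/ptb"))[:5])
```

If the returned list contains entries such as WSJ/00/WSJ_0001.MRG while the directory holds wsj_0001.mrg, the file names differ only in case.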

A quick fix for this would be renaming the folders wsj and brown and their contents to uppercase. A UNIX command you can use for this is:

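# Uppercases every file and directory name below the current directory.
# If allcats.txt is in that directory it is renamed too, so you may need
# to rename it back afterwards.
find . -depth ! -name . | \
    while IFS= read -r LONG    # IFS= and -r keep spaces and backslashes intact
    do
        SHORT=$( basename "$LONG" | tr '[:lower:]' '[:upper:]' )
        DIR=$( dirname "$LONG" )
        # Rename only when uppercasing actually changes the name
        if [ "${LONG}" != "${DIR}/${SHORT}" ]
        then
            mv "${LONG}" "${DIR}/${SHORT}"
        fi
    done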

(Obtained from this question). It will change directory and file names to uppercase recursively.
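If you would rather stay in Python, the same renaming can be sketched with os.walk. This is not part of NLTK; `uppercase_tree` is a hypothetical helper name, and the sketch assumes no two names in the same directory differ only in case (otherwise a rename could overwrite a file or fail). The root directory itself is not renamed.

```python
import os


def uppercase_tree(root):
    """Uppercase the name of every file and directory below root.

    Returns the number of entries that were renamed.
    """
    renamed = 0
    # topdown=False yields a directory only after all of its subdirectories,
    # so the contents are renamed before the directory that holds them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            upper = name.upper()
            if upper != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, upper))
                renamed += 1
    return renamed


# Example call (the path is an assumption based on the question):
# uppercase_tree(os.path.expanduser("~/nltk_data/corpora/ptb/WSJ"))
```

Calling it on the WSJ (and BROWN) directories rather than on ptb itself leaves allcats.txt untouched. After that, ptb.fileids() can be tried again.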