Read complete penn treebank dataset from local directory


I have a complete Penn Treebank dataset and I want to read it using ptb from nltk.corpus. But according to the documentation:

If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:

But I want to keep the dataset in a local directory and load it from there instead of from nltk_data/corpora/ptb. ptb always searches in that directory. How can I give ptb a path so that it searches in a directory of my choice? Is there any way to do that? I have searched the web thoroughly and tried a few approaches, but none of them worked for me.

1 answer

alexis On BEST ANSWER

You can keep your corpus files in a local directory and just add symlinks from the nltk_data/corpora/ptb folder to the location of your corpus, as the paragraph you quoted suggests. But if you can't modify nltk_data, or just don't like the idea of a needless round trip through the nltk_data directory, read on.
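If you do go the symlink route, the links can also be created from Python. This is only a minimal sketch: link_ptb is a hypothetical helper (not part of NLTK), both directory arguments are placeholders for your own paths, and it assumes your Treebank copy contains BROWN and WSJ directories.

```python
import os

def link_ptb(treebank_dir, nltk_ptb_dir, parts=("BROWN", "WSJ")):
    """Symlink Treebank sections into an nltk_data/corpora/ptb directory.

    Hypothetical helper for illustration; not part of NLTK.
    """
    os.makedirs(nltk_ptb_dir, exist_ok=True)
    created = []
    for part in parts:
        source = os.path.abspath(os.path.join(treebank_dir, part))
        target = os.path.join(nltk_ptb_dir, part)
        # Skip sections that are missing or already present at the target.
        if os.path.isdir(source) and not os.path.lexists(target):
            os.symlink(source, target, target_is_directory=True)
            created.append(target)
    return created
```

The function returns the list of links it created, so calling it a second time with the same arguments should return an empty list.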

The object ptb is just a shortcut to a corpus reader object initialized with the appropriate settings for the Penn Treebank corpus. It is defined (in nltk/corpus/__init__.py) like this:

ptb = LazyCorpusLoader( # Penn Treebank v3: WSJ and Brown portions
    'ptb', CategorizedBracketParseCorpusReader, r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG',
    cat_file='allcats.txt', tagset='wsj')

You can ignore the LazyCorpusLoader part; it's used because NLTK defines a lot of corpus endpoints, most of which are never loaded in any one Python program. Instead, create a corpus reader by instantiating CategorizedBracketParseCorpusReader directly. If your corpus looks exactly like the ptb corpus, you'd call it like this:

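from nltk.corpus.reader import CategorizedBracketParseCorpusReader

# The first argument is the corpus root: the directory that holds the
# WSJ and BROWN folders, along with allcats.txt.
myreader = CategorizedBracketParseCorpusReader(r"<path to your corpus>",
    r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG',
    cat_file='allcats.txt', tagset='wsj')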

As you can see, you supply the path to the real location of your files and leave the remaining arguments the same: a regexp of file names to include in the corpus, a file mapping corpus files to categories, and the tagset to use. The object you create will be the same kind of corpus reader as ptb, and it works much like treebank (except that it is not lazily created).
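To check which of your files the reader is likely to pick up, you can try the regexp on a few paths first. A rough sketch: matching_fileids is a hypothetical helper, and it assumes the pattern is applied to each file's path relative to the corpus root, with re.fullmatch standing in for NLTK's own file filter.

```python
import re

# Same file-name pattern as in the ptb definition above.
PTB_PATTERN = r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG'

def matching_fileids(relative_paths, pattern=PTB_PATTERN):
    """Return the paths (relative to the corpus root) that the pattern accepts.

    Hypothetical helper; re.fullmatch approximates NLTK's file filter.
    """
    return [p for p in relative_paths if re.fullmatch(pattern, p)]

candidates = [
    "WSJ/00/WSJ_0001.MRG",
    "BROWN/CF/CF01.MRG",
    "wsj/00/wsj_0001.mrg",  # lowercase names do not match
    "WSJ/00/README",        # non-.MRG files do not match
]
print(matching_fileids(candidates))
# prints ['WSJ/00/WSJ_0001.MRG', 'BROWN/CF/CF01.MRG']
```

If your copy of the Treebank names its files differently (in lowercase, for instance), adjust the regexp you pass to the reader accordingly.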