I have a complete Penn Treebank dataset and I want to read it using ptb from nltk.corpus. But the NLTK documentation says:
If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:
But I want to keep the dataset in a local directory and load it from there instead of from nltk_data/corpora/ptb. ptb always searches in that directory; how can I give ptb a path so that it searches in a directory of my choice? Is there any way to do that? I have searched the web thoroughly and tried a few approaches, but none of them worked for me.
You can keep your corpus files in your local directory and just add symlinks from an nltk_data/corpora folder to the location of your corpus, as the paragraph you quoted suggests. But if you can't modify nltk_data or just don't like the idea of a needless round trip through the nltk_data directory, read on.

The object
ptb is just a shortcut to a corpus reader object initialized with the appropriate settings for the Penn Treebank corpus. It is defined in nltk/corpus/__init__.py as a LazyCorpusLoader that wraps CategorizedBracketParseCorpusReader.

You can ignore the LazyCorpusLoader part; it's used because NLTK defines a lot of corpus endpoints, most of which are never loaded in any one Python program. Instead, create a corpus reader by instantiating CategorizedBracketParseCorpusReader directly.

If your corpus looks exactly like the ptb corpus, supply the path to the real location of your files as the first argument and leave the remaining arguments the same as in that definition: a regexp of file names to include in the corpus, a file mapping corpus files to categories, and the tagset to use. The object you create will be exactly the same corpus reader as ptb or treebank (except that it is not lazily created).
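
A minimal sketch of the direct instantiation described above. The file-name regexp, cat_file and tagset values are my recollection of how NLTK configures its built-in ptb reader, so treat them as assumptions and compare them with nltk/corpus/__init__.py in your installed version. The helper name load_local_ptb is hypothetical, introduced only for illustration.

```python
from nltk.corpus.reader import CategorizedBracketParseCorpusReader

# Assumption: these settings mirror the ones NLTK uses for its built-in
# ptb reader; verify them against nltk/corpus/__init__.py.
PTB_FILEIDS = r"(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG"


def load_local_ptb(root):
    """Build a Penn Treebank reader rooted at a local directory.

    `root` should contain the WSJ and BROWN directories of the Treebank
    installation, plus the allcats.txt category file.
    """
    return CategorizedBracketParseCorpusReader(
        root,                    # real location of the corpus files
        PTB_FILEIDS,             # which files belong to the corpus
        cat_file="allcats.txt",  # maps corpus files to categories
        tagset="wsj",            # tagset used by the corpus
    )
```

You would then call something like ptb = load_local_ptb("/path/to/treebank") (the path is a placeholder) and use ptb.fileids() or ptb.parsed_sents() as you would with the built-in reader.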