Process every string in a specific field of a collection


I'm working with Python 3.6, PyMongo 3.3.0 and MongoDB 2.6.12. I'm a total beginner with both Python and MongoDB, so sorry if the answer seems too obvious.

I don't have a general idea of how to build a data processing pipeline that transforms MongoDB collections with pymongo. I have a collection of about 800,000 documents, which all look like this:

{'_id': ObjectId('some_id'), 
 'accession': 'an_integer',
 'cik':    'another_integer',
 'filing_date': datetime.datetime(some_date),
 'item': 'some_string'}

Now I want to build a pipeline that processes only the string in the 'item' field of every document with tools from the nltk module (removing stopwords, stemming, etc.) and writes the processed documents into a new collection. If I'm not mistaken, the aggregation framework in MongoDB only supports its predefined operators, so I can't use it for this?

I just don't know where to start actually, so I appreciate any help. (I do know how to apply the nltk methods to a single string stored as a variable within Python, but I don't know how to apply this to a collection as a whole.) Thanks in advance.
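
For reference, the kind of pipeline described above could be sketched roughly as follows: iterate over the source documents, rewrite only the 'item' field, and insert the results into the new collection in batches. This is only a minimal sketch under assumptions. The names process_item, transform, copy_processed, the tiny STOPWORDS set, and the database and collection names in the closing comment are all hypothetical, and process_item is a plain-Python stand-in for the real nltk stopword and stemming step.

```python
from itertools import islice

# Placeholder stop-word list (assumption); with nltk this would typically be
# set(nltk.corpus.stopwords.words("english")).
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def process_item(text):
    """Stand-in for the nltk step: lowercase, split on whitespace, drop stop words."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


def transform(doc, process=process_item):
    """Return a copy of doc in which only the 'item' field has been rewritten."""
    new_doc = dict(doc)
    new_doc["item"] = process(doc["item"])
    return new_doc


def copy_processed(source, target, process=process_item, batch_size=1000):
    """Stream documents from source, process them, and write them to target in batches.

    source: any iterable of dicts, e.g. the cursor returned by collection.find()
    target: any object with insert_many(list_of_dicts), e.g. a pymongo Collection
    Returns the number of documents written.
    """
    docs = (transform(d, process) for d in source)  # lazy: one document at a time
    written = 0
    while True:
        batch = list(islice(docs, batch_size))
        if not batch:
            break
        target.insert_many(batch)
        written += len(batch)
    return written


# Wiring with pymongo (not executed here; database/collection names are hypothetical):
#   from pymongo import MongoClient
#   db = MongoClient()["mydb"]
#   copy_processed(db["filings"].find(), db["filings_processed"])
```

Because the source is consumed lazily and written in fixed-size batches, the whole collection never has to fit in memory at once; swapping in the real nltk logic would only mean passing a different function as process.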
