Process every string in a specific field of a collection

55 views Asked by oxymoron At 14 September 2024 at 21:59

I'm working with Python 3.6, Pymongo 3.3.0 and MongoDB version 2.6.12. I'm a total beginner with both Python and MongoDB, so sorry if the answer seems too obvious.

I lack a general idea of how to build a data processing pipeline that transforms MongoDB collections with pymongo. I have a collection of about 800,000 documents that all look like this:

{'_id': ObjectId('some_id'),
 'accession': 'an_integer',
 'cik': 'another_integer',
 'filing_date': datetime.datetime(some_date),
 'item': 'some_string'}

Now I want to build a pipeline that processes only the string in the 'item' field of every document with tools from the nltk module (removing stopwords, stemming, etc.) and writes the processed documents to a new collection. If I'm not mistaken, the aggregation framework in MongoDB only supports its predefined operators, so I can't use it for this?
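For illustration only, here is a minimal sketch of the per-string step as a plain Python function that runs client-side, since the aggregation framework cannot call Python code. `clean_item` is a hypothetical helper name. The stopword set and the stemming function are passed in as parameters; with nltk these would presumably be `set(stopwords.words('english'))` and `PorterStemmer().stem`, but small stand-ins are used here so the sketch runs without nltk installed.

```python
import re


def clean_item(text, stop_words, stem):
    """Lowercase, tokenize, drop stopwords, and stem a single string."""
    # Simple regex tokenizer, used here as a stand-in for an nltk tokenizer.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stem(tok) for tok in tokens if tok not in stop_words)


# Stand-ins (assumptions for this sketch, not nltk's actual data or stemmer).
demo_stop_words = {"the", "a", "of", "and", "is"}


def demo_stem(word):
    # Crude plural stripping; a real stemmer handles far more cases.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


cleaned = clean_item("The risks of the markets", demo_stop_words, demo_stem)
print(cleaned)  # risk market
```

Keeping the stopword set and stemmer as arguments means they are built once and reused for every document, rather than being reloaded for each string.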

I just don't know where to start, so I appreciate any help. (I do know how to apply the nltk methods to a single string stored in a Python variable, but I don't know how to apply them to a whole collection.) Thanks in advance.
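One way to apply a single-string function to a whole collection, sketched under assumptions: iterate over the cursor returned by `find()`, transform the 'item' field of each document in Python, and write the results to the new collection in batches with `insert_many()`. `copy_with_transform` is a hypothetical helper, and the database and collection names in the usage comment are placeholders. The function only relies on `find()` and `insert_many()`, so it should work with pymongo `Collection` objects, but treat it as a starting point rather than a tuned implementation.

```python
def copy_with_transform(source, target, transform, field="item", batch_size=1000):
    """Read every document from `source`, transform one field, write to `target`.

    `source` and `target` are expected to behave like pymongo Collection
    objects; only find() and insert_many() are used.
    Returns the number of documents written.
    """
    batch = []
    written = 0
    for doc in source.find():        # the cursor yields documents one at a time
        new_doc = dict(doc)          # shallow copy; keeps _id and the other fields
        if isinstance(new_doc.get(field), str):
            new_doc[field] = transform(new_doc[field])
        batch.append(new_doc)
        if len(batch) >= batch_size:
            target.insert_many(batch)
            written += len(batch)
            batch = []
    if batch:                        # flush the final partial batch
        target.insert_many(batch)
        written += len(batch)
    return written


# Hypothetical usage (names are placeholders, not from the question):
# from pymongo import MongoClient
# db = MongoClient()["mydb"]
# copy_with_transform(db["filings"], db["filings_clean"], my_nltk_function)
```

Writing in batches keeps memory use bounded for about 800,000 documents and avoids one round trip per document. Because the original `_id` values are copied, running it twice against the same target collection would raise duplicate-key errors.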
