I have data in MongoDB with many fields. One of them is the content of the tweets that I scraped. I want to extract all hashtags from the content and then group them.
My data looks like this:
{
"_id" : NumberLong(1564531556487659520),
"content" : "Wie hat die #Corona-Pandemie den Arbeitsmarkt in Deutschland verändert? – @JansenAnika und Paula Risius vom @iw_koeln geben auf unserem Blog einen",
"likes" : NumberInt(0),
"replies" : NumberInt(0),
"retweet" : NumberInt(0)
},
{
"_id" : NumberLong(1564531463999168512),
"content" : "Start-ups noch pessimistischer als im #Corona-#Krisenjahr 2020",
"likes" : NumberInt(0),
"replies" : NumberInt(0),
"retweet" : NumberInt(0)
},
{
"_id" : NumberLong(1564531140802789381),
"content" : "Gesundheitsminister @klausholetschek fürchtet das Sinken der Hemmschwelle bei der #Legalisierung von #Cannabis. Ab Mitte September erleben wir in #München wieder das Absinken ganz anderer Hemmschwellen, #Corona-Hotspot inklusive.",
"likes" : NumberInt(1),
"replies" : NumberInt(1),
"retweet" : NumberInt(0)
}
I wrote the code below:
data = db.tweets.aggregate([
    {
        "$project": {
            "content": {
                "$regexFindAll": {
                    "input": "$content",
                    "regex": r'[#]\w+'
                }
            }
        }
    },
    {
        "$group": {
            "_id": "$content.match",
            "count": {
                "$sum": 1
            }
        }
    }
])
The result is different from what I want: I get a list of dictionaries, and each "_id" contains the whole list of hashtags collected from one tweet.
My results:
{'_id': ['#Gersemann', '#Corona'], 'count': 1},
{'_id': ['#MAH', '#CORONA', '#CASES'], 'count': 3},
{'_id': ['#corona', '#coronalanding', '#coronasymptoms', '#coronawordpresstheme', '#coronavirus', '#coronavirusprevention', '#covid', '#covid19', '#covid19theme', '#covid19',
'#healthbeauty', '#healthcare', '#imithemes', '#medical'], 'count': 1},
{'_id': ['#China', '#Covid', '#Corona', '#SarsCoV2'], 'count': 1},
{'_id': ['#Gehorsam', '#Staat', '#Unterdr', '#Corona', '#Covid', '#Masken', '#Manie', '#Deutschen', '#Coronauten'], 'count': 1},
{'_id': ['#Maskenregeln', '#Corona', '#COVID19', '#Maske'], 'count': 1},
{'_id': ['#Pandemie', '#GBD', '#Medienversagen', '#Corona'], 'count': 1},
{'_id': ['#Herbst', '#Covid', '#Gesundheit', '#Corona', '#Maskenpflicht', '#Bundesregierung', '#Krankheit', '#Pandemie', '#Wochenblatt', '#WochenblattMedia', '#WochenblattNews'], 'count': 1},
{'_id': ['#COVID19', '#SARSCoV2', '#CORONA'], 'count': 1}]
What I want instead is to group by each individual hashtag and count how often it occurs.
You can use $unwind to split your content list, so that each hashtag becomes its own document before the $group stage.
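As a minimal sketch of that idea, the pipeline from the question could be extended with a $unwind stage between $project and $group. The helper name count_hashtags and the final $sort stage are my own additions for illustration and are not part of the question; the collection argument is assumed to be a pymongo collection such as db.tweets.

```python
# Pipeline from the question, with a $unwind stage added.
pipeline = [
    # Collect every "#word" match in the tweet text into an array of
    # {match, idx, captures} objects.
    {
        "$project": {
            "content": {
                "$regexFindAll": {"input": "$content", "regex": r"#\w+"}
            }
        }
    },
    # Emit one document per array element, i.e. one per hashtag.
    {"$unwind": "$content"},
    # "$content.match" is now a single string, so the group key is
    # one hashtag instead of a list of hashtags.
    {"$group": {"_id": "$content.match", "count": {"$sum": 1}}},
    # Optional (an addition, not in the question): most frequent first.
    {"$sort": {"count": -1}},
]


def count_hashtags(collection):
    """Run the pipeline on a collection (hypothetical helper)."""
    return list(collection.aggregate(pipeline))
```

With the setup from the question this would be called as count_hashtags(db.tweets). Note that grouping stays case-sensitive here, so '#Corona' and '#CORONA' are counted as separate hashtags.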