Group and count all elements in arrays from MongoDB

I have tweet data in MongoDB with many fields. One of them, "content", holds the text of each tweet I scraped. I want to extract all hashtags from "content" and then group them.

My data looks like this:

{
    "_id" : NumberLong(1564531556487659520),
    "content" : "Wie hat die #Corona-Pandemie den Arbeitsmarkt in Deutschland verändert? – @JansenAnika und Paula Risius vom @iw_koeln geben auf unserem Blog einen",
    "likes" : NumberInt(0),
    "replies" : NumberInt(0),
    "retweet" : NumberInt(0)
},
{
    "_id" : NumberLong(1564531463999168512),
    "content" : "Start-ups noch pessimistischer als im #Corona-#Krisenjahr 2020",
    "likes" : NumberInt(0),
    "replies" : NumberInt(0),
    "retweet" : NumberInt(0)
},
{
    "_id" : NumberLong(1564531140802789381),
    "content" : "Gesundheitsminister @klausholetschek fürchtet das Sinken der Hemmschwelle bei der #Legalisierung von #Cannabis. Ab Mitte September erleben wir in #München wieder das Absinken ganz anderer Hemmschwellen, #Corona-Hotspot inklusive.",
    "likes" : NumberInt(1),
    "replies" : NumberInt(1),
    "retweet" : NumberInt(0)
}

I wrote the following code:

    data = db.tweets.aggregate([{
        "$project":{
            "content":{
                "$regexFindAll":{
                    "input":"$content",
                    "regex":r'[#]\w+'
                }
            }
        }
    },
    {
        "$group":{
            "_id":"$content.match",
            "count":{
                "$sum":1
            }
        }
    }
    ])

The result is not what I want: I get one dictionary per group, and each "_id" holds the whole list of hashtags found in a tweet, so tweets are grouped by their full hashtag list rather than by individual hashtag.

My results:

{'_id': ['#Gersemann', '#Corona'], 'count': 1},
{'_id': ['#MAH', '#CORONA', '#CASES'], 'count': 3},
{'_id': ['#corona', '#coronalanding', '#coronasymptoms', '#coronawordpresstheme', '#coronavirus', '#coronavirusprevention', '#covid', '#covid19', '#covid19theme', '#covid19', '#healthbeauty', '#healthcare', '#imithemes', '#medical'], 'count': 1},
{'_id': ['#China', '#Covid', '#Corona', '#SarsCoV2'], 'count': 1},
{'_id': ['#Gehorsam', '#Staat', '#Unterdr', '#Corona', '#Covid', '#Masken', '#Manie', '#Deutschen', '#Coronauten'], 'count': 1},
{'_id': ['#Maskenregeln', '#Corona', '#COVID19', '#Maske'], 'count': 1},
{'_id': ['#Pandemie', '#GBD', '#Medienversagen', '#Corona'], 'count': 1},
{'_id': ['#Herbst', '#Covid', '#Gesundheit', '#Corona', '#Maskenpflicht', '#Bundesregierung', '#Krankheit', '#Pandemie', '#Wochenblatt', '#WochenblattMedia', '#WochenblattNews'], 'count': 1},
{'_id': ['#COVID19', '#SARSCoV2', '#CORONA'], 'count': 1}]

What I want instead is a count for each individual hashtag.

There is 1 answer

Answered by AlexisG

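You can use $unwind to split the "content" array. $regexFindAll returns an array of match objects for each document; $unwind emits one document per array element, so the following $group stage sees a single "content.match" value at a time and counts each hashtag on its own: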

[
  {
    "$project": {
      "content": {
        "$regexFindAll": {
          "input": "$content",
          "regex": "[#]\\w+"
        }
      }
    }
  },
  {
    "$unwind": "$content"
  },
  {
    "$group": {
      "_id": "$content.match",
      "count": {
        "$sum": 1
      }
    }
  }
]
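
As a rough sketch of how this could look from Python, the block below restates the pipeline as a list for pymongo and adds a plain-Python mirror of the same project/unwind/group logic for checking small samples without a server. The names HASHTAG_PIPELINE, count_hashtags and count_hashtags_locally are hypothetical, the final $sort stage is an addition that is not part of the pipeline above, and the sample strings are abbreviated from the tweets in the question.

```python
import re
from collections import Counter

# Same stages as the pipeline above, written as a Python list.
# The trailing $sort stage is an assumption added for readability.
HASHTAG_PIPELINE = [
    {"$project": {"content": {"$regexFindAll": {"input": "$content", "regex": r"#\w+"}}}},
    {"$unwind": "$content"},  # one document per matched hashtag
    {"$group": {"_id": "$content.match", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]


def count_hashtags(collection):
    """Run the pipeline on a pymongo collection and return {hashtag: count}."""
    return {doc["_id"]: doc["count"] for doc in collection.aggregate(HASHTAG_PIPELINE)}


def count_hashtags_locally(contents):
    """Plain-Python mirror of project -> unwind -> group, for small samples."""
    counts = Counter()
    for text in contents:
        # re.findall plays the role of $regexFindAll; feeding the tags into
        # the counter one by one corresponds to $unwind followed by $group.
        counts.update(re.findall(r"#\w+", text))
    return dict(counts)


# Abbreviated versions of the three tweets from the question.
sample = [
    "Wie hat die #Corona-Pandemie den Arbeitsmarkt in Deutschland verändert?",
    "Start-ups noch pessimistischer als im #Corona-#Krisenjahr 2020",
    "Hemmschwelle bei der #Legalisierung von #Cannabis, #Corona-Hotspot inklusive.",
]
local_counts = count_hashtags_locally(sample)
```

Against a real database the call would be count_hashtags(db.tweets), assuming db is a pymongo database handle as in the question. Grouping is case-sensitive in both versions, so '#Corona' and '#CORONA' are counted as different hashtags.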
