This code works on a dummy mbox, but not on a Gmail Takeout mbox


I have this code, which translates an mbox file to JSON. The goal is to load the resulting JSON into a MongoDB database. The code was tested on a dummy mbox, "example.mbox", and it worked fine. However, when I ran it on the actual mbox, it did produce the JSON file, but it also printed messages such as "Skipping MIME content in JSONification (multipart)". I do not want to skip anything!

import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse

MBOX = 'antonita.mbox'
OUT_FILE = MBOX + '.json'

def cleanContent(msg):

    # Decode message from "quoted printable" format, but first
    # re-encode, since decodestring will try to do a decode of its own
    msg = quopri.decodestring(msg.encode('utf-8'))

    # Strip out HTML tags, if any are present.
    # Bail on unknown encodings if errors happen in BeautifulSoup.
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))

# There's a lot of data to process, and the Pythonic way to do it is with a 
# generator. See http://wiki.python.org/moin/Generators.
# Using a generator requires a trivial encoder to be passed to json for object 
# serialization.

class Encoder(json.JSONEncoder):
    def default(self, o): return  list(o)

# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break

        yield jsonifyMessage(msg)

def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, Cc, and Bcc fields, if present, could have multiple items.
    # Note that not all of these fields are necessarily defined.

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                 .replace(' ', '').decode('utf-8', 'ignore').split(',')

    for part in msg.walk():
        json_part = {}

        if part.get_content_maintype() != 'text':
            print >> sys.stderr, "Skipping MIME content in JSONification ({0})".format(part.get_content_maintype())
            continue

        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)
        json_msg['parts'].append(json_part)

    # Finally, convert date from asctime to milliseconds since epoch using the
    # $date descriptor so it imports "natively" as an ISODate object in MongoDB
    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date' : millis}

    return json_msg

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

# Write each message out as a JSON object on a separate line
# for easy import into MongoDB via mongoimport

f = open(OUT_FILE, 'w')
for msg in gen_json_msgs(mbox):
    if msg != None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close()

print "All done"

Output:

Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
All done

NOTE: As some have pointed out, the phrase "I do not want to skip anything" refers to the fact that I can JSONify most of the mbox, but not the multipart or image parts. The block {for part in msg.walk(): ...} prints the "Skipping" message to show that the code did indeed skip multipart and image parts; without it, I was getting a JSON file that silently lacked the binaries for images etc. The message will not be present in the final code once I figure out how to get images and multipart content into JSON.
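
On why "multipart" shows up in the output: a multipart part is only a container, and msg.walk() yields the container first and then each of its children. Skipping the container itself therefore loses no data; only the skipped image parts are actually missing from the JSON. A minimal sketch of that behaviour, written for Python 3 (unlike the Python 2 code above) and using a made-up message built purely for illustration:

```python
from email.message import EmailMessage

# Build a small made-up message: a text body plus a fake PNG attachment.
msg = EmailMessage()
msg['Subject'] = 'demo'
msg.set_content('hello')
msg.add_attachment(b'\x89PNG fake bytes', maintype='image',
                   subtype='png', filename='pixel.png')

# walk() yields the multipart container first, then each child part.
types = [part.get_content_type() for part in msg.walk()]
print(types)  # ['multipart/mixed', 'text/plain', 'image/png']

# The container's payload is just the list of its children.
container = next(msg.walk())
print(container.is_multipart(), len(container.get_payload()))  # True 2
```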


There is 1 answer

Answer by Prakhar Sharma

See the third edition of Mining the Social Web.

I tried making a working script that not only converts MBOX to JSON but also extracts the attachments into usable formats. Link to the repo: https://github.com/PS1607/mbox-to-json

Read the README file for usage instructions.

If you want to convert it into CSV instead, change line 55 in src/main.py from df.to_json to df.to_csv
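
For readers who would rather adapt the script from the question than switch tools, here is one possible way to keep non-text parts: base64-encode their decoded bytes so they fit in a JSON string. This is only a sketch under assumptions, not the method used by the linked repository. It targets Python 3, and the helper names part_to_json and message_parts, as well as the JSON keys filename and contentBase64, are hypothetical names chosen for this example.

```python
import base64
import json
from email.message import EmailMessage


def part_to_json(part):
    """Return a JSON-serializable dict for one leaf MIME part.

    Returns None for multipart containers, which carry no data of
    their own; walk() visits their children separately.
    """
    if part.is_multipart():
        return None
    # decode=True undoes the base64 / quoted-printable transfer encoding.
    raw = part.get_payload(decode=True) or b''
    out = {
        'contentType': part.get_content_type(),
        'filename': part.get_filename(),  # None when the part has no name
    }
    if part.get_content_maintype() == 'text':
        charset = part.get_content_charset() or 'utf-8'
        try:
            out['content'] = raw.decode(charset, 'replace')
        except LookupError:
            # Unknown charset label: fall back to UTF-8.
            out['content'] = raw.decode('utf-8', 'replace')
    else:
        # Binary data (images, PDFs, ...) is stored as base64 text.
        out['contentBase64'] = base64.b64encode(raw).decode('ascii')
    return out


def message_parts(msg):
    """Collect every leaf part of a message, text and binary alike."""
    return [p for p in map(part_to_json, msg.walk()) if p is not None]


# Made-up message for demonstration only.
demo = EmailMessage()
demo.set_content('hello')
demo.add_attachment(b'\x89PNG fake bytes', maintype='image',
                    subtype='png', filename='pixel.png')

parts = message_parts(demo)
print(json.dumps(parts, indent=2))
```

With a real mailbox, the same function could be applied to each message returned by iterating over mailbox.mbox(path). Base64 inflates binary data by roughly a third, so it may be worth storing large attachments separately and keeping only a reference in the JSON.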