Read an object from Alibaba Cloud OSS and modify it using pandas in Python


My data is stored as CSV files in an Alibaba Cloud OSS bucket. I currently run a Python script that works as follows:

  1. Download the file to my local machine.
  2. Make the changes with a Python script on my local machine.
  3. Store the result in AWS.

I need to change this approach and schedule a cron job in Alibaba Cloud so that the script runs automatically. The Python script will be uploaded to Task Management in Alibaba Cloud.

So the new steps will be:

  1. Read a file from the OSS bucket into Pandas.
  2. Modify it in pandas (merge it with other data and change some columns).
  3. Store the modified file into AWS RDS.

I am stuck at the first step. The error log reports:

"No module found" for oss2 and pandas.

What is the correct way of doing it?
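
A "No module found" error usually means the interpreter that runs the script has no copy of that package, which is plausible here because the script moved from a local machine to a scheduled task environment. A minimal sketch for confirming which imports are missing is below. The module list is taken from the script's own imports; the helper name `find_missing` is made up for illustration, and how packages get installed in the Alibaba Cloud task environment is not stated in the question, so that part remains an open assumption.

```python
import importlib.util

# Third-party modules the script imports (from the import list in the question).
REQUIRED_MODULES = ["oss2", "pandas", "mysql.connector"]


def find_missing(modules):
    """Return the names in `modules` that this interpreter cannot import."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # Raised for dotted names when the parent package is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable")
```

On a machine with pip, these packages are normally installed with `pip install oss2 pandas mysql-connector-python`; whether the task environment allows that is something to verify there.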

This is a rough draft of my script (this version runs on my local machine):

import os,re
import oss2  # throws an error: No module found
import datetime as dt
import pandas as pd  # throws an error: No module found
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
dates = (dt.datetime.now()+dt.timedelta(days=-1)).strftime("%Y%m%d")
def download_file(access_key_id,access_key_secret,endpoint,bucket):

    #Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)

    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)

    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates+'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file)) # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName

There is 1 answer

Answer by Nicolas Ang:
import os,re
import oss2 
import datetime as dt
import pandas as pd 
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice

import io  # standard-library module, needed for io.BytesIO below

dates = (dt.datetime.now()+dt.timedelta(days=-1)).strftime("%Y%m%d")
def download_file(access_key_id,access_key_secret,endpoint,bucket):

    #Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)

    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)

    # Download the file
    objectName = None  # stays None if no object is read
    try:
        # List all objects whose key starts with yesterday's date followed by 'order'.
        for obj in oss2.ObjectIterator(bucket, prefix=dates+'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]


            raw_bytes = bucket.get_object(order_file).read()  # read the whole object from OSS into memory
            csv_buf = io.BytesIO(raw_bytes)  # wrap the bytes in a file-like buffer

            df = pd.read_csv(csv_buf)  # parse the CSV into a DataFrame
            # FUNCTION to modify and upload
        print("File downloaded")
    except Exception as exc:
        # Show the actual error instead of hiding it
        print("Pls check!!! File not read:", exc)
    return objectName
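
The key change in the code above is reading the object fully into memory and handing pandas an `io.BytesIO` buffer, so nothing is written to local disk. The same step can be sketched in isolation, without any OSS connection. The function name `csv_bytes_to_frame` and the sample columns are hypothetical; the sample bytes stand in for what `bucket.get_object(key).read()` would return.

```python
import io

import pandas as pd


def csv_bytes_to_frame(raw: bytes) -> pd.DataFrame:
    """Parse raw CSV bytes into a DataFrame without touching the local disk."""
    # BytesIO gives the bytes a file-like interface (read, seek) for pandas to parse.
    return pd.read_csv(io.BytesIO(raw))


# Stand-in for `bucket.get_object(key).read()`; the columns are made up.
sample = b"order_id,amount\n1,9.50\n2,20.00\n"
df = csv_bytes_to_frame(sample)
print(df.shape)
```

This loads the whole file into memory at once, which is fine for modest CSV files but may need rethinking for very large objects.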