I am using IBM Cloud Object Storage and want to read a PDF file from storage and store its text content as a string

I am using ibm_boto3 as described in the IBM COS documentation. I have defined the resource as follows:

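import ibm_boto3
from ibm_botocore.client import Config, ClientError

# COS_API_KEY_ID, SERVICE_INSTANCE_ID, COS_AUTH_ENDPOINT and COS_ENDPOINT are defined elsewhere
cos = ibm_boto3.resource("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=SERVICE_INSTANCE_ID,
    ibm_auth_endpoint=COS_AUTH_ENDPOINT,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT
)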

This is the code I am using to get the content of the PDF file:

def get_item(bucket_name, item_name):
    print("Retrieving item from bucket: {0}, key: {1}".format(bucket_name, item_name))
    try:
        file = cos.Object(bucket_name, item_name).get()
        file_content = file["Body"].read() #returns data in bytes
        #print("\nFILE:-------------------------\n", file) #shows the meta data of the object
        return file_content
    except ClientError as be:
        print("CLIENT ERROR: {0}\n".format(be))
    except Exception as e:
        print("Unable to retrieve file contents: {0}\n".format(e))

file["Body"] is an ibm_botocore.response.StreamingBody object, and read() returns its data as bytes. I am not able to convert those bytes to a string. I have tried decoding them as UTF-8 and as base64, but neither works. When I decode with UTF-8, I get the following error:

Unable to retrieve file contents: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
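
The error message itself hints at what is happening. A PDF is a binary file format rather than encoded text, and many PDF writers put a comment line containing bytes above 0x7F right after the `%PDF-` header so that tools treat the file as binary. The sketch below uses only the standard library and a made-up header (an assumption for illustration, not the actual file from the bucket) to show how such bytes lead to this kind of UTF-8 error. The helper names `looks_like_pdf` and `first_undecodable_byte` are hypothetical.

```python
# Hypothetical sample: a PDF-style header followed by a binary marker comment.
# This is illustrative data, not bytes read from IBM COS.
SAMPLE_HEADER = b"%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF magic marker appears near the start of the data."""
    return b"%PDF-" in data[:1024]


def first_undecodable_byte(data: bytes):
    """Return (position, byte value) of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


if __name__ == "__main__":
    print(looks_like_pdf(SAMPLE_HEADER))
    print(first_undecodable_byte(SAMPLE_HEADER))
```

For this sample, the first undecodable byte is 0xb5 at position 11, the same byte and position as in the error above. Running `looks_like_pdf` on the bytes returned by `get_item` would be one way to check whether the object really is a raw PDF.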

I am also unable to figure out what type of encoding is used by IBM COS.
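
Object storage generally hands back the same bytes that were uploaded, so the issue is likely the file format rather than an encoding applied by COS. If the object is a PDF, one option is to pass the bytes to a PDF parser instead of calling `decode()`. A minimal sketch, assuming the third-party pypdf package is installed (it is not mentioned in the original post) and using a hypothetical helper name `pdf_bytes_to_text`:

```python
import io


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract text from PDF bytes, e.g. the result of file["Body"].read()."""
    # Cheap sanity check: the PDF marker should appear near the start.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("data does not look like a PDF")

    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so the check above still works in environments without it.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(data))
    # extract_text() may return an empty string for pages without a text
    # layer (for example scanned images).
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

With the function from the post, usage would look like `text = pdf_bytes_to_text(get_item(bucket_name, item_name))`. PDFs that contain only scanned images have no text layer, so they would need OCR rather than a parser.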

Thanks in advance.
