Access files in the data lake via the open method (abfss)


In the past we used a mount point to read files from the data lake with open. We don't want to do that anymore; instead we want to use the external location path (abfss).

The code below is not working: No such file or directory.

with open('abfss://urlofcloudstorage/container/file.txt') as f:
    data = f.read()

I just became aware that the open method only works with local files and cannot read anything from abfss.

What would be a solution to read the file from the data lake? I have seen one option, dbutils.fs.cp, but I don't really want to copy the files locally. Any advice?

UPDATE: I also tried dbutils.fs.cp, but since I'm using a shared access mode cluster, it is not supported.

  def decrypt_csv_file_to_pandas(self, source_path, pgp_passphrase, csv_separator):
    """
    Decrypt a csv file directly into a pandas dataframe.
    """
    # open() only resolves local (or DBFS-mounted) paths, so an abfss:// URI fails here
    # with "No such file or directory".
    with open(source_path, 'rb') as f:
      decrypted = self.gpg.decrypt_file(
        file=f,
        passphrase=pgp_passphrase
      )
      print(decrypted.status)
      df_pd = pd.read_csv(io.StringIO(str(decrypted)), sep=csv_separator, low_memory=False, keep_default_na=False)
      return df_pd
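
For illustration, a hypothetical call that reproduces the error (the object name, secret scope and separator below are placeholders, not my actual values):

# Hypothetical usage: passing an abfss URI to the method above makes the
# built-in open() fail with "No such file or directory".
df = decryptor.decrypt_csv_file_to_pandas(
    source_path="abfss://urlofcloudstorage/container/file.txt",
    pgp_passphrase=dbutils.secrets.get("my-scope", "pgp-passphrase"),
    csv_separator=";",
)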

1 Answer

Answer by JayashankarGS:

Install the Python package adlfs in the Databricks library tab or use the command below:

pip install adlfs

Then, use the following code:

from adlfs import AzureBlobFileSystem

# Storage account key (redacted), plus the container and file to read.
key = "z9XY91xxxxxxxxxxxxxxxxxyyyyyyyyyyyy"
container_name = "data"
file_path = "pdf/titanic.csv"

# Authenticate against the storage account, then open the file like a local one.
abfs = AzureBlobFileSystem(account_name="jadls2", account_key=key)

with abfs.open(f"{container_name}/{file_path}", "r") as f:
    print(f.read())

Here, I have provided the account key in the configuration, but I would not recommend that. Instead, use a SAS token or a service principal.
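
For example, a minimal sketch of authenticating with a service principal or a SAS token instead (the tenant, client and secret values are placeholders you would normally pull from a Databricks secret scope):

from adlfs import AzureBlobFileSystem

# Sketch: service principal credentials instead of the account key.
# All values below are placeholders.
abfs = AzureBlobFileSystem(
    account_name="jadls2",
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)

# Or with a SAS token:
# abfs = AzureBlobFileSystem(account_name="jadls2", sas_token="<sas-token>")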

Check the adlfs documentation for more information on the arguments for different credential types.

Output:

(Screenshot of the notebook output showing the contents of titanic.csv printed.)

I am only printing the file data here; in your case, you would decrypt it and read it into a pandas DataFrame, as sketched below.
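
As a rough sketch (assuming self.gpg and the rest of your class stay as they are; the extra abfs parameter for an authenticated filesystem is my addition), your method could read the encrypted file through adlfs instead of the built-in open:

import io
import pandas as pd

def decrypt_csv_file_to_pandas(self, abfs, source_path, pgp_passphrase, csv_separator):
    """
    Decrypt a PGP-encrypted csv in the data lake directly into a pandas dataframe.
    abfs is an authenticated adlfs.AzureBlobFileSystem; source_path is "container/path/to/file.csv".
    """
    # Open the remote file in binary mode via adlfs instead of the built-in open().
    with abfs.open(source_path, "rb") as f:
        decrypted = self.gpg.decrypt_file(file=f, passphrase=pgp_passphrase)
        print(decrypted.status)
        return pd.read_csv(
            io.StringIO(str(decrypted)),
            sep=csv_separator,
            low_memory=False,
            keep_default_na=False,
        )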