Synapse Pipeline azure - convert xlsx to csv

76 views Asked by At

I'm really new to Azure, I started to set up the pipeline but I'm still stuck

I have xlsx format files stored in an Azure storage account that contain one or more sheets.

I want to set up a pipeline that will go through all the folders and subfolders of the specified path to convert the xlsx files into csv files.

Two figs are possible if the 'AAA' file contains three tabs 'Sheet1', 'Sheet2' and 'Sheet3', the pipeline must generate 3 csv files

  • 'AAA_sheet1'
  • 'AAA_sheet2'
  • 'AAA_sheet3' which will be stored in the specific subfolder

the second case the 'AAA' file contains only one tab 'Sheet1' so a single csv file will be generated: 'AAA_Sheet1'

Thanks for your help

I would like to be helped to solve this problem

1

There are 1 answers

0
Bhavani On

Directly, you cannot get the sheet names in ADF. You can run the Python code in a Synapse notebook mentioned below to get sheet names from Excel files:

from azure.storage.blob import BlobServiceClient
from openpyxl import load_workbook

account_name = '<storageAccountName>'
account_key = '<accesskey>'
container_name = '<containerName>'
blob_name = '<directory>/AAA.xlsx'
blob_service_client = BlobServiceClient(account_url=f"https://{account_name}.blob.core.windows.net", credential=account_key)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
local_temp_file = 'temp_excel_file.xlsx'
with open(local_temp_file, 'wb') as file:
    data = blob_client.download_blob()
    data.readinto(file)
workbook = load_workbook(filename=local_temp_file, read_only=True)
sheet_names = workbook.sheetnames
workbook.close()
mssparkutils.notebook.exit(sheet_names)

You will get the sheet names as shown below:

Run the notebook using notebook activity. Add a foreach activity to the notebook activity and use the below expression for the item:

enter image description here @json(activity('Notebook1').output.status.Output.result.exitValue)

Add a copy activity inside foreach, add Excel as source, and delimited as sink datasets with the parameter sheetName and fileName with @item() expression. Debug the pipeline; it will run successfully as shown below:

enter image description here

And the sheets copy successfully in .CSV format as shown below:

enter image description here

Here is the pipeline JSON for your reference:

{
    "name": "Pipeline 1",
    "properties": {
        "activities": [
            {
                "name": "Notebook1",
                "type": "SynapseNotebook",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "notebook": {
                        "referenceName": "Notebook 2",
                        "type": "NotebookReference"
                    },
                    "snapshot": true,
                    "sparkPool": {
                        "referenceName": "spark",
                        "type": "BigDataPoolReference"
                    },
                    "executorSize": "Small",
                    "driverSize": "Small"
                }
            },
            {
                "name": "ForEach1",
                "type": "ForEach",
                "dependsOn": [
                    {
                        "activity": "Notebook1",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@json(activity('Notebook1').output.status.Output.result.exitValue\n)",
                        "type": "Expression"
                    },
                    "isSequential": true,
                    "activities": [
                        {
                            "name": "Copy data1",
                            "type": "Copy",
                            "dependsOn": [],
                            "policy": {
                                "timeout": "0.12:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "source": {
                                    "type": "ExcelSource",
                                    "storeSettings": {
                                        "type": "AzureBlobFSReadSettings",
                                        "recursive": true,
                                        "enablePartitionDiscovery": false
                                    }
                                },
                                "sink": {
                                    "type": "DelimitedTextSink",
                                    "storeSettings": {
                                        "type": "AzureBlobFSWriteSettings"
                                    },
                                    "formatSettings": {
                                        "type": "DelimitedTextWriteSettings",
                                        "quoteAllText": true,
                                        "fileExtension": ".txt"
                                    }
                                },
                                "enableStaging": false,
                                "translator": {
                                    "type": "TabularTranslator",
                                    "typeConversion": true,
                                    "typeConversionSettings": {
                                        "allowDataTruncation": true,
                                        "treatBooleanAsNumber": false
                                    }
                                }
                            },
                            "inputs": [
                                {
                                    "referenceName": "Excel1",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "sheetName": {
                                            "value": "@item()",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "DelimitedText1",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "fileName": {
                                            "value": "@item()",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "annotations": [],
        "lastPublishTime": "2024-01-12T07:14:07Z"
    },
    "type": "Microsoft.Synapse/workspaces/pipelines"
}