How can I install packages that are not available on an on-demand HDInsight cluster when a Spark script is run via Azure Data Factory (ADF)?
There is an older, unanswered question here: Custom script action in Azure Data Factory HDInsight Cluster.
How can I do a pip install inside my PySpark script, or is there another way?
My PySpark script runs on an on-demand HDInsight cluster via ADF and loads data from a CSV blob into Azure MySQL. (This is a proof-of-concept scenario, so I have to stick with HDInsight for now; Databricks is not an option.)
You can include a pip install command in your PySpark script. You can also use the Script Action feature in HDInsight to run a custom script that installs the required Python packages before your Spark script executes.
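For the first option, a minimal sketch of running pip from inside the PySpark script itself (the helper name `pip_install` and the `mysql-connector-python` package are examples, not from the original answer; note this only installs on the node where it runs, typically the driver, which is why a script action is the better fit for packages the executors need):

```python
import subprocess
import sys


def pip_install(package):
    """Install a package at runtime using the interpreter running this script.

    Caveat: this installs only on the current node (usually the driver);
    worker nodes will not see the package. For cluster-wide installs,
    use an HDInsight script action instead.
    """
    return subprocess.check_call([sys.executable, "-m", "pip", "install", package])


# Example for the CSV-to-MySQL load (hypothetical package choice):
# pip_install("mysql-connector-python")
```

Calling this at the top of the PySpark script, before any import of the installed package, keeps the rest of the job unchanged.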
See the Microsoft documentation: Customize Azure HDInsight clusters by using script actions.
Learn more about Example script action scripts, Permissions, Access control, and Script action during cluster creation.
Learn more about managing Python packages on the cluster.
When you create the HDInsight cluster, you can add script actions, which invoke custom scripts to customize the cluster. These scripts are used to install additional components and change configuration settings.
Learn more about Script action during cluster creation.
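A script action is just a bash script that HDInsight runs on the cluster nodes. A minimal sketch, assuming the Anaconda installation at `/usr/bin/anaconda` that ships on HDInsight Spark clusters, with a hypothetical package for the MySQL load:

```shell
#!/usr/bin/env bash
# Hypothetical script action: install extra Python packages on each node.
# The /usr/bin/anaconda path and the package name are assumptions for
# illustration; adjust them for your cluster image and job.
set -euo pipefail

PIP=/usr/bin/anaconda/bin/pip

sudo "$PIP" install mysql-connector-python
```

Upload the script to blob storage and reference its URI in the script actions of the on-demand HDInsight linked service definition in ADF, so the packages are installed each time the cluster is provisioned, before the Spark activity runs.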