I am trying to follow this lesson: https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html

  1. Method 1: from Anaconda on Windows

I downloaded the Jupyter notebook to my Downloads folder, then started Jupyter Notebook via Anaconda.

When I run the line

!$HOME/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:$SPARK_VERSION

it raises the error: "'$HOME' is not recognized as an internal or external command, operable program or batch file."

I followed all the steps in https://stackoverflow.com/a/40514875/12544460, but it still does not work.
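From what I can tell, the ! magic in a Windows Jupyter kernel hands the line to cmd.exe, which does not expand the Unix $HOME syntax, so cmd.exe tries to run the literal text "$HOME/..." as a program. One workaround I tried is to build the path in Python first and let IPython substitute it into the shell line; this is only a sketch, and the sbin location and Spark version below are my guesses for a pip-installed PySpark:

    import os

    # cmd.exe does not expand $HOME; resolve the path in Python instead.
    # Hypothetical path and version - adjust to the actual environment.
    SPARK_HOME = r"C:\Users\name\anaconda3\envs\pyspark_env\Lib\site-packages\pyspark"
    SPARK_VERSION = "3.5.1"  # replace with the installed Spark version

    script = os.path.join(SPARK_HOME, "sbin", "start-connect-server.sh")
    print(os.path.exists(script))  # check the script really is at this path

IPython should substitute the Python variables in a line like !{script} --packages org.apache.spark:spark-connect_2.12:{SPARK_VERSION}, but even then I suspect a .sh script cannot run under plain cmd.exe and needs something like WSL or Git Bash, which may be the real issue.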

  • I start Jupyter Notebook in the Anaconda prompt from C:\Users\name\

  • the downloaded notebook is in C:\Users\name\Downloads

  • my location for Spark Connect is:

C:\Users\name\anaconda3\envs\pyspark_env\Lib\site-packages\pyspark (I already set HOME="" in the Anaconda prompt; see the check after this list)
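Since I set HOME to an empty string, I also checked what the notebook kernel actually sees; this is a minimal check, and the value I assign at the end is just the path from my setup above:

    import os

    # See what HOME resolves to inside the notebook kernel.
    # Note: cmd.exe would expand %HOME%, not $HOME, even when this is set.
    print(repr(os.environ.get("HOME")))

    os.environ["HOME"] = r"C:\Users\name"  # point HOME at a real directory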

For Method 1: how do I fix the HOME location?

  2. Method 2: run PySpark from the extracted Spark folder on Ubuntu

I am following this lesson: https://spark.apache.org/docs/latest/spark-connect-overview.html

The very first lines work fine until I reach the step "spark = SparkSession.builder.getOrCreate()", which always raises an error like "ImportError: Pandas >= 1.0.5 must be installed; however, it was not found.". Note that this is a freshly extracted Spark folder. I then installed pandas via "pip install pandas"; the install succeeded, but the same error is still raised. I also tried multiple times to find where to put the pandas zip (or the extracted package) inside the Spark folder, but that did not work either.
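Since pandas is an ordinary Python package, my current suspicion is that it does not belong inside the Spark folder at all, and that "pip install pandas" simply went into a different Python than the one bin/pyspark launches. This is how I checked from inside the PySpark shell (a minimal check, nothing Spark-specific):

    import sys
    print(sys.executable)      # the interpreter PySpark is actually running on

    import pandas              # raises ImportError if this interpreter lacks pandas
    print(pandas.__version__)  # the error message asks for >= 1.0.5

If sys.executable points at a different Python, installing with that exact interpreter ("/path/to/that/python -m pip install pandas") or setting PYSPARK_PYTHON to the environment that has pandas is what I planned to try next, but I am not sure this is the proper fix.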

For Method 2, what is the proper way to fix this problem?
