AWS SageMaker Studio JupyterLab Space: Glue PySpark and Ray kernel Python and pip version mismatch


There appears to be a discrepancy between the Python version and the pip-installed packages in the Glue PySpark and Ray kernel in my AWS SageMaker Studio JupyterLab Space. I first noticed it when import IPython raised a ModuleNotFoundError, even though !pip list | grep ipython shows ipython 8.20.0 and !ipython --version also reports 8.20.0.

To investigate further, I ran the following checks.

1. With the Glue Pyspark and Ray kernel

import sys
print(sys.path)

which gives

['/tmp', '/tmp/spark-7a070785-711d-4791-9ed3-631d12bc29a0/userFiles-298f6845-dc16-4a6c-95e2-0245bbc35529', '/opt/amazon/spark/python/lib/pyspark.zip', '/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip', '/opt/amazon/lib/python3.6/site-packages', '/usr/lib64/python37.zip', '/usr/lib64/python3.7', '/usr/lib64/python3.7/lib-dynload', '/home/spark/.local/lib/python3.7/site-packages', '/usr/lib64/python3.7/site-packages', '/usr/lib/python3.7/site-packages']

However, !python --version gives 3.10.13, which does not match the Python 3.7 paths in sys.path.

I also checked pandas.__version__, which gives 1.3.2, but !pip list | grep pandas gives

pandas        2.1.4
pandas-stubs  2.1.4.231227

2. With the standard Python 3 (ipykernel)

import sys
print(sys.path)

which gives

['/home/sagemaker-user', '/opt/conda/lib/python310.zip', '/opt/conda/lib/python3.10', '/opt/conda/lib/python3.10/lib-dynload', '', '/opt/conda/lib/python3.10/site-packages']

and !python --version gives 3.10.13, which is consistent with sys.path.

Here pandas.__version__ gives 2.1.4, and !pip list | grep pandas gives

pandas        2.1.4
pandas-stubs  2.1.4.231227

which is consistent.

3. Conclusion

So the problem appears specific to the Glue PySpark and Ray kernel: the interpreter that runs the cell code (Python 3.7, pandas 1.3.2) is a different installation from the one that pip and the ! shell commands see (Python 3.10.13, pandas 2.1.4). As a result, many pip-installed packages cannot be imported in the kernel.
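The comparison above can be collected into one cell. The following is a minimal sketch, not something from the original investigation: describe_interpreter is a hypothetical helper name, and it reads module versions through __version__ rather than importlib.metadata so that it also runs on the Python 3.7 interpreter that sys.path points to. Running it in both kernels and comparing the result with the output of !python --version and !pip list should show whether the two environments differ.

```python
import importlib
import sys


def describe_interpreter(modules=("pandas", "IPython")):
    """Report which interpreter runs this cell and which module versions it can import.

    describe_interpreter is a hypothetical helper for this question,
    not part of SageMaker, Glue, or IPython.
    """
    report = {
        # Path of the Python binary executing this code (may be empty in some embedded setups).
        "executable": sys.executable,
        # Version of the interpreter executing this code, e.g. "3.7.10".
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "modules": {},
    }
    for name in modules:
        try:
            module = importlib.import_module(name)
            # Not every module defines __version__, so fall back to "unknown".
            report["modules"][name] = getattr(module, "__version__", "unknown")
        except ImportError:
            # The module is not importable from this interpreter's sys.path.
            report["modules"][name] = None
    return report


print(describe_interpreter())
```

A value of None for a module that !pip list reports as installed would indicate that pip is managing a different environment from the one the kernel imports from.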

I did find a similar question, "Conflicting Python versions in SageMaker Studio notebook with Python 3.8 kernel", but the accepted answer does not help in my case.

Any assistance is greatly appreciated. Am I missing something simple here, or has anyone else come across this issue with the Glue PySpark and Ray kernel?

UPDATE: I've created an issue here as well. The team is aware of it and will update once a fix has been implemented.


There are 2 answers

Tomonori Shimomura

I was able to reproduce the same issue. I could not install additional packages into the "Glue PySpark and Ray" environment with the usual approaches.

As a workaround, I found that it is still possible to install additional packages by running Python code like this:

import pip
pip.main(['install', "ipython"])
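Note that pip does not document pip.main as a supported programmatic interface, and its availability has varied between pip versions. The pip documentation recommends invoking pip as a subprocess of the running interpreter instead. The following is a sketch of that variant, assuming sys.executable points at the kernel's own interpreter and that pip is installed for it; it has not been verified on the Glue PySpark and Ray kernel, and pip_install_command and pip_install are hypothetical helper names.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter that runs this code."""
    # Using "sys.executable -m pip" targets the same Python that executes the cell,
    # rather than whichever pip happens to be first on the shell's PATH.
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Install a package into the running interpreter's environment."""
    # check_call raises CalledProcessError if pip exits with a non-zero status.
    subprocess.check_call(pip_install_command(package))


# Example (commented out so the cell has no side effects until you opt in):
# pip_install("ipython")
```

Because the command is built from sys.executable, the package should end up where the kernel's import statement looks, which is the mismatch described in the question.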

I would like to report this issue to the team.

Tomonori Shimomura

You can install additional packages into the Glue PySpark and Ray kernel by running the following magic command before any other Python cells.

%additional_python_modules pandas

See also: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html