It looks like there is a discrepancy between the Python version and the pip-installed packages in my Glue PySpark and Ray kernel in my AWS SageMaker Studio JupyterLab Space. I first noticed the issue when trying to import IPython, which raised a ModuleNotFoundError, even though !pip list | grep ipython shows ipython 8.20.0 and !ipython --version likewise reports 8.20.0.
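For reference, these were the cells and their output:

import IPython
# ModuleNotFoundError: No module named 'IPython'

!pip list | grep ipython
# ipython 8.20.0

!ipython --version
# 8.20.0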
Upon further investigation, I did the following:
1. With the Glue PySpark and Ray kernel
import sys
print(sys.path)
which gives
['/tmp', '/tmp/spark-7a070785-711d-4791-9ed3-631d12bc29a0/userFiles-298f6845-dc16-4a6c-95e2-0245bbc35529', '/opt/amazon/spark/python/lib/pyspark.zip', '/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip', '/opt/amazon/lib/python3.6/site-packages', '/usr/lib64/python37.zip', '/usr/lib64/python3.7', '/usr/lib64/python3.7/lib-dynload', '/home/spark/.local/lib/python3.7/site-packages', '/usr/lib64/python3.7/site-packages', '/usr/lib/python3.7/site-packages']
However, !python --version gives Python 3.10.13, which does not match the Python 3.6/3.7 paths above.
I also ran pandas.__version__, which gives 1.3.2, but !pip list | grep pandas gives

pandas 2.1.4
pandas-stubs 2.1.4.231227
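One way to make the mismatch explicit is to compare the interpreter the kernel code actually runs under with the one the ! shell magics resolve to. This is only a sketch: sys.executable can be empty or misleading in remote-kernel setups such as Glue interactive sessions.

import sys
print(sys.executable)  # interpreter the notebook code runs under
print(sys.version)     # expect a 3.7.x version, matching the paths above

!which python          # the python that ! shell commands invoke
!which pip             # the pip that !pip list queries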
2. With the standard Python 3 (ipykernel)
import sys
print(sys.path)
which gives
['/home/sagemaker-user', '/opt/conda/lib/python310.zip', '/opt/conda/lib/python3.10', '/opt/conda/lib/python3.10/lib-dynload', '', '/opt/conda/lib/python3.10/site-packages']
and !python --version again gives Python 3.10.13, which this time is consistent with sys.path.
I reran pandas.__version__, which now gives 2.1.4, while !pip list | grep pandas gives

pandas 2.1.4
pandas-stubs 2.1.4.231227
which is consistent.
3. Conclusion
It therefore seems there is an issue with the Glue PySpark and Ray kernel: the Python interpreter the kernel executes code with is a different installation from the one pip manages, so many of the pip-installed packages cannot be found.
I did find a similar question posted here: Conflicting Python versions in SageMaker Studio notebook with Python 3.8 kernel, but the accepted answer isn't really helping in my case.
Any assistance is greatly appreciated. Am I missing something simple here, or has anyone else come across this issue with the Glue PySpark and Ray kernel?
UPDATE: I've created an issue here as well. The team is aware of it and will update once a fix has been implemented.
I was able to reproduce the same issue: I could not install additional packages into the "Glue PySpark and Ray" environment through the normal approaches.
As a workaround, I found it is still possible to install additional packages by executing Python code inside the kernel, along the lines of the sketch below.
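A minimal sketch of the idea (assumptions: sys.executable is set in the session; /tmp works as an install target because it is already on sys.path in this kernel, as shown in the question; ipython is just an example package):

import subprocess
import sys

# Invoke pip from inside the kernel process so it targets the interpreter
# the notebook code actually runs under, not the one "!pip" resolves to.
# /tmp is chosen as the target because it is already on sys.path here.
result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "--target", "/tmp", "ipython"],
    capture_output=True,
    text=True,
)
print(result.stdout)
print(result.stderr)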
I will also report this issue to the team.