My OS is Windows 10 64-bit, and I use Anaconda with Python 3.8 64-bit. I am trying to develop a Hadoop File System 3.3 client with the PyArrow module. Installing PyArrow with conda on Windows 10 succeeds:
> conda install -c conda-forge pyarrow
But connecting to HDFS 3.3 with PyArrow throws errors like the ones below:
import pyarrow as pa
fs = pa.hdfs.connect(host='localhost', port=9000)
The errors are
Traceback (most recent call last):
File "C:\eclipse-workspace\PythonFredProj\com\aaa\fred\hdfs3-test.py", line 14, in <module>
fs = pa.hdfs.connect(host='localhost', port=9000)
File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 208, in connect
fs = HadoopFileSystem(host=host, port=port, user=user,
File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 38, in __init__
_maybe_set_hadoop_classpath()
File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 136, in _maybe_set_hadoop_classpath
classpath = _hadoop_classpath_glob(hadoop_bin)
File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 163, in _hadoop_classpath_glob
return subprocess.check_output(hadoop_classpath_args)
File "C:\Python-3.8.3-x64\lib\subprocess.py", line 411, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "C:\Python-3.8.3-x64\lib\subprocess.py", line 489, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\Python-3.8.3-x64\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Python-3.8.3-x64\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
OSError: [WinError 193] %1 is not a valid win32 application
I installed Visual C++ 2015 on Windows 10, but the same errors are still shown.
This is my solution.
Before starting with PyArrow, Hadoop 3 has to be installed on your Windows 10 64-bit machine, and the Hadoop installation path has to be added to the Path environment variable.
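A quick way to check both of these from Python is a minimal sketch like the one below; it only assumes that HADOOP_HOME points at the Hadoop 3 folder and that %HADOOP_HOME%\bin was added to Path.
import os
import shutil

# Assumption: HADOOP_HOME points at the Hadoop 3 installation folder
# and %HADOOP_HOME%\bin has been added to Path.
print("HADOOP_HOME:", os.environ.get("HADOOP_HOME"))

# shutil.which returns None when the hadoop launcher cannot be found on Path.
print("hadoop on Path:", shutil.which("hadoop"))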
Install PyArrow 3.0 (the version is important; it has to be 3.0):
pip install pyarrow==3.0
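To make sure the interpreter used by Eclipse really picks up that build, you can print the version:
import pyarrow

# Expect "3.0.0" here if the pinned version was installed.
print(pyarrow.__version__)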
Create a PyDev module in the Eclipse PyDev perspective. The sample code is like below:
from pyarrow import fs

hadoop = fs.HadoopFileSystem("localhost", port=9000)
print(hadoop.get_file_info('/'))
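If the connection works, the print call shows a FileInfo entry for the HDFS root directory ('/').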
Choose the PyDev module you created and click [Properties (Alt + Enter)].
Click [Run/Debug Settings]. Choose the PyDev module and click the [Edit] button.
In the [Edit Configuration] window, select the [Environment] tab.
Click the [Add] button.
You have to create 2 environment variables: "CLASSPATH" and "LD_LIBRARY_PATH".
Copy the returned values and paste them into the Value text field (the returned values are long strings, but copy them all).
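If you do not want to go through the Eclipse dialogs, the same two variables can also be set at the top of the script before the filesystem is created. This is a minimal sketch: it assumes the CLASSPATH value is the long string printed by running hadoop classpath --glob in a command prompt, and that LD_LIBRARY_PATH points at Hadoop's native library folder (the path below is only an example; use your own installation path).
import os

# Assumptions (example values -- replace with your own installation):
# - CLASSPATH holds the long string printed by "hadoop classpath --glob".
# - LD_LIBRARY_PATH points at the folder with Hadoop's native libraries,
#   for example %HADOOP_HOME%\lib\native.
os.environ["CLASSPATH"] = "<paste the output of: hadoop classpath --glob>"
os.environ["LD_LIBRARY_PATH"] = r"C:\hadoop-3.3.0\lib\native"

from pyarrow import fs

hadoop = fs.HadoopFileSystem("localhost", port=9000)
print(hadoop.get_file_info('/'))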