Use Python to read/write a large HDF5 file on a remote server


I am currently maintaining a large data file (over 10 GB) on a Linux server, where it is updated daily. The data is stored in HDF5 format with groups and datasets.

Now I have a local Windows client that needs the data for some presentations (the presentation tool is customized and can only run locally on Windows). My colleagues also depend on this data and will modify it as well, so the file is constantly being modified and updated from different clients.

Since the file is large and I need to ensure that the data on the server (and the data I use) is always up to date (including the parts modified by my colleagues), it is not feasible to download the data to the local machine via SCP, modify it and then SCP it back to the server.

So I considered two solutions. To test them, I also created a CSV file on the server under the same path, /home/shared/test.csv.

  1. Method 1: Use Python directly and obtain a handle to the HDF5 file through the paramiko library.
import pandas as pd
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('', username='', password='')  # use username and password to connect
sftp = client.open_sftp()
remote_file = sftp.open("/home/shared/test.csv")
print(pd.read_csv(remote_file))
#      Code  Value1  Value2
# 0  M000001    10.0    20.0
# 1  M000002     5.5     2.7
# 2  F000003    -5.0    47.0

This works well for the CSV file, since pd.read_csv accepts any file-like object. But when I try the same approach on the HDF5 file:

In [24]: remote_file = sftp.open("/home/shared/db.h5")
In [25]: df = pd.read_hdf(remote_file, '/Data_20230710')
Traceback (most recent call last):
  File "D:\miniconda3\envs\dev\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-25-8d766baf522b>", line 1, in <module>
    pd.read_hdf(remote_file, '/Data_20230710')
  File "D:\miniconda3\envs\dev\lib\site-packages\pandas\io\pytables.py", line 407, in read_hdf
    raise NotImplementedError(
NotImplementedError: Support for generic buffers has not been implemented.
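The only workaround I have found on this route is to first download the whole file to a temporary location and read it locally, which is exactly what I want to avoid for a 10 GB file. A sketch of that fallback (it reuses the sftp client from the snippet above; paths and the dataset key are the same as before):

import os
import tempfile
import pandas as pd

# Full transfer over SFTP -- this works, but it downloads the entire 10 GB file
tmp_path = os.path.join(tempfile.gettempdir(), 'db.h5')
sftp.get('/home/shared/db.h5', tmp_path)   # `sftp` is the paramiko SFTP client opened above

df = pd.read_hdf(tmp_path, '/Data_20230710')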

Then I tried PyTables (imported as tb) as a lower-level API to do the same thing:

In [26]: with tb.open_file(remote_file, 'r') as f:
             node = f.root.Data_20230710
             print(node.name)
Traceback (most recent call last):
  File "D:\miniconda3\envs\dev\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-2496f9b76c1f>", line 1, in <module>
    with tb.open_file(remote_file, 'r') as f:
  File "D:\miniconda3\envs\dev\lib\site-packages\tables\file.py", line 265, in open_file
    filename = os.fspath(filename)
TypeError: expected str, bytes or os.PathLike object, not SFTPFile

After some searching, I found an answer to the following question: Read h5 file from remote

But that's how you only get a remote file handle (which you can stream, seek and do whatever else you would do to your local file); sadly, on second look - HDFStore expects a path to the file and performs all the file handling through PyTables, so unless you want to hack PyTables to work with remote data (and you don't), your best bet is to install sshfs and mount your remote file system to your local one, and then let Pandas treat the remote files as local ones
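Before giving up on the file-handle idea entirely, I also wondered about h5py: as far as I know, h5py (2.9+) can open Python file-like objects directly, and paramiko's SFTPFile supports read/seek/tell, so something like the sketch below might work. I have not verified it, and since the file is written by pandas/PyTables I would presumably only get raw groups and arrays back rather than DataFrames, with every chunk read going over SFTP:

import h5py

# `sftp` is the paramiko SFTP client opened above
remote_file = sftp.open('/home/shared/db.h5', 'rb')

# h5py can wrap a seekable file-like object (untested here; performance over SFTP is unclear)
with h5py.File(remote_file, 'r') as f:
    print(list(f.keys()))        # should list the top-level groups, e.g. 'Data_20230710'
    node = f['/Data_20230710']
    print(node)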

So I tried the second method:

  2. Method 2: Mount the server folder on Windows and access the data through SSH/SFTP.

I installed WinFsp and SSHFS-Win and successfully mounted the remote server folder as Z:\, so far so good. Then I again used the CSV file as a first test:

import pandas as pd
df = pd.read_csv('Z:/shared/test.csv')

# success

However, as before, with the HDF5 file:

df = pd.read_hdf('Z:/shared/db.h5', '/Data_20230710')

Traceback (most recent call last):
  File "D:\miniconda3\envs\dev\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-896cfc7e2420>", line 1, in <module>
    df = pd.read_hdf('Z:/shared/db.h5', '/Data_20230710')
  File "D:\miniconda3\envs\dev\lib\site-packages\pandas\io\pytables.py", line 420, in read_hdf
    store = HDFStore(path_or_buf, mode=mode, errors=errors, **kwargs)
  File "D:\miniconda3\envs\dev\lib\site-packages\pandas\io\pytables.py", line 579, in __init__
    self.open(mode=mode, **kwargs)
  File "D:\miniconda3\envs\dev\lib\site-packages\pandas\io\pytables.py", line 731, in open
    self._handle = tables.open_file(self._path, self._mode, **kwargs)
  File "D:\miniconda3\envs\dev\lib\site-packages\tables\file.py", line 300, in open_file
    return File(filename, mode, title, root_uep, filters, **kwargs)
  File "D:\miniconda3\envs\dev\lib\site-packages\tables\file.py", line 750, in __init__
    self._g_new(filename, mode, **params)
  File "tables\hdf5extension.pyx", line 366, in tables.hdf5extension.File._g_new
  File "D:\miniconda3\envs\dev\lib\site-packages\tables\utils.py", line 138, in check_file_access
    path = Path(filename).resolve()
  File "D:\miniconda3\envs\dev\lib\pathlib.py", line 1215, in resolve
    s = self._flavour.resolve(self, strict=strict)
  File "D:\miniconda3\envs\dev\lib\pathlib.py", line 215, in resolve
    s = self._ext_to_normal(_getfinalpathname(s))
OSError: [WinError 1005] The volume does not contain a recognized file system.
Please make sure that all required file system drivers are loaded and that the volume is not corrupted.:'Z:\\shared'
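To narrow this down, one diagnostic I have in mind (not yet run): the OSError above comes from PyTables calling pathlib's resolve() on the mapped drive, so plain Python I/O, and possibly h5py, might not trip over the same check. Whether the HDF5 C library itself is happy with the SSHFS-Win volume is another question:

import h5py

# Plain Python I/O on the mount already works (the CSV test above), so check the HDF5 file too
with open('Z:/shared/db.h5', 'rb') as fh:
    print(fh.read(8))            # should be the HDF5 signature b'\x89HDF\r\n\x1a\n'

# Hypothetical: h5py does not call Path.resolve() the way PyTables does,
# so it may avoid the WinError 1005, but I have not confirmed this
with h5py.File('Z:/shared/db.h5', 'r') as f:
    print(list(f.keys()))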

Both of these methods work well for CSV and other file types, but each of them runs into problems with HDF5 files.

Besides the attempts above, I also tried:

  • SFTPDrive, but it seems it still has to download the whole file from the remote server.
  • a Samba share, but I could not establish the Windows connection successfully.

Please give me some advice on this problem. Thanks a lot.
