I have a dask dataframe that looks as follows:
         digitizer_get_current_savefile       motor_get_position  motor_goto_strip  nevts  biasvoltage  z-height  x-dist
0  /home/data/tct-waveforms/waveform...  [-76399, -20270, -1283]               1.0     15       -200.0     -1283  -76399
1  /home/data/tct-waveforms/waveform...  [-76404, -20270, -1283]               2.0     15       -200.0     -1283  -76404
2  /home/data/tct-waveforms/waveform...  [-76409, -20270, -1283]               3.0     15       -200.0     -1283  -76409
3  /home/data/tct-waveforms/waveform...  [-76414, -20270, -1283]               4.0     15       -200.0     -1283  -76414
4  /home/data/tct-waveforms/waveform...  [-76419, -20270, -1283]               5.0     15       -200.0     -1283  -76419
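For reproducibility, here is a small pandas frame with the same structure (the file paths are hypothetical placeholders, since the real ones are truncated in the printout above; the actual frame is the dask dataframe built from this):

```python
import pandas as pd

# Stand-in for the real frame; paths are placeholders, not my actual files.
pdf = pd.DataFrame({
    "digitizer_get_current_savefile": [
        "/home/data/tct-waveforms/waveform_0.h5",  # placeholder path
        "/home/data/tct-waveforms/waveform_1.h5",  # placeholder path
    ],
    "motor_get_position": [[-76399, -20270, -1283], [-76404, -20270, -1283]],
    "motor_goto_strip": [1.0, 2.0],
    "nevts": [15, 15],
    "biasvoltage": [-200.0, -200.0],
    "z-height": [-1283, -1283],
    "x-dist": [-76399, -76404],
})
# ddf = dd.from_pandas(pdf, npartitions=2)  # how I get the dask dataframe
```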
I want to leverage dask's single-machine parallelization and, in the next step, load the HDF5 data files located at the paths in the digitizer_get_current_savefile column in parallel.
For that, I have written this code:
import dask.dataframe as dd

channel = "CH0"

def extract_signal(row):
    # Read the HDF5 data file
    df_data = dd.read_hdf(row["digitizer_get_current_savefile"], key=channel)
    # Drop all columns in the data file that begin with "Time" (keeping only the amplitudes)
    df_data = df_data.loc[:, ~df_data.columns.str.startswith("Time")]
    return df_data.mean().max()

df["signal"] = df.apply(extract_signal, axis=1)
This does not work; it fails with:

OSError: File(s) not found: a

Somehow the file paths are not being recognized. I'm a beginner with dask, so please excuse me if I've made a simple mistake.
A pandas-based version of this code works fine.
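For reference, the working pandas version looks roughly like this (sketched from the dask code above, with pd.read_hdf in place of dd.read_hdf; it assumes df is a plain pandas DataFrame):

```python
import pandas as pd

channel = "CH0"  # HDF5 key, same as in the dask version

def extract_signal(row):
    # Read one waveform file eagerly with pandas
    df_data = pd.read_hdf(row["digitizer_get_current_savefile"], key=channel)
    # Drop the "Time*" columns, keeping only the amplitudes
    df_data = df_data.loc[:, ~df_data.columns.str.startswith("Time")]
    # Mean amplitude per column, then the largest of those means
    return df_data.mean().max()

# Applied row-wise to the pandas frame:
# df["signal"] = df.apply(extract_signal, axis=1)
```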