Unlike every other question I can find, I do not want to create a DataFrame from a homogeneous Numpy array, nor do I want to convert a structured array into a DataFrame.

What I want is to create a DataFrame from individual 1D Numpy arrays, one per column. I tried the obvious DataFrame({"col1": nparray1, "col2": nparray2}), but the constructor shows up at the top of my profile, so it must be doing something really slow.

It is my understanding that Pandas DataFrames are implemented in pure Python, where each column is backed by a Numpy array, so I would think there is an efficient way to do it.

What I'm actually trying to do is to fill a DataFrame efficiently from Cython. Cython has memoryviews that allow efficient access to Numpy arrays. So my strategy is to allocate a Numpy array, fill it with data and then put it in a DataFrame.

The opposite direction, creating a memoryview from a Pandas DataFrame column, works fine. So if there is a way to preallocate the entire DataFrame and then just pass the columns to Cython, that is also acceptable.

cdef int32_t[:] data_in = df['data_in'].to_numpy(dtype="int32")
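The preallocation idea can be sketched in plain Python as follows. This is only an illustration under assumptions: `fill_preallocated` is a hypothetical helper, the assignment `buf[:] = values` stands in for the Cython loop, and whether `to_numpy()` returns a writable view of the column depends on the pandas version and settings (with copy-on-write it may hand back a read-only array), so the sketch checks before writing and otherwise falls back to assigning a new column.

```python
import numpy as np
import pandas as pd


def fill_preallocated(df, column, values):
    """Write values into an existing column, in place when pandas allows it.

    Hypothetical helper, not part of pandas. Returns True if the write
    went straight into the DataFrame's own buffer, False if a fresh
    array had to be assigned to the column instead.
    """
    buf = df[column].to_numpy()
    # Only write through buf if it is writable and really aliases the column.
    in_place = bool(buf.flags.writeable) and np.shares_memory(
        buf, df[column].to_numpy()
    )
    if in_place:
        buf[:] = values  # stand-in for filling via a Cython memoryview
    else:
        df[column] = np.asarray(values, dtype=buf.dtype)
    return in_place


n_rows = 4
df = pd.DataFrame({"data_out": np.zeros(n_rows, dtype="int32")})
wrote_in_place = fill_preallocated(df, "data_out", [10, 20, 30, 40])
print(wrote_in_place)
print(df)
```

Whether the in-place branch is taken is worth printing on the target installation, since it decides whether a buffer handed to Cython would actually be the DataFrame's storage.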

A section of the profile of my code looks like this, where everything the code does is completely dwarfed by creating the DataFrame at the end.

         1100546 function calls (1086282 primitive calls) in 4.345 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.345    4.345 profile:0(<code object <module> at 0x7f4e693d1c90, file "test.py", line 1>)
    445/1    0.029    0.000    4.344    4.344 :0(exec)
        1    0.006    0.006    4.344    4.344 test.py:1(<module>)
     1000    0.029    0.000    2.678    0.003 :0(run_df)
     1001    0.017    0.000    2.551    0.003 frame.py:378(__init__)
     1001    0.018    0.000    2.522    0.003 construction.py:170(init_dict)

Corresponding code:

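def run_df(self, df):
    cdef int arx_rows = len(df)
    cdef int arx_idx

    # Typed memoryview over the input column (converted to int32 if needed)
    cdef int32_t[:] data_in = df['data_in'].to_numpy(dtype="int32")

    # Allocate the output as a NumPy array and view it through a memoryview
    data_out_np = np.zeros(arx_rows, dtype="int32")
    cdef int32_t[:] data_out = data_out_np

    # Call cpp_sec_par.run once per row
    for arx_idx in range(arx_rows):
        self.cpp_sec_par.run(data_in[arx_idx], data_out[arx_idx])

    # Wrap the filled array in a single-column DataFrame
    return pd.DataFrame({
        'data_out': data_out_np,
    })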
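The profile above shows about 1000 constructor calls at roughly 0.003 s each, so it may help to check whether that cost is a fixed per-call overhead or grows with the array size. Entries such as `:0(exec)` and `profile:0(...)` also look like output from the pure-Python `profile` module, which adds overhead to every Python-level call and may overstate the constructor's share. A minimal sketch using `timeit`, where `time_construction` and the row counts are assumptions chosen for illustration:

```python
import timeit

import numpy as np
import pandas as pd


def time_construction(n_rows, repeats=200):
    """Average seconds per pd.DataFrame({...}) call for one int32 column."""
    data = np.zeros(n_rows, dtype="int32")
    total = timeit.timeit(
        lambda: pd.DataFrame({"data_out": data}), number=repeats
    )
    return total / repeats


small = time_construction(10)
large = time_construction(100_000)
print(f"10 rows:     {small * 1e6:.1f} us per call")
print(f"100000 rows: {large * 1e6:.1f} us per call")
```

If both numbers come out similar, the time is mostly constructor bookkeeping rather than data copying, and reducing the number of DataFrame constructions would matter more than how the arrays are passed in.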

1 Answer

Arash On

May I suggest adding the columns one at a time? It might help with efficiency. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame()

col1 = np.array([1, 2, 3])
col2 = np.array([4, 5, 6])

df['col1'] = col1
df['col2'] = col2

>>> df
   col1  col2
0     1     4
1     2     5
2     3     6
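Whether this is actually faster probably depends on the pandas version and the number of columns, so it is worth measuring both variants. A rough sketch, in which the function names `from_dict` and `one_by_one` and the array sizes are made up for illustration:

```python
import timeit

import numpy as np
import pandas as pd

col1 = np.arange(1000, dtype="int32")
col2 = np.arange(1000, 2000, dtype="int32")


def from_dict():
    # Construct the frame in one call from a dict of arrays
    return pd.DataFrame({"col1": col1, "col2": col2})


def one_by_one():
    # Start empty and assign each column separately
    df = pd.DataFrame()
    df["col1"] = col1
    df["col2"] = col2
    return df


# Check that both approaches produce the same columns and values
a, b = from_dict(), one_by_one()
same = list(a.columns) == list(b.columns) and all(
    np.array_equal(a[c].to_numpy(), b[c].to_numpy()) for c in a.columns
)
print("same contents:", same)

t_dict = timeit.timeit(from_dict, number=200) / 200
t_cols = timeit.timeit(one_by_one, number=200) / 200
print(f"dict constructor: {t_dict * 1e6:.1f} us per call")
print(f"one by one:       {t_cols * 1e6:.1f} us per call")
```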