Excel worksheet to Numpy array


I'm trying to do an unbelievably simple thing: load parts of an Excel worksheet into a Numpy array. I've found a kludge that works, but it is embarrassingly unpythonic: say my worksheet was loaded as "ws", the code:

import numpy as np

A = np.zeros((37, 3))
for i in range(2, 39):      # worksheet rows 2..38
    for j in range(1, 4):   # worksheet columns 1..3 (A..C)
        A[i - 2, j - 1] = ws.cell(row=i, column=j).value

loads the contents of "ws" into array A.

There MUST be a more elegant way to do this. For instance, csvread handles this much more naturally, and while I could well convert the .xlsx file into a CSV one, the whole purpose of working with openpyxl was to avoid that conversion. So there we are, Collective Wisdom of the Mighty Intertubes: what's a more pythonic way to perform this conceptually trivial operation?

Thank you in advance for your answers.

PS: I operate Python 2.7.5 on a Mac via Spyder, and yes, I did read the openpyxl tutorial, which is the only reason I got this far.


There are 2 answers

4
paddyg On

You could do

A = np.array([[i.value for i in j] for j in ws['C1':'E38']])

EDIT - further explanation. (firstly thanks for introducing me to openpyxl, I suspect I will use it quite a bit from time to time)

  1. the method of getting multiple cells from the worksheet object produces a generator, at least in the openpyxl version I used (more recent releases return a tuple of rows here instead). A generator is probably much more efficient if you want to work your way through a large sheet, as you can start straight away without waiting for it all to load into a list.
  2. to force a generator to make a list you can either use list(ws['C1':'E38']) or a list comprehension as above
  3. each row is a tuple (even if only one column wide) of Cell objects. These carry a lot more than just a number, but if you only want the number for your array you can use the .value attribute. This is really the crux of your question: csv files don't contain the structured info of an Excel spreadsheet.
  4. there isn't (as far as I can tell) a built-in method for extracting values from a range of cells, so you will have to do something much like what you have sketched out.
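To make the points in this list concrete, here is a minimal sketch (Python 3, assuming openpyxl and NumPy are installed). The workbook is built in memory with made-up numbers purely so the snippet runs on its own; the names `block` and `first_row` are mine, not part of openpyxl.

```python
import numpy as np
from openpyxl import Workbook

# In-memory worksheet with made-up data, so no .xlsx file is needed.
wb = Workbook()
ws = wb.active
for r in range(1, 5):
    ws.append([r, r + 0.5, "text"])   # fills rows 1..4, columns A..C

# list() works whether the range comes back as a generator or a tuple.
block = list(ws["A2":"B4"])

first_row = block[0]
print(type(first_row).__name__)       # container type used for one row
print(first_row[0].coordinate)        # a Cell knows where it sits in the sheet
print(first_row[0].value)             # .value is the bare content

# Pull out just the values and let NumPy work out the shape.
A = np.array([[cell.value for cell in row] for row in block], dtype=float)
print(A.shape)
```

Note that the text column C is deliberately left out of the range; a string cell would make the conversion to a float array fail.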

The advantages of doing it my way are: no need to work out the dimensions of the array and create an empty one to start with, no need to work out the corrected index into the np array, and list comprehensions are usually faster than explicit loops. The disadvantage is that it needs the "corners" defined in "A1" format. If the range isn't known then you would have to use iter_rows, rows or columns:

A = np.array([[i.value for i in j[2:5]] for j in ws.rows])

If you don't know how many columns there are, then you will have to loop and check values, more like your original idea.
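For what it's worth, more recent openpyxl releases than the one discussed above offer a shortcut: iter_rows accepts values_only=True, and when the upper bounds are left out it runs to the last used row and column of the sheet. A minimal sketch, assuming such a version and Python 3; the worksheet is built in memory with made-up numbers, and skipping row 1 as a header mirrors the layout in the question:

```python
import numpy as np
from openpyxl import Workbook

# In-memory worksheet with made-up data: a header row plus 37 data rows.
wb = Workbook()
ws = wb.active
ws.append(["x", "y", "z"])            # row 1: header
for r in range(37):
    ws.append([r, r * 2.0, r * 3.0])  # rows 2..38: numbers

# values_only=True yields tuples of plain values instead of Cell objects.
# No max_row / max_col given, so the used extent of the sheet is taken.
rows = ws.iter_rows(min_row=2, values_only=True)
A = np.array(list(rows), dtype=float)

print(A.shape)
```

If the region contains text cells, the dtype=float conversion will fail, so either restrict the columns with min_col / max_col or drop the dtype argument.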

0
Padix Key On

If you don't need to load data from multiple files in an automated manner, the package tableconvert I recently wrote may help. Just copy and paste the relevant cells from the excel file into a multiline string and use the convert() function.

import numpy as np
from tableconvert.converter import convert

array = convert("""
123    456    3.14159
SOMETEXT    2,71828    0
""")

print(type(array))
print(array)

Output:

<class 'numpy.ndarray'>
[[ 123.       456.         3.14159]
 [       nan    2.71828    0.     ]]