Google Colab - pandas/pyplot will only accept column references not titles

312 views Asked by At

I've opened a Google Sheet in Colab using gspread

document = gc.open_by_url('https://docs.google.com/myspreadsheet')

sheet = elem.worksheet('Sheet1')

data = sheet.get_all_values()

df = pd.DataFrame(data)

The document contains element data and a print of head() looks like this:


             0          1   ...              26               27
0  AtomicNumber    Element  ...  NumberofShells  NumberofValence
1             1   Hydrogen  ...               1                1
2             2     Helium  ...               1                 
3             3    Lithium  ...               2                1
4             4  Beryllium  ...               2                2

The problem I have is that when I try to reference by title, for example:

df.plot(x = 'AtomicNumber', y= 'AtomicMass', kind = 'scatter')

I get an error. I have also tried:

df.plot(x = df.AtomicNumber, y= df.AtomicMass, kind = 'scatter')

and

df.plot(x = df['AtomicNumber'], y= df['AtomicMass'], kind = 'scatter')

but I have no joy either. Unless I am using the column references like so:

df.plot(x = 0, y= 17, kind = 'scatter')

I get nothing. It will get tiring pretty fast if I have to keep referencing the .csv file to figure out which column reference I need!!

Finally, when I print:

df.columns.values

I get:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27])

I can't seem to not get this - even if I try to create a new dataframe that contains every row of df bar row index 0

I'm pretty new with this so I'm sure it's pretty simple, but I've hit an impasse... Help!

2

There are 2 answers

0
ImrikStrider On

I've just figured one solution out, which I'm happy with so I'll mark this question as resolved.

The problem seems to be from the way I was creating my dataframe:

data = sheet.get_all_values()

df = pd.DataFrame(data)

If I instead use the 'Get_all_records()' function then the dataframe is make without a seemingly non-removable column reference numbers as titles (see below)

df = pd.DataFrame(raw.get_all_records())

when I print the head() of this dataframe I get:

   AtomicNumber    Element Symbol  ...  SpecificHeat  NumberofShells  NumberofValence
0             1   Hydrogen      H  ...        14.304               1                1
1             2     Helium     He  ...         5.193               1                 
2             3    Lithium     Li  ...         3.582               2                1
3             4  Beryllium     Be  ...         1.825               2                2
4             5      Boron      B  ...         1.026               2                3

and when I then call df.columns.values, I get:

array(['AtomicNumber', 'Element', 'Symbol', 'AtomicMass',
       'NumberofNeutrons', 'NumberofProtons', 'NumberofElectrons',
       'Period', 'Group', 'Phase', 'Radioactive', 'Natural', 'Metal',
       'Nonmetal', 'Metalloid', 'Type', 'AtomicRadius',
       'Electronegativity', 'FirstIonization', 'Density', 'MeltingPoint',
       'BoilingPoint', 'NumberOfIsotopes', 'Discoverer', 'Year',
       'SpecificHeat', 'NumberofShells', 'NumberofValence'], dtype=object)

I'm going to do a little dive into the documentation of gspread now and try to figure out what the distinction is between get_all_values and get_all_records, but I'm so happy to have figured it out! :-)

1
Alexandra Dudkina On

It seems, that sheet file contains as a first row numbers, which are used as column names. You need to drop first row before converting sheet to dataframe. I'm not very familiar with gspread API, but suppose that following should work:

 data = sheet.get_all_values()[1:]

After that it should be possible to address column by names.