Python 3.4 reading from CSV formats


I have Python code that imports data from a CSV file. The problem is that some columns in the file aren't plain numbers: one column is text, either "INT" or "EXT", and another is in o'clock format, ranging from "0:00" to "11:59". A third column is an ordinary number, a distance in "00.00" format.

My question is: how do I plot distance against o'clock time, and color the scatter plot dots according to whether each row is INT or EXT?

My first problem is getting the program to read the o'clock format and the text columns from the CSV.

Any ideas or suggestions? Thanks in advance.

Here is a sample of the CSV I'm trying to import:

ML  INT  .10  534.15  0:00
ML  EXT  .25  654.23  3:00
ML  INT  .35  743.12  6:30

I want to plot the 4th column on the x axis and the 5th column on the y axis. I also want to color-code the scatter plot dots red or blue depending on whether the row is INT or EXT.

Here is a sample of the code I have so far:

import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np

style.use('ggplot')

a,b,c,d = np.loadtxt('numbers.csv',
                unpack = True,
                delimiter = ',')



plt.scatter(a,b)




plt.title('Charts')
plt.ylabel('Y Axis')
plt.xlabel('X Axis')

plt.show()

There are 2 answers

8
Scott On BEST ANSWER

Reading in from your example csv using pandas:

import pandas as pd
import matplotlib.pyplot as plt
import datetime

data = pd.read_csv('data.csv', sep='\t', header=None)
print(data)

prints:

    0    1     2       3     4
0  ML  INT  0.10  534.15  0:00
1  ML  EXT  0.25  654.23  3:00
2  ML  INT  0.35  743.12  6:30
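
If the file is separated by runs of spaces rather than tabs (the sample in the question could be either), a regular-expression separator is one way to cover both cases. This is a minimal sketch, assuming the three sample rows from the question; the in-memory `sample` buffer is a hypothetical stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the three sample rows from the question.
sample = io.StringIO(
    "ML  INT  .10  534.15  0:00\n"
    "ML  EXT  .25  654.23  3:00\n"
    "ML  INT  .35  743.12  6:30\n"
)

# sep=r'\s+' splits on any run of spaces or tabs; header=None because the
# file has no header row, so the columns are labelled 0..4.
data = pd.read_csv(sample, sep=r'\s+', header=None)

print(data.dtypes)  # text columns stay as strings, numeric ones become floats
```

The o'clock column is read as plain text here, so it still needs converting before it can be used as a numeric axis.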

Then separate the 'INT' from the 'EXT':

ints = data[data[1]=='INT']
exts = data[data[1]=='EXT']

Convert the times to datetime.time objects and grab the distances:

# parse each 'H:MM' string, then keep only the time-of-day part
int_times = [datetime.datetime.strptime(t, '%H:%M').time() for t in ints[4]]
ext_times = [datetime.datetime.strptime(t, '%H:%M').time() for t in exts[4]]
int_dist = list(ints[3])
ext_dist = list(exts[3])

then plot a scatter plot for 'INT' and 'EXT' each:

fig, ax = plt.subplots()
ax.scatter(int_dist, int_times, c='orange', s=150)
ax.scatter(ext_dist, ext_times, c='black', s=150)
plt.legend(['INT', 'EXT'], loc=4)
plt.xlabel('Distance')
plt.show()
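
The question asked for red and blue dots. One way to get there, offered as a sketch rather than as this answer's own method, is to map the INT/EXT column to color names and hand the result to a single scatter call. The `color_for` dictionary and the small in-memory frame are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from the CSV (columns 1 and 3 only).
data = pd.DataFrame({1: ['INT', 'EXT', 'INT'], 3: [534.15, 654.23, 743.12]})

# Assumed color choice, following the red/blue request in the question.
color_for = {'INT': 'red', 'EXT': 'blue'}
colors = data[1].map(color_for).tolist()

print(colors)  # one color name per row, in row order
```

The list can then be passed as c=colors to ax.scatter. The trade-off is that a single scatter call does not produce one legend entry per category, which is why plotting INT and EXT separately is often simpler.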


EDIT: Adding code to answer a question in the comments regarding how to change the time to 12 hour format (ranging from 0:00 to 11:59) and strip the seconds.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('data.csv', header=None)
ints = data[data[1]=='INT']
exts = data[data[1]=='EXT']
INT_index = data[data[1]=='INT'].index
EXT_index = data[data[1]=='EXT'].index
time = [t for t in data[4]]
int_dist = [d for d in ints[3]]
ext_dist = [d for d in exts[3]]

fig, ax = plt.subplots()
ax.scatter(int_dist, INT_index, c='orange', s=150)
ax.scatter(ext_dist, EXT_index, c='black', s=150)
ax.set_yticks(np.arange(len(data[4])))
ax.set_yticklabels(time)
plt.legend(['INT', 'EXT'], loc=4)
plt.xlabel('Distance')
plt.ylabel('Time')
plt.show()


0
Scott On

I have worked out another answer to this, but will leave the original up, as I believe it's still good; it just doesn't exactly answer your particular question.

I also generated a few more rows of data to make the problem, at least on my end, a bit more meaningful.

What solved this for me was generating an extra column, labelled 5 (in code, not in the CSV), which holds the number of minutes corresponding to each o'clock time, e.g. 11:59 maps to 719 min. Using pandas, I inserted this new column into the dataframe. I could then place string tick labels for every hour ('0:00', '1:00', etc.) at every 60 min.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('Workbook2.csv', header=None)
print(data)

Prints my faked data:

    0    1     2       3      4
0  ML  INT  0.10  534.15   0:00
1  ML  EXT  0.25  654.23   3:00
2  ML  INT  0.30  743.12   6:30
3  ML  EXT  0.35  744.20   4:30
4  ML  INT  0.45  811.47   7:00
5  ML  EXT  0.55  777.90   5:45
6  ML  INT  0.66  854.70   7:54
7  ML  EXT  0.74  798.40   6:55
8  ML  INT  0.87  947.30  11:59 

Now make a function to convert o'clock to minutes:

def convert_to_min(o_clock):
    h, m = o_clock.split(':')
    return int(h) * 60 + int(m)

# using this function, create a list of times in minutes for each time in col 4
min_col = [convert_to_min(t) for t in data[4]]
data[5] = min_col  # inserts this list as a new column '5'
print(data)

Our new data:

    0    1     2       3      4    5
0  ML  INT  0.10  534.15   0:00    0
1  ML  EXT  0.25  654.23   3:00  180
2  ML  INT  0.30  743.12   6:30  390
3  ML  EXT  0.35  744.20   4:30  270
4  ML  INT  0.45  811.47   7:00  420
5  ML  EXT  0.55  777.90   5:45  345
6  ML  INT  0.66  854.70   7:54  474
7  ML  EXT  0.74  798.40   6:55  415
8  ML  INT  0.87  947.30  11:59  719
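
As an alternative sketch to the list comprehension, the same minutes column can be computed with pandas string methods, which avoids the explicit Python loop. The `times` series below is a hypothetical stand-in for data[4], and the result should match what convert_to_min gives for the same inputs.

```python
import pandas as pd

# Hypothetical stand-in for the o'clock column (data[4] in the answer).
times = pd.Series(['0:00', '3:00', '6:30', '11:59'])

# Split 'H:MM' into two text columns, convert both to integers,
# then combine hours and minutes into total minutes.
parts = times.str.split(':', expand=True).astype(int)
minutes = parts[0] * 60 + parts[1]

print(minutes.tolist())
```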

Now build the x and y axis data, the ticklabels, and the tick locations:

INTs = data[data[1]=='INT']
EXTs = data[data[1]=='EXT']

int_dist = INTs[3]  # x-axis data for INT
ext_dist = EXTs[3]

# plotting time as minutes in the range [0, 720]
int_time = INTs[5]  # y-axis data for INT
ext_time = EXTs[5]

time = ['0:00', '1:00', '2:00', '3:00', '4:00', '5:00', 
        '6:00', '7:00', '8:00', '9:00', '10:00', '11:00', '12:00']
# this will place the strings above at every 60 min
tick_location = [t*60 for t in range(13)]

Now plot:

fig, ax = plt.subplots()
ax.scatter(int_dist, int_time, c='orange', s=150)
ax.scatter(ext_dist, ext_time, c='black', s=150)
ax.set_yticks(tick_location)
ax.set_yticklabels(time)
plt.legend(['INT', 'EXT'], loc=4)
plt.xlabel('Distance')
plt.ylabel('Time')
plt.title('Seems to work...')
plt.show()
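
The hard-coded label list works for a fixed 0:00 to 12:00 range. A more general sketch, not part of the original answer, lets matplotlib generate the labels with FuncFormatter and place a tick every 60 minutes with MultipleLocator. The helper name `minutes_to_oclock` and the three data points are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator


def minutes_to_oclock(minutes, pos=None):
    """Format a minute count as 'H:MM' (hypothetical helper)."""
    h, m = divmod(int(round(minutes)), 60)
    return '{}:{:02d}'.format(h, m)


fig, ax = plt.subplots()
# Hypothetical points: distance on x, time in minutes on y.
ax.scatter([534.15, 654.23, 743.12], [0, 180, 390], c='orange', s=150)
ax.set_ylim(0, 720)
ax.yaxis.set_major_locator(MultipleLocator(60))          # one tick per hour
ax.yaxis.set_major_formatter(FuncFormatter(minutes_to_oclock))
```

Calling plt.show() afterwards (with the Agg line removed) displays the figure. Because the labels are computed from the tick positions, they stay correct if the axis limits change.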
