handling data set with horrible date time data

305 views Asked by At

I have a huge sensor data set and working on Python. The problem is their date formats. Basically,this is how the date columns look like;

07/ 7/15 06:51

07/ 7/15 06:53

07/ 7/15 06:55

07/ 7/15 06:57

07/ 7/15 06:59

2015-07-07 07:00:46.047

07/ 7/15 07:03

07/ 7/15 07:05

07/ 7/15 07:07

07/ 7/15 07:09

07/ 7/15 07:11

07/ 7/15 07:13

2015-07-07 07:15:53.007

2015-11-14 23:33:43.000

2015-11-14 23:35:44.000

2015-11-14 23:37:43.000

2015-11-14 23:39:43.000

2015-11-14 23:41:43.000

11/14/15 23:42

2015-11-14 23:45:43.000

11/14/15 23:46

2015-11-14 23:49:43.000

2015-11-14 23:51:44.000

I am going to parse dates to use weekdays, weekends and as an extra maybe I will turn them a Julian date format (which uses numbers 1 to 365 instead of regular dates).

I had tried to :

  • Parsing dates while I am reading csv

  • Date until parser; e.g. dateutil.parser.parse(x)

  • Datetime.strptime

but none of them worked. I still cannot parse dates. These data in 10 part excel files.

When I read them with pd.read_csv(......, parse_dates('date')), it reads date columns as 'object' in some files and as 'datetime64' format in other files. But even if the files with format 'datetime64' date's cannot parse and it gives an error :

"Unknown String Format".

Any idea would help!

3

There are 3 answers

8
Sean Parsons On BEST ANSWER

Your probably going to have to munge this with several approaches I haven't done a significant amount of testing but I was able to convert 2 of your different dates(07/ 7/15 06:51, 2015-11-14 23:45:43.000) to datetime objects using:

datetime.datetime.fromtimestamp(dateutil.parser.parse(date).timestamp())

The date parameter inside the parser.parse method would be the varied string format you have for dates.

There might be a better way to do this but try using this approach as a lambda method on the date column to see the result.

df_date = df['date'].apply(lambda d: datetime.datetime.fromtimestamp(dateutil.parser.parse(d).timestamp())
0
Scott Hunter On

The issue seems to be that your data is in a variety of different formats. So instead of hoping some package can deduce the right format, you could try parsing using each of the different formats you expect, and take the one that works.

2
3novak On

If we make the assumption that the formats presented are the only two formats used, the following may do the trick. Just read the data in as strings and then we'll parse from there.

import pandas as pd

df = pd.DataFrame({'date': ['07/7/15 06:51', '07/7/15 06:59', '2015-07-07 07:00:46.047',
                            '11/14/15 23:42', '2015-11-14 23:45:43.000']})

# mask the df based on the date formats
dash_mask = df['date'].str.contains('-')
slash_mask = df['date'].str.contains('/')

# use the masks to apply pd.to_datetime() to only one format at a time
df.loc[dash_mask, 'datetime'] = pd.to_datetime(df.loc[dash_mask, 'date'],
                                               infer_datetime_format=True)
df.loc[slash_mask, 'datetime'] = pd.to_datetime(df.loc[slash_mask, 'date'],
                                               infer_datetime_format=True)

>>> df['datetime'].dt.date
0    2015-07-07
1    2015-07-07
2    2015-07-07
3    2015-11-14
4    2015-11-14

Of course, this could be turned into a function and could accommodate more date formats using a similar approach, but this should get the job done. I'll grant it isn't pretty...

By the way, if you only care about the date and not the time during the day, you could strip that part out if it's causing problems in parsing the dates.

df['only_date'] = df['date'].str.split(' ').str[0]
>>> df
                      date   only_date
0            07/7/15 06:51     07/7/15
1            07/7/15 06:59     07/7/15
2  2015-07-07 07:00:46.047  2015-07-07
3           11/14/15 23:42    11/14/15
4  2015-11-14 23:45:43.000  2015-11-14