Best way to search for number of hours in the file

I have a log file containing a timestamp and a data value on each line (delimited with ','). I would like a Python script that parses the log file and counts the number of data points occurring in each hour.

Here's the example of the log file content:

2015-06-18 09:04:04.377,54954.418
2015-06-18 09:04:48.863,54965.438
2015-06-18 09:05:29.080,49.813
2015-06-18 09:06:04.697,45.187
2015-06-18 09:06:40.719,45.238
2015-06-18 09:07:09.693,38.768
2015-06-18 09:07:35.856,36.315
2015-06-18 09:08:06.961,39.789
2015-06-18 09:08:33.241,36.147
2015-06-18 09:09:02.801,38.473
2015-06-18 09:09:36.559,44.839
2015-06-18 09:10:13.222,46.165
2015-06-18 09:10:47.867,44.115
2015-06-18 09:11:25.807,46.985
2015-06-18 09:12:00.512,43.607
2015-06-18 09:12:37.513,46.552
2015-06-18 09:13:10.408,41.507
2015-06-18 10:13:44.107,43.269
2015-06-18 10:14:20.501,47.001
2015-06-18 10:15:00.061,52.589
2015-06-18 11:15:33.501,42.148
2015-06-18 11:16:07.558,43.919
2015-06-18 11:16:41.851,43.369
2015-06-18 11:17:15.159,43.336
2015-06-18 11:17:47.217,40.965
2015-06-18 11:18:23.135,44.12
2015-06-18 11:18:55.547,41.432
2015-06-18 12:19:32.362,45.522
2015-06-18 12:20:04.456,42.339
2015-06-18 12:20:36.559,40.555
2015-06-18 12:21:08.409,40.534
2015-06-18 12:21:38.170,38.706
2015-06-18 12:22:09.108,38.653
2015-06-18 12:22:34.420,33.234
2015-06-18 12:23:01.319,35.665

So for this example, 9am has a total of 17, 10am has 3, and so on. How can I go about doing that?

There are 5 answers

sachin saxena

This can be done easily using pandas:

import pandas as pd

# The log has no header row, so name the columns explicitly.
data = pd.read_csv('log.csv', header=None, names=['time', 'value'])
data['time'] = pd.to_datetime(data['time'])
data = data.set_index('time')
hour_count = data['value'].resample('1h').count()
Peter Wood

You can use collections.Counter, which is like a histogram.

You're really only interested in the first 13 characters of each line, which you can slice out, e.g.:

>>> line = '2015-06-18 09:11:25.807,46.985'
>>> line[:13]
'2015-06-18 09'

Putting it together:

data = """2015-06-18 09:11:25.807,46.985
2015-06-18 09:12:00.512,43.607
2015-06-18 09:12:37.513,46.552
2015-06-18 09:13:10.408,41.507
2015-06-18 10:13:44.107,43.269
2015-06-18 10:14:20.501,47.001
2015-06-18 10:15:00.061,52.589
2015-06-18 11:15:33.501,42.148
2015-06-18 11:16:07.558,43.919"""

from collections import Counter
c = Counter(line[:13] for line in data.split('\n'))
print(c)

Output:

Counter({'2015-06-18 09': 4, '2015-06-18 10': 3, '2015-06-18 11': 2})
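
For the real log the same one-liner works on a file handle. A minimal sketch, using an in-memory `io.StringIO` with a few lines from the question in place of `open('log.csv')`:

```python
from collections import Counter
import io

# Stand-in for open('log.csv'); a few sample lines from the question.
log = io.StringIO(
    "2015-06-18 09:04:04.377,54954.418\n"
    "2015-06-18 09:04:48.863,54965.438\n"
    "2015-06-18 10:13:44.107,43.269\n"
)

# Slice the 'YYYY-MM-DD HH' prefix of each line and count occurrences.
counts = Counter(line[:13] for line in log if line.strip())

for hour, n in sorted(counts.items()):
    print(hour, n)
```

Iterating the handle directly avoids loading the whole file into memory.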
Martin Evans

The following should work using plain Python and no additional libraries. It is also more suitable if your CSV file is huge, since you would not want to load the whole file into memory.

sHour = ""
nThisHour = 1

with open('log.csv') as ff:
    for line in ff:
        sCurHour = line[11:13]

        if sHour == sCurHour:
            nThisHour += 1
        else:
            if sHour:
                print(nThisHour)

            nThisHour = 1
            sHour = sCurHour

    print(nThisHour)

This gives the following output, which will be in the same order as the input:

17
3
7
8

The line slice could be expanded (e.g. to line[:13]) if the date is also significant, for example when the log spans more than one day.
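
This consecutive-run counting can also be expressed with itertools.groupby, which bundles adjacent items sharing a key. A sketch with a few of the question's lines inlined:

```python
from itertools import groupby

# A few consecutive sample lines from the question's log.
lines = [
    "2015-06-18 09:04:04.377,54954.418",
    "2015-06-18 09:04:48.863,54965.438",
    "2015-06-18 10:13:44.107,43.269",
]

# groupby yields one group per run of lines whose hour field (chars 11-12) matches.
counts = [(hour, sum(1 for _ in group))
          for hour, group in groupby(lines, key=lambda line: line[11:13])]
print(counts)
```

Like the loop above, this only merges *consecutive* lines, so output order matches input order.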

LittleQ

If duplicate data values within the same hour should each be counted, so that:

2015-06-18 09:06:04.697,45.187
2015-06-18 09:06:40.719,45.187

count as two occurrences.

The simplest way:

from collections import defaultdict

d = defaultdict(int)  # int, not list, since we are counting occurrences
with open(file) as f:
    for line in f:
        d[line.strip()[:13]] += 1
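
Run against two inlined sample lines with the same value in the same hour (standing in for the file), the dict-counting idea looks like:

```python
from collections import defaultdict

# Two lines in the same hour with the same data value; both should count.
lines = [
    "2015-06-18 09:06:04.697,45.187",
    "2015-06-18 09:06:40.719,45.187",
]

d = defaultdict(int)  # maps 'YYYY-MM-DD HH' -> running count
for line in lines:
    d[line.strip()[:13]] += 1

print(dict(d))
```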
Ali SAID OMAR

Here is a more complete API that handles day/hour/minute/second/millisecond counters, given the path to the log file.

from collections import defaultdict, Counter
import re

def _get(pattern, line):
    return re.findall(pattern, line)

def get(infile, days=False, hours=True, mils=False, min_=False, sec=False):
    days_pattern = r"\d{4}-\d{1,2}-\d{1,2}"
    days_hours_pattern = days_pattern + r"\s?\d{1,2}"
    days_min_pattern = days_pattern + r"\s?\d{1,2}:\d{1,2}"
    day_hours_min_s_pattern = days_pattern + r"\s?\d{1,2}:\d{1,2}:\d{1,2}"
    day_hours_min_ms_pattern = day_hours_min_s_pattern + r'\.\d+,\d+'

    result = dict()
    result['days'] = defaultdict(list)
    result['hours'] = defaultdict(list)
    result['ms'] = defaultdict(list)
    result['min'] = defaultdict(list)
    result['sec'] = defaultdict(list)

    with open(infile) as fh:
        for line in fh:
            if days:
                for cdays in _get(days_pattern, line):
                    result['days'][cdays].append(cdays)
            if hours:
                for chour in _get(days_hours_pattern, line):
                    result['hours'][chour].append(chour)
            if min_:
                for cmin in _get(days_min_pattern, line):
                    result['min'][cmin].append(cmin)
            if sec:
                for csec in _get(day_hours_min_s_pattern, line):
                    result['sec'][csec].append(csec)
            if mils:
                for cmils in _get(day_hours_min_ms_pattern, line):
                    result['ms'][cmils].append(cmils)
    summary = dict()
    for k in result:
        for i in result[k]:
            summary[i] = Counter(result[k][i])
    return result, summary

fin = "./in.txt"
result, summary = get(fin, days=True, mils=True, min_=True, hours=True, sec=True)

# works
summary['2015-06-18']
summary['2015-06-18 09']
summary['2015-06-18 09:04']
summary['2015-06-18 09:04:04']
summary["2015-06-18 09:04:04.377,54954"]
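
Instead of matching the timestamp with regular expressions, it can also be parsed with datetime.strptime and truncated to the hour. A sketch with a few of the question's lines inlined in place of the file:

```python
from collections import Counter
from datetime import datetime

lines = [
    "2015-06-18 09:04:04.377,54954.418",
    "2015-06-18 09:04:48.863,54965.438",
    "2015-06-18 10:13:44.107,43.269",
]

counts = Counter()
for line in lines:
    # Parse the timestamp field, then truncate it to the hour.
    ts = datetime.strptime(line.split(',')[0], "%Y-%m-%d %H:%M:%S.%f")
    counts[ts.replace(minute=0, second=0, microsecond=0)] += 1

for hour, n in sorted(counts.items()):
    print(hour.strftime("%Y-%m-%d %H:00"), n)
```

Parsing validates each timestamp as a side effect, whereas slicing or regex matching silently accepts malformed lines.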