I am using a csv file extracted from a database of bus records. I want to find and save all the records which correspond to a same user, identified by the same deviceID,from the huge list. There are about 300000 deviceID, with around 3 trips per device. So I want to generate a file that allows given a deviceID, find all the trips the user has undertaken, to find the price, date, etc for the whole year.

I've tried using pandas, and ordering with a dict, but because the keys aren't the usual format, it generates errors.

with open('Salida1.csv',newline='', mode='r') as csv_file:
csv_reader = csv.reader(csv_file, delimiter=',')
#rows1 = list(csv_reader)
#print(len(rows1))
line_count = 0
for row in csv_reader:
    if line_count == 0:
        print(f'Column names are {", ".join(row)}')
        line_count += 1
    else:        
        mydict5ordenado.append(list(row))
        line_count += 1
print(f'Processed {line_count} lines.')
print(row)
print(mydict5ordenado[940000][20])

The deviceID is in row(20).

Column names are id, idapp, date_ws, date_time, date_departure, date_arrival, localizador, netPrice, insurancePrice, source, environment, pay_pal, origin, destination, numPassengers, direction, busType, device, model, version, deviceID, notificationID, newBackOffice, devi

from mydict5ordenado I should be able to find all records which have duplicate deviceID's and be able to save them in a file, or in a dictionary. For example mydictordenado[940000] contains the following information

['9265551', '5712', '2019-03-31', '2019-03-31 03:16:32', '2019-03-31
 18:15:00', '2019-03-31 19:45:00', '1dpyov0', '9.42', '0.00', 
'ANDROID_APP', 'PRO', '0', '344', '9084', '1', 'IDA', 'UNKNOWN', 'WAS-LX1A', 
'WAS-LX1A', '6.3.0', '60501c9494b01f78', '0', '2', '26', '0', '', '9103000011625750'

We want to find all trips related to phone with id - 60501c9494b01f78, in the file.

1 Answers

1
snakecharmerb On Best Solutions

You could use a defaultdict to collect the rows for each device id in a list, with the device id as the key.

Something like this ought to work:

import collections

devicedict = collections.defaultict(list)

with open('Salida1.csv',newline='', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    next(csv_reader)   # Skip header row
    for row in csv_reader:
        key = row[20]
        devicedict[key].append(row)

Given this example defaultdict:

>>> d
defaultdict(<class 'list'>, {'A': [['A', 1, 2], ['A', 5, 6]], 'B': [['B', 4, 5]]})

To find the device with the most trips:

>>> max(d.items(), key=lambda i: len(i[1]))
('A', [['A', 1, 2], ['A', 5, 6]])

A defaultdict isn't orderable, but you can create a collections.OrderedDict ordered by the number of trips:

>>> od = collections.OrderedDict(sorted(d.items(), key=lambda i: len(i[1])))
>>> od
OrderedDict([('B', [['B', 4, 5]]), ('A', [['A', 1, 2], ['A', 5, 6]])])

If your code runs on Python 3.7+ you can use a normal dict instead of an OrderedDict:

>>> dict(sorted(d.items(), key=lambda i: len(i[1])))
{'B': [['B', 4, 5]], 'A': [['A', 1, 2], ['A', 5, 6]]}

If you only need the trip data, you operate only on the defaultdict's values:

>>> max(d.values(), key=len)
[['A', 1, 2], ['A', 5, 6]]

>>> sorted(d.values(), key=len)
[[['B', 4, 5]], [['A', 1, 2], ['A', 5, 6]]]