A method to remove duplicates from a multi-dimensional array by comparing specific values

I am writing a Python script for data preprocessing. The data in question is read and stored within the script as a multi-dimensional array consisting of data points similar to the ones below.

```
[['United', '-27.654379', '152.917741', 'e10', '1459', '2019-03-18'],
 ['United', '-27.654379', '152.917741', 'e10', '1449', '2019-03-19']]
```

Currently I need to remove entries within the array that have identical dates, so that

```
[['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
 ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]
```

Would become

```
[['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16']]
```

My current method of doing so (shown below) appears to identify and remove entries with duplicate dates, but some can still be found within the output.

```
for line in Data_text:
    for row in Data_text:
        if line[5] == row[5]:
            Data_text.remove(row)
```

Any insight into the faults in my algorithm and/or a better way of doing it would be greatly appreciated.

---

Using pure Python, you can leverage the power of `set` to work in this case:

```
lst = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
       ['Costco', '-27.213607', '152.996416', 'e10', '1297', '2019-03-16']]

seen = set()
print([x for x in lst if not (x[5] in seen or seen.add(x[5]))])

# [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16']]
```
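As for why the original loop misses some duplicates: `list.remove` mutates the list while both loops are iterating over it, so after a removal the next element shifts into the current index and is skipped (and since `line` is also in `Data_text`, every line matches itself on the first comparison). A minimal demonstration of the skipping behaviour:

```python
# Removing from a list while iterating over it skips elements:
# after remove(), the following item shifts into the current
# position and the iterator steps right past it.
nums = [1, 1, 1, 1]
for n in nums:
    if n == 1:
        nums.remove(n)
print(nums)  # [1, 1], not the empty list you might expect
```

This is why the set-based comprehension above, which builds a new list instead of mutating the one being iterated, behaves correctly.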
---

With Python 3.7+ (where dicts preserve insertion order), the code below just works. Note, however, that it keeps the *last* record for each date.

```
data = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
        ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

data = list({item[5]: item for item in data}.values())
# [['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]
```
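If you need the *first* record per date instead, the same dict idea works with `setdefault`, which only stores a value the first time its key appears (a small sketch, not part of the original answer):

```python
data = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
        ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

first_per_date = {}
for item in data:
    # setdefault only inserts when the date key is absent,
    # so the first record seen for each date wins.
    first_per_date.setdefault(item[5], item)

data = list(first_per_date.values())
# [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16']]
```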
---

You might want to consider pandas for this type of data and operations:

```
a = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
     ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

import pandas as pd

df = pd.DataFrame(a).drop_duplicates(5, keep='first')
```

Result:

```
df

        0           1           2    3     4           5
0  Costco  -27.213607  152.996416  e10  1237  2019-03-16
```

This is especially useful if the dates have different formats:

```
a2 = [['Costco', '-27.213607', '152.996416', 'e10', '1237', 'March 16, 2019'],
      ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

df = pd.DataFrame(a2)
df[5] = pd.to_datetime(df[5])
df.drop_duplicates(5, keep='first')
```

This still gives the correct result:

```
        0           1           2    3     4          5
0  Costco  -27.213607  152.996416  e10  1237 2019-03-16
```
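The `keep` parameter of `drop_duplicates` also accepts `'last'` (keep the final row for each date) or `False` (drop every row whose date is duplicated), so which record survives is a one-word change. A quick sketch:

```python
import pandas as pd

a = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
     ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

# keep='last' retains the final row for each duplicated date;
# keep=False would discard both rows in this example.
df = pd.DataFrame(a).drop_duplicates(5, keep='last')
print(df)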
---

Please try this: create a new `result_list = []` and append only the records whose date has not been seen yet.

```
result_list = []
for line in Data_text:
    is_exist = False
    for row in result_list:
        if line[5] == row[5]:
            is_exist = True
            break

    if not is_exist:
        result_list.append(line)

print(result_list)
```
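The inner scan over `result_list` makes this approach quadratic in the number of rows. Tracking already-seen dates in a `set` gives the same first-occurrence behaviour in linear time; here is a sketch using the sample data from the question (`dedupe_by_date` is an illustrative name, not from any answer above):

```python
def dedupe_by_date(rows, date_index=5):
    """Keep only the first row for each distinct date."""
    seen = set()
    result = []
    for row in rows:
        if row[date_index] not in seen:
            # Set membership is O(1), so the whole pass is O(n).
            seen.add(row[date_index])
            result.append(row)
    return result

rows = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
        ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]
print(dedupe_by_date(rows))
# [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16']]
```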