I have a dataframe with an article_id column (df['article_id']). I'm using the to_sql function with sqlalchemy to insert it into my database. However, sometimes I have duplicate records that I want to remove before inserting.

This is my list:

usedIDs = []
select_st = select([article_table])
res = conn.execute(select_st)
for _row in res:
    clean = int(_row[1])
    usedIDs.append(clean)

usedIDs

With output:

[1202623831,
 1747352473,
 1748645480,
 1759957596,
 1811054956,
 1812183879,
 1816974229,
 2450784233,
 2579244390,
 2580336884]

What I've tried:

df[~df.isin(usedIDs)]
df.drop(usedIDs, axis=0)

Neither of these works. However, when I hardcode the IDs as below, it does work:

df = df[~df.article_id.isin(['1202623831','1747352473'])]

The error is either "unhashable" or "KeyError: not found in axis".

How can I drop the rows from my dataframe where df['article_id'] is in usedIDs list?

1 Answer

ashish14 On Best Solutions

Using isin on the column is enough. For example, on some sample data:

df
    one date
0   1   2019-05-10 06:00:16
1   2   2019-05-10 06:30:21
2   3   2019-05-10 07:00:03
3   4   2019-05-10 06:32:43
4   5   2019-05-10 07:33:31
5   6   2019-05-10 07:37:39:09
6   7   2019-05-10 07:49:01
7   8   2019-05-10 08:52:05
8   9   2019-05-10 08:29:44:10

df = df[~df.one.isin([1,2])]

df
    one date
2   3   2019-05-10 07:00:03
3   4   2019-05-10 06:32:43
4   5   2019-05-10 07:33:31
5   6   2019-05-10 07:37:39:09
6   7   2019-05-10 07:49:01
7   8   2019-05-10 08:52:05
8   9   2019-05-10 08:29:44:10

Your hardcoded version works because the IDs in it are strings, whereas usedIDs holds ints (you build it with int(_row[1])):

df = df[~df.article_id.isin(['1202623831','1747352473'])]
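A small sketch of the mismatch, assuming article_id is stored as strings in the dataframe (the sample frame and the names used_ints, mask_int and mask_str are made up for illustration):

```python
import pandas as pd

# Hypothetical sample: article_id holds strings, as the hardcoded example suggests.
df = pd.DataFrame({"article_id": ["1202623831", "1747352473", "3000000001"]})

# IDs as ints, the way usedIDs is built in the question.
used_ints = [1202623831, 1747352473]

# Comparing a string column against ints matches nothing.
mask_int = df["article_id"].isin(used_ints)

# Comparing against the same IDs as strings matches the first two rows.
mask_str = df["article_id"].isin([str(i) for i in used_ints])

print(mask_int.tolist())
print(mask_str.tolist())
```

isin does not raise on a type mismatch, it just returns False for every row, so nothing is filtered out.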

Try converting usedIDs to strings and then filtering on the column, like this:

usedIDs = [str(used_id) for used_id in usedIDs]
df = df[~df.article_id.isin(usedIDs)]
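As a rough end-to-end sketch, again assuming article_id is a string column (the sample data and the names used_as_str, filtered_str and filtered_int are hypothetical). It also shows the opposite direction, casting the column to int64 instead of converting the list, which is an alternative not taken from the answer above:

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe; article_id is stored as strings.
df = pd.DataFrame(
    {
        "article_id": ["1202623831", "1747352473", "3000000001"],
        "title": ["a", "b", "c"],
    }
)

# IDs already present in the database, as ints (as in the question).
usedIDs = [1202623831, 1747352473, 2580336884]

# Option 1: convert the list to strings so it matches the column's type.
used_as_str = [str(used_id) for used_id in usedIDs]
filtered_str = df[~df["article_id"].isin(used_as_str)]

# Option 2 (alternative): cast the column to int64 for the comparison only.
# int64 is needed because some IDs exceed the 32-bit integer range.
filtered_int = df[~df["article_id"].astype("int64").isin(usedIDs)]

print(filtered_str)
```

Both options keep only the rows whose article_id is not yet in usedIDs. The two original attempts fail for different reasons: df.isin(usedIDs) tests every cell of the whole dataframe rather than one column, and df.drop(usedIDs, axis=0) looks for the IDs among the index labels, which is where the KeyError comes from.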