Remove duplicate rows by counts in Hive SQL?


Some of the existing posts on Stack Overflow helped, but I could not find how to delete duplicate rows by count in Hive.

On some dates there are 2 rows for Apple. How do I select only 1 Apple row per date?

--What the data looks like. Total 14 records

customerID     date product_type            
1234abc       20140105  Orange      
1234abc       20140105  Apple       
1234abc       20140205  Orange      
1234abc       20140205  Apple       
1234abc       20140205  Apple       
1234abc       20140305  Orange      
1234abc       20140305  Apple       
1234abc       20140305  Apple       
1234abc       20140405  Orange      
1234abc       20140405  Apple       
1234abc       20140405  Apple       
1234abc       20140505  Orange      
1234abc       20140505  Apple       
1234abc       20140505  Apple       

--Final Output. Total 10 records

customerID     date product_type    
1234abc       20140105  Orange      
1234abc       20140105  Apple       
1234abc       20140205  Orange      
1234abc       20140205  Apple       
1234abc       20140305  Orange      
1234abc       20140305  Apple       
1234abc       20140405  Orange      
1234abc       20140405  Apple       
1234abc       20140505  Orange      
1234abc       20140505  Apple       

There are 2 answers

Henry L On

I'd suggest a two-step approach. Step 1: create a temp table that holds one copy of each duplicated row, using INSERT ... SELECT like so:

-- CustomerID is a string type because IDs such as 1234abc are not numeric
CREATE TABLE #Temp (Product_Name char(30), [Date] date, CustomerID varchar(30));

-- GROUP BY collapses each duplicated group to a single row
INSERT INTO #Temp (Product_Name, [Date], CustomerID)
SELECT x.Product_Name, x.[Date], x.CustomerID
FROM (
    SELECT COUNT(*) AS dup
          , Product_Name
          , CustomerID
          , [Date]
    FROM dbo.[originaltable]
    GROUP BY [Date], Product_Name, CustomerID
) x
WHERE x.dup > 1;

Then delete the duplicates with

DELETE FROM dbo.[originaltable]
WHERE EXISTS (
    SELECT 1
    FROM #Temp t
    WHERE t.Product_Name = dbo.[originaltable].Product_Name
      AND t.[Date] = dbo.[originaltable].[Date]
      AND t.CustomerID = dbo.[originaltable].CustomerID
);

Step 2: insert the contents of the #Temp table, which holds one copy of each duplicated row, back into the original table.
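
The statements above use SQL Server syntax (a #Temp table, bracketed identifiers), which Hive does not accept, and Hive only supports DELETE on transactional tables. As a rough HiveQL sketch of the same "find the groups with a count above 1, keep one row of each" idea, the queries below assume a table named your_table (a placeholder) with the three columns from the question. The `date` column is backtick-quoted because the name may be treated as a reserved word, depending on the Hive version and settings.

```sql
-- Assumption: your_table is a placeholder name; columns are taken from the question.

-- List every (customerID, date, product_type) group that occurs more than once.
SELECT customerID, `date`, product_type, COUNT(*) AS dup
FROM your_table
GROUP BY customerID, `date`, product_type
HAVING COUNT(*) > 1;

-- Number the rows inside each group and keep only the first one,
-- so each duplicated group is reduced to a single row.
SELECT customerID, `date`, product_type
FROM (
    SELECT customerID, `date`, product_type,
           ROW_NUMBER() OVER (
               PARTITION BY customerID, `date`, product_type
               ORDER BY customerID
           ) AS rn
    FROM your_table
) t
WHERE rn = 1;
```

The ROW_NUMBER() form is mainly useful when the table has extra columns that differ between the duplicates and you need to choose which row survives by changing the ORDER BY.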

Will Du On

select distinct customerID,date,product_type from your_table
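
The query above only returns the de-duplicated rows. If the duplicates should actually be removed from the table, one possible sketch is to overwrite the table with its own distinct rows. This assumes your_table is a regular (non-transactional) Hive table and that the three columns shown are all of its columns; it is worth trying on a copy first.

```sql
-- Check the result size first; for the sample data in the question
-- the 14 rows reduce to 10 distinct rows.
SELECT COUNT(*) AS distinct_rows
FROM (
    SELECT DISTINCT customerID, `date`, product_type
    FROM your_table
) d;

-- Replace the table contents with the distinct rows.
INSERT OVERWRITE TABLE your_table
SELECT DISTINCT customerID, `date`, product_type
FROM your_table;
```

Writing the distinct rows into a new table with CREATE TABLE ... AS SELECT DISTINCT is a more cautious alternative, since the original data stays untouched.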