Remove duplicate rows by counts in Hive SQL?

Question

9.4k views · Asked by sharp on 11 June 2015 at 15:22

Some existing posts here helped, but I could not find one that shows how to remove duplicate rows by count in Hive.

Some dates have two rows for Apple. How do I keep only one Apple row per customer and date?

-- What the data looks like (14 records in total)

customerID    date      product_type
1234abc       20140105  Orange      
1234abc       20140105  Apple       
1234abc       20140205  Orange      
1234abc       20140205  Apple       
1234abc       20140205  Apple       
1234abc       20140305  Orange      
1234abc       20140305  Apple       
1234abc       20140305  Apple       
1234abc       20140405  Orange      
1234abc       20140405  Apple       
1234abc       20140405  Apple       
1234abc       20140505  Orange      
1234abc       20140505  Apple       
1234abc       20140505  Apple

-- Desired final output (10 records in total)

customerID    date      product_type
1234abc       20140105  Orange      
1234abc       20140105  Apple       
1234abc       20140205  Orange      
1234abc       20140205  Apple       
1234abc       20140305  Orange      
1234abc       20140305  Apple       
1234abc       20140405  Orange      
1234abc       20140405  Apple       
1234abc       20140505  Orange      
1234abc       20140505  Apple

Original Q&A

There are 2 answers

**Henry L** · Answer 1 · 2015-06-11T17:10:15+00:00

I'd suggest a two-step approach. Step 1: create a temp table and insert one copy of each duplicated record into it, using INSERT ... SELECT like so:

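-- Note: this is T-SQL (SQL Server) syntax, not HiveQL.
-- CustomerID is a string type because the sample IDs (e.g. 1234abc) are not integers.
CREATE TABLE #Temp (Product_Name CHAR(30), [Date] DATE, CustomerID VARCHAR(30));

-- Collect one copy of every row that occurs more than once.
INSERT INTO #Temp (Product_Name, [Date], CustomerID)
SELECT x.Product_Name, x.[Date], x.CustomerID
FROM (
    SELECT COUNT(*) AS dup
         , [Product_Name]
         , CustomerID
         , [Date]
    FROM dbo.[originaltable]
    GROUP BY [Date], [Product_Name], CustomerID
) x
WHERE x.dup > 1;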

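Then delete every copy of those duplicated rows from the original table:

-- Match on all three columns so only the duplicated rows are removed.
DELETE FROM dbo.[originaltable]
WHERE EXISTS (
    SELECT 1
    FROM #Temp t
    WHERE t.Product_Name = dbo.[originaltable].Product_Name
      AND t.[Date] = dbo.[originaltable].[Date]
      AND t.CustomerID = dbo.[originaltable].CustomerID
);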

Step 2: insert the contents of #Temp, which holds one copy of each duplicated row, back into the original table.
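
The statements in this answer use SQL Server syntax (#Temp, bracketed identifiers), and Hive supports DELETE only on transactional (ACID) tables. As a rough HiveQL sketch of the same count-based idea, assuming the table has the three columns from the question and using the placeholder names your_table and your_table_dedup (neither appears in the question), one could first list the duplicated groups and then rewrite the table with one row per group:

```sql
-- Sketch only: your_table and your_table_dedup are placeholder names.
-- `date` is quoted with backticks because it can be a reserved word in Hive.

-- 1. Inspect which rows are duplicated and how often.
SELECT customerID, `date`, product_type, COUNT(*) AS row_count
FROM your_table
GROUP BY customerID, `date`, product_type
HAVING COUNT(*) > 1;

-- 2. Stage exactly one row per (customerID, date, product_type) group.
CREATE TABLE your_table_dedup AS
SELECT customerID, `date`, product_type
FROM your_table
GROUP BY customerID, `date`, product_type;

-- 3. Replace the original contents with the deduplicated rows.
INSERT OVERWRITE TABLE your_table
SELECT customerID, `date`, product_type
FROM your_table_dedup;
```

With the sample data from the question, the first query should return the four Apple groups (20140205 through 20140505) with a row_count of 2 each, and the rewritten table should hold the 10 expected records.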

**Will Du** · Answer 2 · 2015-06-11T23:43:45+00:00

select distinct customerID,date,product_type from your_table
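
SELECT DISTINCT on its own only returns a deduplicated result; it does not change the stored table. A minimal sketch of one way to persist the result, assuming window functions are available (Hive 0.11 or later) and that the table can safely be overwritten, is to number the rows within each duplicate group and keep only the first. This variant is mainly useful when the table has additional columns and one specific row per key should survive; for the three-column table in the question it gives the same rows as SELECT DISTINCT.

```sql
-- Sketch only: your_table is the placeholder name from the answer above.
-- `date` is quoted with backticks because it can be a reserved word in Hive.
INSERT OVERWRITE TABLE your_table
SELECT customerID, `date`, product_type
FROM (
    SELECT customerID, `date`, product_type,
           -- Number the rows inside each group of identical keys.
           ROW_NUMBER() OVER (
               PARTITION BY customerID, `date`, product_type
               ORDER BY customerID
           ) AS rn
    FROM your_table
) ranked
WHERE rn = 1;  -- keep one row per group
```

Running the inner SELECT by itself first is a cheap way to check which rows would be dropped (those with rn > 1) before overwriting anything.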
