Insert overwrite doesn't delete all the old data files


We are running INSERT OVERWRITE on a Hive table. Most of the time it overwrites as expected, i.e. it deletes the old files and writes the new ones. Occasionally, though, some of the old files are not deleted even though the new files are created, which leaves inconsistent data in the table.

I am not able to reproduce this behavior. Has anyone faced a similar issue, or does anyone have a pointer on what might be happening?

We are using Hive version 2.1.1.

Below are the ORC table structure and the INSERT OVERWRITE command. fileid is the unique column in the table. The table size is around 500 GB.

Hive table structure:

CREATE EXTERNAL TABLE `tier0.file`(
  `filegroup` struct<collection:struct<name:string,code:string,royaltystate:string,enterprisecollectionid:bigint,isactive:boolean,active:boolean,filefamily:string,contentfamily:string,cfwcollectionname:string,droplocation:string,applyembeddestinationsite:boolean,associatedsource:string,excluderestriction:boolean,ownershiptype:string,collectionid:bigint,notes:string,bundlerestrictions:array<struct<bundleid:bigint,bundletype:string>>,pricecodes:array<struct<collectioncode:string,pricecode:string,iptccategory:string>>>,istockcollection:string,events:array<string>,paidassignmentids:array<string>,sisterfiles:array<string>,clonedfiles:array<string>,vcd:array<string>,source:struct<parentsource:string,parentsourceid:bigint,childsource:string,childsourceid:bigint>>, 
  `filemanagement` struct<filemanagement:string,destinationsites:array<string>,readyforsale:boolean,readyforpublish:boolean,reviewstatus:string,excludedestinationsites:array<string>,displaystatus:string,inactivedate:string,pulledreason:string,pulledreasonaudit:string,approvaldate:string,futurepulledreason:string,futureinactivedate:string,futureactivedate:string>, 
  `primarylanguage` string, 
  `audithistory` struct<note:string,notecategory:string>, 
  `contents` array<struct<deliverylocation:string,contenttype:string,submission:array<struct<data:struct<mimetype:string,fileinfo:struct<filelocation:string,filesize:bigint,filename:string,checksum:string,checksumtype:string>,submitdate:string,createdate:string,mediaformat:string,offlinehd:boolean,postertime:double,shoottype:string,stripaudio:boolean,timein:string,timeout:string,videoencoding:struct<compression:string,bitdepth:string,bitrate:double,definition:string,framerate:string,framesize:string,scantype:string,wrapper:string,height:int,width:int,interlaced:boolean>,rotation:string,anamorphic:boolean,pixelwidth:int,pixelheight:int,colorprofile:string,samplesperpixel:string,resolution:string,resolutionunit:string,colormode:string,animated:boolean,imageorientation:string,filmformat:string,duration:string,artistname:string,directlicense:boolean,lyrichook:string,albumtitle:string,parenttrackid:string,key:string,timesignature:string,publicdomain:string,lyrics:string,tracktitle:string,tracktype:string,speed:string,genre:string,mood:string,lyricpov:string,instrument:string,vocal:string,transformedmetadata:map<string,string>,iptc:map<string,string>,exif:map<string,string>,xmp:map<string,string>,xmpraw:map<string,string>>,sizeid:int,sizename:string,keyname:string,schemauri:string,extension:string,fileindex:int,suffix:string,readonly:boolean,ismaster:boolean>>,filepack:array<struct<data:struct<mimetype:string,fileinfo:struct<filelocation:string,filesize:bigint,filename:string,checksum:string,checksumtype:string>,submitdate:string,createdate:string,mediaformat:string,offlinehd:boolean,postertime:double,shoottype:string,stripaudio:boolean,timein:string,timeout:string,videoencoding:struct<compression:string,bitdepth:string,bitrate:double,definition:string,framerate:string,framesize:string,scantype:string,wrapper:string,height:int,width:int,interlaced:boolean>,rotation:string,anamorphic:boolean,pixelwidth:int,pixelheight:int,colorprofile:string,samplesperpixel:string,resolution:string,resolutionunit:string,colormode:string,animated:boolean,imageorientation:string,filmformat:string,duration:string,artistname:string,directlicense:boolean,lyrichook:string,albumtitle:string,parenttrackid:string,key:string,timesignature:string,publicdomain:string,lyrics:string,tracktitle:string,tracktype:string,speed:string,genre:string,mood:string,lyricpov:string,instrument:string,vocal:string,transformedmetadata:map<string,string>,iptc:map<string,string>,exif:map<string,string>,xmp:map<string,string>,xmpraw:map<string,string>>,sizeid:int,sizename:string,keyname:string,schemauri:string,extension:string,fileindex:int,suffix:string,readonly:boolean,ismaster:boolean>>,createdate:string,camerashotdate:string,updatedate:string,audithistory:array<struct<note:string,notecategory:string>>,contract:struct<parentsource:string,contractid:bigint,contentprovidername:string,contentprovidertitle:string,vendornumber:bigint,childsource:string,parentsourceid:bigint,childsourceid:bigint,istockusername:string,istockuserid:bigint,iptccredit:string,signatorycontentprovidername:string,signatoryguid:string,startdate:string,enddate:string>,release:struct<releaseid:string,releaseinformation:string,releasemetadata:array<struct<releasemetadataid:string,aliasid:string,releasetype:string,filelocation:string,name:string,agerange:string,age:string,birthdate:string,gender:string,ethnicity:string,ethnicities:array<string>,talentid:array<string>,usage:array<string>,teamsreleaseid:string>>>,contentmanagement:struct<state:string,notes:string,messages:array<string>>,contentsource:struct<clientsystemid:string,submittedby:string,ingestionproviderid:int,submissionnotes:string,clientlastmodifieddate:string>,alternateids:array<struct<alternateid:string,alternateidtype:string>>,homeproperty:string,mediatype:int,colorpalettes:struct<rgbmodel:array<struct<red:int,green:int,blue:int,presence:string,x:string,y:string,density:string>>>,transcript:string,hasaudio:boolean,visualcolor:string,era:string,cliptype:string,productiontitle:string,footagespeed:string>>, 
  `submitdate` string, 
  `licensecharacteristics` struct<filefamily:string,restrictioninstructions:string,riskcategory:string,advancedroyaltybearing:boolean,pricingcode:string,callforimage:boolean,exclusivecontent:boolean,subscriptioneligible:boolean,publicistapprovalrequired:boolean,whollyowned:boolean,royaltybearing:string,bundletags:array<string>,paidassignment:boolean,preferredlicensemodel:string,exclusivity:string,parentbundlecollection:string,restrictions:array<struct<id:string,beginningdate:string,enddate:string,controlledrestrictions:array<string>>>>, 
  `fileid` string, 
  `updatedate` string, 
  `version` int, 
  `exclusionrouting` array<string>, 
  `inclusionrouting` array<string>, 
  `errors` map<string,array<struct<errorcode:string,message:string>>>, 
  `dp_schema` string, 
  `dp_source` string, 
  `dp_source_type` string, 
  `dp_proc_time` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3a://bucket/tier0/file/'

Insert overwrite command:

insert overwrite table stg.tier0_file
SELECT 
  filegroup, 
  filemanagement, 
  primarylanguage,
  audithistory,
  contents,
  submitdate,
  licensecharacteristics,  
  fileid,  
  updatedate,  
  version,  
  errors,
  dp_schema , 
  dp_source , 
  dp_source_type , 
  dp_proc_time
FROM (
SELECT 
  filegroup, 
  filemanagement, 
  primarylanguage,
  audithistory,
  contents,
  submitdate,
  licensecharacteristics,  
  fileid,  
  updatedate,  
  version, 
  errors,
  dp_schema , 
  dp_source , 
  dp_source_type , 
  dp_proc_time,
  ROW_NUMBER() OVER (PARTITION BY fileid ORDER BY version DESC, dp_proc_time DESC) AS rownum
  FROM 
  (   SELECT 
          filegroup,filemanagement,primarylanguage,audithistory,contents,submitdate,licensecharacteristics,fileid,updatedate,version,errors,dp_schema,dp_source,dp_source_type,dp_proc_time
      FROM tier0.file
      UNION ALL
      SELECT
          filegroup,filemanagement,primarylanguage,audithistory,contents,submitdate,licensecharacteristics,fileid,updatedate,version,errors,dp_schema,dp_source,dp_source_type,dp_proc_time
      FROM stg.file
  ) base ) rnk 
where rnk.rownum = 1;
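
For anyone trying to confirm the symptom, a minimal check could look like the sketch below. It assumes that fileid should be unique in stg.tier0_file after the overwrite, so any fileid that appears more than once suggests rows are still being read from leftover files. INPUT__FILE__NAME is a Hive virtual column that reports the file each row came from; the aliases and the LIMIT are only illustrative.

```sql
-- Sketch only: list fileid values that occur more than once after the overwrite,
-- together with the files that contributed those rows.
-- Assumption: the dedup query leaves exactly one row per fileid.
SELECT
  fileid,
  COUNT(*)                       AS copies,
  COLLECT_SET(INPUT__FILE__NAME) AS source_files
FROM stg.tier0_file
GROUP BY fileid
HAVING COUNT(*) > 1
LIMIT 100;
```

If this returns rows, comparing the paths in source_files against the files written by the latest run should show which ones were left behind.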