Suppose I have several customers and I store their information, such as customer_id, customer_name, and customer_emailid. If a customer leaves and wants their personal information removed from my HDFS, how can I do that?
I have the two approaches below to achieve this.
Approach 1:
1. Create an internal table on top of HDFS
2. Create an external table from the first table using filter logic
3. While creating the second table, apply UDFs on specific columns for further column-level filtering
Approach 2:
Spark => read, filter, write
Is there any other solution?
Approach 2 is also possible in Hive: select, filter, write.
Create a table on top of the directory in HDFS. Whether it is external or managed does not matter in this context, although external is better if you are going to drop the table later and keep the data as is. Then run INSERT OVERWRITE on the table or partition from a SELECT with a filter that excludes the customer's rows.
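
As a rough sketch of the select, filter, write idea in Hive, the small Python helper below only renders the two HiveQL statements as strings so they can be reviewed before being run in beeline or any Hive client. The function name build_erasure_statements, the table name, the HDFS path, the column types, and the default file format are all assumptions made for illustration; only the three column names come from the question.

```python
def build_erasure_statements(table, location, customer_id):
    """Render HiveQL that rewrites a table without one customer's rows.

    Hypothetical helper: the schema and storage format below are assumed,
    so adjust them to match the files that actually sit in HDFS.
    """
    if not isinstance(customer_id, int):
        # Only integers are interpolated, so no quoting or escaping is needed.
        raise TypeError("customer_id must be an int")

    # Step 1: external table over the existing directory (data is kept
    # in place if the table is dropped later).
    create = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  customer_id BIGINT,\n"
        "  customer_name STRING,\n"
        "  customer_emailid STRING\n"
        ")\n"
        f"LOCATION '{location}'"
    )

    # Step 2: overwrite the table with every row except the leaving customer.
    # The IS NULL branch keeps rows whose id is NULL; a bare "<>" comparison
    # evaluates to NULL for them and would silently drop those rows as well.
    overwrite = (
        f"INSERT OVERWRITE TABLE {table}\n"
        f"SELECT * FROM {table}\n"
        f"WHERE customer_id IS NULL OR customer_id <> {customer_id}"
    )
    return [create, overwrite]


if __name__ == "__main__":
    # Example values are placeholders, not taken from the question.
    for stmt in build_erasure_statements("customers", "/data/customers", 42):
        print(stmt + ";\n")
```

For a partitioned table, the same pattern would use INSERT OVERWRITE TABLE ... PARTITION (...) so that only the affected partitions are rewritten.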