Is there a tool in the Hadoop ecosystem that can detect when new data has been added to HDFS?
Specifically, I want to remotely execute a Sqoop import job from an external database (no merge, only a new table). Then, once this data is written to HDFS, a Spark script should be triggered to process the newly added data and do some work with it.
Is there any feature in Hadoop that does this kind of job ?
I could simply execute the Spark script after the Sqoop import job finishes, but I would like to know whether such a feature exists; I haven't found any yet.
Thanks in advance.
Yes, there is. There's a workflow scheduler called Oozie in the Hadoop ecosystem that handles exactly this kind of scenario.
An Oozie coordinator can trigger a workflow either on a fixed schedule or on data availability; your case is data availability. See the Oozie documentation for more details: Oozie doc for coordinator job
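A minimal coordinator sketch for the data-availability case might look like the following. All paths, names, and dates here are placeholders you'd replace with your own; it assumes the Sqoop import writes a `_SUCCESS` flag file (the standard Hadoop job completion marker) into a dated directory, and that the Spark processing is wrapped in a separate Oozie workflow at `app-path`:

```xml
<coordinator-app name="sqoop-to-spark" frequency="${coord:days(1)}"
                 start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- Dataset produced by the Sqoop import; URI template is an example -->
    <dataset name="imported-table" frequency="${coord:days(1)}"
             initial-instance="2016-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://namenode/data/imports/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- Oozie waits until this flag file appears before triggering -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="imported-table">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <!-- Workflow containing the Spark action; path is a placeholder -->
      <app-path>hdfs://namenode/apps/spark-process-wf</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <!-- Resolved HDFS path of the newly available data -->
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

The coordinator materializes an action each day, but holds it until the dataset instance (including the done-flag) exists in HDFS; only then does it launch the workflow, passing the resolved input path to your Spark job.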