Pass multiple files, selected by date, from the same directory as input to a MapReduce job


I have a requirement where I have to use multiple files with a specific date, all from the same directory, as input to a MapReduce job.

I am not sure how I can do it.

hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/*.snappy /user/hdfs/eventlog_output/op1

Example: from the eventlog directory I need only the present date's file(s) for processing.

The eventlog directory receives log data from a Flume logger agent, so thousands of new files arrive daily. Of those, I need only the present date's file(s) for my process.

Thanks.

Regards, Mohan.

Answer by Ronak Patel (accepted):

You can use the bash date command with a format string, e.g. $(date +%Y-%m-%d):

For example, the command below will look for the /user/hdfs/eventlog/2017-01-04.snappy log file, and the output will be stored in the /user/hdfs/eventlog_output/2017-01-04 HDFS directory:

hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/$(date +%Y-%m-%d).snappy /user/hdfs/eventlog_output/$(date +%Y-%m-%d)

To get a specific date format, see this answer, or run man date to learn more about the date command.
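For instance, a few common format strings look like this (the dates shown in the comments are illustrative, assuming the current date is 2017-01-04; the last line requires GNU date):

```shell
# A few common date format strings
date +%Y-%m-%d                 # 2017-01-04
date +%Y%m%d                   # 20170104
date +%d-%m-%Y                 # 04-01-2017
date -d yesterday +%Y-%m-%d    # previous day's date (GNU date only)
```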


Update after more details were provided:

1. Explanation:

$ file=$(hadoop fs -ls /user/cloudera/*.snappy|grep $(date +%Y-%m-%d)|awk '{print $NF}')
$ echo $file
/user/cloudera/xyz.snappy
$ file_out=$(echo $file|awk -F '/' '{print $NF}'|awk -F '.' '{print $1}')
$ echo $file_out
xyz
$ hadoop jar EventLogsSW.jar EventSuspiciousWatch $file /user/hdfs/eventlog_output/$file_out
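Alternatively, all of today's files can be handed to a single job as one comma-separated path list. This is a sketch that assumes the EventSuspiciousWatch driver passes its first argument to FileInputFormat.addInputPaths() (which accepts comma-separated paths); if the driver uses addInputPath() instead, it will not work:

```shell
# Sketch: collect today's files into one comma-separated list (assumption:
# the driver calls FileInputFormat.addInputPaths() on its first argument).
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy \
    | grep "$(date +%Y-%m-%d)" \
    | awk '{print $NF}' \
    | paste -sd, -)
hadoop jar EventLogsSW.jar EventSuspiciousWatch "$files" /user/hdfs/eventlog_output/$(date +%Y-%m-%d)
```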

2. Make a shell script to reuse these commands daily, in a more structured way.

This script can process more than one file in HDFS for the present system date:

#!/bin/sh
#get today's snappy files
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep "$(date +%Y-%m-%d)" | awk '{print $NF}')

counter=0
#Only process if today's file(s) are available...
if [ -n "$files" ]
then
    # file(s) found; now create today's dir
    hadoop fs -mkdir /user/hdfs/eventlog/$(date +%Y-%m-%d)
    #move each file to today's dir
    for file in $files
    do
        hadoop fs -mv "$file" /user/hdfs/eventlog/$(date +%Y-%m-%d)/
        counter=$((counter + 1))
    done
    #run hadoop job on today's dir
    hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/$(date +%Y-%m-%d) /user/hdfs/eventlog_output/$(date +%Y-%m-%d)
fi

echo "Total processed file(s): $counter"
echo "Done processing today's file(s)..."
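To run the script automatically each day, a crontab entry can be added. The script path and schedule below are hypothetical placeholders; adjust them to your environment:

```shell
# Hypothetical crontab entry (install with: crontab -e).
# Runs the daily script at 01:00 and appends its output to a log file.
0 1 * * * /home/hdfs/process_eventlog.sh >> /tmp/process_eventlog.log 2>&1
```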

This script can also process more than one file - one file at a time - in HDFS for the present system date:

#!/bin/sh
#get today's snappy files
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep "$(date +%Y-%m-%d)" | awk '{print $NF}')

counter=0
#Only process if today's file(s) are available...
if [ -n "$files" ]
then
    for file in $files
    do
        echo "Processing file: $file ..."
        #derive the output dir name from the file name (strip path and extension)
        file_out=$(echo "$file" | awk -F '/' '{print $NF}' | awk -F '.' '{print $1}')

        #run hadoop job ($file is already a full HDFS path from the ls above)
        hadoop jar EventLogsSW.jar EventSuspiciousWatch "$file" /user/hdfs/eventlog_output/"$file_out"

        counter=$((counter + 1))
    done
fi

echo "Total processed file(s): $counter"
echo "Done processing today's file(s)..."