Pass multiple files, selected by date, from the same directory as input to a MapReduce job


I have a requirement where I have to use multiple files with a specific date, all from the same directory, as input to a MapReduce job.

I am not sure how I can do it.

hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/*.snappy /user/hdfs/eventlog_output/op1

Example: from the eventlog directory I need only the present date's file(s) for processing.

The eventlog directory receives log data from a Flume logger agent, so thousands of new files arrive daily. Of those, I need only the present date's file(s) for my process.

Thanks.

Regards, Mohan.

Answer by Ronak Patel (accepted):

You can use the bash date command with a format string, e.g. $(date +%Y-%m-%d):

For example, the command below will look for the /user/hdfs/eventlog/2017-01-04.snappy log file, and the output will be stored in the /user/hdfs/eventlog_output/2017-01-04 HDFS directory:

hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/$(date +%Y-%m-%d).snappy /user/hdfs/eventlog_output/$(date +%Y-%m-%d)

To get a specific date format, see this answer, or run man date to learn more about the date command.
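For instance, a few common format strings look like this (the dates shown in the comments are illustrative, assuming the current date is 2017-01-04; the last line requires GNU date):

```shell
# A few common date format strings
date +%Y-%m-%d                 # 2017-01-04
date +%Y%m%d                   # 20170104
date +%d-%m-%Y                 # 04-01-2017
date -d yesterday +%Y-%m-%d    # previous day's date (GNU date only)
```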


Update after more details were provided:

1. Explanation:

$ file=$(hadoop fs -ls /user/cloudera/*.snappy|grep $(date +%Y-%m-%d)|awk '{print $NF}')
$ echo $file
/user/cloudera/xyz.snappy
$ file_out=$(echo $file|awk -F '/' '{print $NF}'|awk -F '.' '{print $1}')
$ echo $file_out
xyz
$ hadoop jar EventLogsSW.jar EventSuspiciousWatch $file /user/hdfs/eventlog_output/$file_out
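Alternatively, all of today's files can be handed to a single job as one comma-separated path list. This is a sketch that assumes the EventSuspiciousWatch driver passes its first argument to FileInputFormat.addInputPaths() (which accepts comma-separated paths); if the driver uses addInputPath() instead, it will not work:

```shell
# Sketch: collect today's files into one comma-separated list (assumption:
# the driver calls FileInputFormat.addInputPaths() on its first argument).
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy \
    | grep "$(date +%Y-%m-%d)" \
    | awk '{print $NF}' \
    | paste -sd, -)
hadoop jar EventLogsSW.jar EventSuspiciousWatch "$files" /user/hdfs/eventlog_output/$(date +%Y-%m-%d)
```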

2. Make a shell script to reuse these commands daily, in a more structured way.

This script can process more than one file in HDFS for the present system date:

#!/bin/sh
#get today's snappy files
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep "$(date +%Y-%m-%d)" | awk '{print $NF}')

counter=0
#Only process if today's file(s) are available...
if [ -n "$files" ]
then
    # file(s) found; now create today's dir
    hadoop fs -mkdir /user/hdfs/eventlog/$(date +%Y-%m-%d)
    #move each file to today's dir
    for file in $files
    do
        hadoop fs -mv "$file" /user/hdfs/eventlog/$(date +%Y-%m-%d)/
        counter=$((counter + 1))
    done
    #run hadoop job on today's dir
    hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/$(date +%Y-%m-%d) /user/hdfs/eventlog_output/$(date +%Y-%m-%d)
fi

echo "Total processed file(s): $counter"
echo "Done processing today's file(s)..."
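To run the script automatically each day, a crontab entry can be added. The script path and schedule below are hypothetical placeholders; adjust them to your environment:

```shell
# Hypothetical crontab entry (install with: crontab -e).
# Runs the daily script at 01:00 and appends its output to a log file.
0 1 * * * /home/hdfs/process_eventlog.sh >> /tmp/process_eventlog.log 2>&1
```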

This script can also process more than one file - one file at a time - in HDFS for the present system date:

#!/bin/sh
#get today's snappy files
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep "$(date +%Y-%m-%d)" | awk '{print $NF}')

counter=0
#Only process if today's file(s) are available...
if [ -n "$files" ]
then
    for file in $files
    do
        echo "Processing file: $file ..."
        #derive the output dir name from the file name (strip path and extension)
        file_out=$(echo "$file" | awk -F '/' '{print $NF}' | awk -F '.' '{print $1}')

        #run hadoop job ($file is already a full HDFS path from the ls above)
        hadoop jar EventLogsSW.jar EventSuspiciousWatch "$file" /user/hdfs/eventlog_output/"$file_out"

        counter=$((counter + 1))
    done
fi

echo "Total processed file(s): $counter"
echo "Done processing today's file(s)..."