Linux: parsing space-delimited log files


I need to parse Apache access log files which have 16 space-delimited columns, that is,

xyz abc ... ... home?querystring

I need to count the total number of hits for each page in that file, that is, the total number of home page hits, ignoring the query string.

For some lines the URL is in column 16 and for others it's in column 14 or 15. Hence I need to parse each line in reverse order (get the last column, ignore the query string of the last column, aggregate page hits).

I am new to Linux and shell scripting. How do I approach this? Do I have to look into a tool like awk or sed, or plain shell scripting? Can you give a small code sample that would perform such a task?

ANSWER: a Perl one-liner solved the problem:

perl -lane | scalar array
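
The exact one-liner isn't shown above, but a sketch of what such a Perl command could look like is below (access.log is a placeholder filename, and this is an assumption, not necessarily the command that was actually used):

perl -lane '$hits{ (split /\?/, $F[-1])[0] }++; END { print "$_ $hits{$_}" for sort keys %hits }' access.log

Here -a autosplits each line into @F, $F[-1] is the last column, and split /\?/ drops the query string before counting hits per page.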


There are 3 answers

Answer from aplassard:

Well, for starters, if you are only interested in working with columns 14-16, I would start by running:

cut -d\  -f14-16 input_file.log | awk '{ one = match($1,/www/)
                                         two = match($2,/www/)
                                         three = match($3,/www/)
                                         if (one)
                                              print $1
                                         else if (two)
                                              print $2
                                         else if (three)
                                              print $3
                                       }'

Note: there are two spaces after the d\ : the first (escaped) space is cut's field delimiter, and the second separates it from the -f option.

You can then easily count up the URLs that you see, as sketched below. I also think this would be solved a lot more easily with a few lines of Python or Perl.
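
For example, one way to do that counting (a sketch, with input_file.log as a placeholder name and assuming every URL contains www) is to strip the query string with sed and then tally with sort and uniq -c:

cut -d\  -f14-16 input_file.log | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /www/) { print $i; break } }' |
    sed 's/?.*//' | sort | uniq -c | sort -rn

The final sort -rn lists the most-hit pages first.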

Answer from Ed Morton:

It's hard to say without a few lines of concrete sample input and expected output, but it sounds like all you need is:

awk -F'[ ?]' '{sum[$(NF-1)]++} END{for (url in sum) print url, sum[url]}' file

For example:

$ cat file                                                                      
xyz abc ... ... http://www.google.com?querystring
xyz abc ... ... some other http://www.google.com?querystring1
xyz abc ... some stuff we ignore http://yahoo.com?querystring1
$ 
$ awk -F'[ ?]' '{sum[$(NF-1)]++} END{for (url in sum) print url, sum[url]}' file
http://www.google.com 2
http://yahoo.com 1
Answer from fonini:

You can read input line by line using the bash read builtin:

while read my_variable; do
    echo "The text is: $my_variable"
done

To read from a specific file, use the < input redirection:

while read my_variable; do
    echo "The text is: $my_variable"
done < my_logfile

Now, to get the last column, you can use the ${var##* } construction. For example, if the variable my_var holds the string some_file_name, then ${my_var##*_} is the same string but with everything before (and including) the last _ deleted.
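
A quick way to see this in a shell (using that same example value):

my_var=some_file_name
echo "${my_var##*_}"    # prints: name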

We come up with:

while read line; do
    echo "The last column is: ${line##* }"
done < my_logfile

If you want to echo it to another file, use the >> redirect:

while read line; do
    echo "The last column is: ${line##* }" >> another_file
done < my_logfile

Now, to take away the query string, you can use the same technique:

while read line; do
    last_column="${line##* }"
    url="${last_column%%\?*}"
    echo "The last column without querystring is: $url" >> another_file
done < my_logfile

This time, we have %%\?* instead of ##*\? because we want to delete everything from the first ? onward, instead of everything up to the last one. (Note that I have escaped the character ?, which is special to bash, so it matches a literal question mark.) You can read all about it in the bash manual's section on parameter expansion.
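
A small illustration of the difference (the sample value is made up):

last_column='http://www.example.com?a=1?b=2'
echo "${last_column%%\?*}"    # http://www.example.com   (delete from the first ? to the end)
echo "${last_column##*\?}"    # b=2                      (delete up to and including the last ?)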

I didn't quite understand how you want to aggregate the page hits, but I think the main idea is there.

EDIT: Now the code works. I had forgotten the do bash keyword. Also, we need to use >> instead of > so that another_file isn't overwritten every time we echo to it; with >> we append to the file instead. I have also corrected %% where I had mistakenly written ##.
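
Putting the pieces together, here is one possible sketch of the counting step the question asks for (my_logfile is a placeholder, and the use of a bash 4+ associative array is an assumption, not part of the original answer):

#!/usr/bin/env bash
# Count hits per page, ignoring the query string (requires bash 4+ for declare -A).
declare -A hits
while read -r line; do
    last_column="${line##* }"          # last space-delimited column
    url="${last_column%%\?*}"          # drop the query string
    hits["$url"]=$(( ${hits["$url"]:-0} + 1 ))
done < my_logfile

for url in "${!hits[@]}"; do
    echo "$url ${hits[$url]}"
done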