Using Python, how do we delete the auth_user column from proxy log files?


I have hundreds of proxy log files in one folder. I want to delete the auth_user column from all of them and write the results to another folder.

The auth_user column is enclosed in double quotes. The biggest problem is that I cannot use the space character as the delimiter, because some log files have no space between the timestamp and auth_user. I tried using the double quote as the delimiter, but that gives weird results, since sometimes there is nothing between the pair of double quotes.

What I have so far:

for src_name in glob.glob(os.path.join(source_dir, '*.log')):
    base = os.path.basename(src_name)
    dest_name = os.path.join(dest_dir,base)
    with open(src_name, 'rb') as infile:
        with open(dest_name, 'w') as outfile:
             reader = csv.reader(infile, delimiter='"')
             writer = csv.writer(outfile, delimiter='"')
             for row in reader:
                 row[1] = ''
                 writer.writerow(row)

The log file is as follows (time_stamp "auth_user" src_ip):

[21/Apr/2013:00:00:00 -0300]"cn=john smith,ou=central,ou=microsoft,o=com" 192.168.2.5
[21/Apr/2013:00:00:01 -0400]"jsmith" 192.168.4.5
[21/Apr/2013:00:00:01 -0400]"" 192.168.15.5
[22/Apr/2013:00:00:01 -0400]"" 192.168.4.5
[22/Apr/2013:00:00:01 -0400]"jkenndy" 192.168.14.5

I would like to change it into this (time_stamp src_ip):

[21/Apr/2013:00:00:00 -0300] 192.168.2.5
[21/Apr/2013:00:00:01 -0400] 192.168.4.5
[21/Apr/2013:00:00:01 -0400] 192.168.15.5
[22/Apr/2013:00:00:01 -0400] 192.168.4.5
[22/Apr/2013:00:00:01 -0400] 192.168.14.5

There are 3 answers

albert On

Assuming that each file has the structure:

#[some timestamp here]"auth_user"
#[21/Apr/2013:00:00:00 -0300]""
#[21/Apr/2013:00:00:00 -0300]"username"
#[21/Apr/2013:00:00:00 -0300]"machine$"
#[21/Apr/2013:00:00:00 -0300]"cn=john smith,ou=central,ou=microsoft,o=com"
#[21/Apr/2013:00:00:01 -0400]"jsmith"
#[21/Apr/2013:00:00:01 -0400]""
#[21/Apr/2013:00:00:01 -0400]""

Assuming that the first two lines need to be skipped:

#!/usr/bin/env python3
# coding: utf-8

with open('file.log') as f:
    for line_number, line in enumerate(f):
        # line_number starts at zero, skip both lines at beginning of file
        if line_number > 1:
            # process file here, replace print statement with appropriate code
            # each line already ends with '\n', so suppress print's own newline
            print(line, end='')
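
The "process file here" placeholder could be filled in without the csv module or a regex, using plain string methods. The following is only a sketch: `strip_auth_user` is a hypothetical helper name, and it assumes the timestamp ends at the first `]` and that the auth_user value never contains a double quote itself.

```python
def strip_auth_user(line):
    """Return line with the quoted auth_user field removed.

    Assumes the layout [time_stamp]"auth_user" src_ip, with or without
    a space before the opening quote. Lines that do not fit this layout
    are returned unchanged.
    """
    head, bracket, rest = line.partition(']')
    if not bracket:
        return line  # no closing bracket, e.g. a header line
    rest = rest.lstrip(' ')
    if not rest.startswith('"'):
        return line  # no quoted field after the timestamp
    closing = rest.find('"', 1)  # position of the closing quote
    if closing == -1:
        return line  # unbalanced quotes, leave the line alone
    return head + bracket + rest[closing + 1:]


samples = [
    '[21/Apr/2013:00:00:00 -0300]"cn=john smith,ou=central,ou=microsoft,o=com" 192.168.2.5\n',
    '[21/Apr/2013:00:00:01 -0400]"" 192.168.15.5\n',
    '[22/Apr/2013:00:00:01 -0400] "jkenndy" 192.168.14.5\n',
]
for sample in samples:
    print(strip_auth_user(sample), end='')
```

Inside the loop, the print call would then become something like `outfile.write(strip_auth_user(line))`, with `outfile` opened for writing in the destination folder.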
martineau On

I would use the re (regular expression) module to break each line of the log file into three groups, and then write only the first and third groups to the output file:

import glob
import os
import re

pattern = re.compile(r'''(\[.+\])(".*")( .+)''')

for src_name in glob.glob(os.path.join(source_dir, '*.log')):
    base = os.path.basename(src_name)
    dest_name = os.path.join(dest_dir, base)
    with open(src_name, 'rt') as infile, open(dest_name, 'wt') as outfile:
        for line in infile:
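            match = pattern.search(line)
            if match is None:
                # line doesn't fit the expected layout (e.g. a header): copy it unchanged
                outfile.write(line)
                continue
            groups = match.groups()
            outfile.write(groups[0] + groups[2] + '\n')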
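
One thing to watch: the pattern above expects the opening quote to follow the `]` immediately, so a line with a space between the timestamp and auth_user would not match. A possible variant that tolerates optional whitespace is sketched below. `PATTERN` and `drop_auth_user` are hypothetical names introduced for illustration, and the sketch assumes that neither the timestamp nor auth_user contains `]` or `"` respectively.

```python
import re

# Variant of the pattern: \s* allows an optional space between the
# timestamp and the quoted auth_user field, which is matched but not captured.
PATTERN = re.compile(r'(\[[^\]]+\])\s*"[^"]*"(.*)')


def drop_auth_user(line):
    """Return line without its quoted auth_user field."""
    text = line.rstrip('\n')
    newline = line[len(text):]  # keep the original line ending, if any
    match = PATTERN.match(text)
    if match is None:
        return line  # leave unrecognised lines untouched
    return match.group(1) + match.group(2) + newline


for sample in (
    '[21/Apr/2013:00:00:01 -0400]"jsmith" 192.168.4.5\n',
    '[21/Apr/2013:00:00:01 -0400] "jsmith" 192.168.4.5\n',
):
    print(drop_auth_user(sample), end='')
```

Both sample lines should come out in the same `[time_stamp] src_ip` form, whether or not the input had a space before the quoted field.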
stevieb On

Instead of using CSV, can you just open the file normally, and use a regex? The following will remove the auth_user column regardless of whether there's a space after the timestamp, or whether there is anything inside the quotes or not:

import re

with open('in.txt', 'r') as fh:
    for line in fh:
        line = re.sub(r'(?:(?<=\d{4}])|(?<=#time_stamp))\s*".*?"', '', line)
        # line still ends with '\n', so don't let print add a second one
        print(line, end='')

Input:

#time_stamp "auth_user" src_ip 
[21/Apr/2013:00:00:00 -0300]"cn=johnsmith,ou=central,ou=microsoft,o=com" 192.168.2.5
[21/Apr/2013:00:00:01 -0400]"jsmith" 192.168.4.5
[21/Apr/2013:00:00:01 -0400]"" 192.168.15.5
[22/Apr/2013:00:00:01 -0400]"" 192.168.4.5
[22/Apr/2013:00:00:01 -0400]"jkenndy" 192.168.14.5

Output:

#time_stamp src_ip
[21/Apr/2013:00:00:00 -0300] 192.168.2.5
[21/Apr/2013:00:00:01 -0400] 192.168.4.5
[21/Apr/2013:00:00:01 -0400] 192.168.15.5
[22/Apr/2013:00:00:01 -0400] 192.168.4.5
[22/Apr/2013:00:00:01 -0400] 192.168.14.5
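
To cover the "hundreds of files, output to another folder" part of the question, the same substitution could be applied file by file. This is a rough sketch rather than a tested production script: `strip_folder` and `AUTH_USER` are hypothetical names, the `*.log` glob is taken from the question's code, and the files are assumed to be readable as text in the platform's default encoding.

```python
import re
from pathlib import Path

# Same regex as above, compiled once and reused for every line.
AUTH_USER = re.compile(r'(?:(?<=\d{4}])|(?<=#time_stamp))\s*".*?"')


def strip_folder(source_dir, dest_dir):
    """Copy every *.log file from source_dir to dest_dir without auth_user."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    for src in sorted(Path(source_dir).glob('*.log')):
        with src.open('r') as infile, (dest / src.name).open('w') as outfile:
            for line in infile:
                # count=1: only the first quoted field after the timestamp is removed
                outfile.write(AUTH_USER.sub('', line, count=1))
```

A call such as `strip_folder('logs', 'logs_clean')` (both folder names made up here) would leave the source files untouched and write the stripped copies under the same file names.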