Skip Line By Prefix

68 views Asked by At

I've been trying to use Azure Data Lake Analytics to do some analysis over a large group of IIS log files. So far I can get this to work for a single, best-case file using something like this:

@results = 
     EXTRACT
     s_date DateTime,
     s_time string,
     s_ip string,
     cs_method string,
     cs_uristem string,
     cs_uriquery string,
     s_port int,
     cs_username string,
     c_ip string,
     cs_useragent string,
     sc_status int,
     sc_substatus int,
     sc_win32status int,
     s_timetaken int
FROM @"/input/u_ex151115.log"
USING Extractors.Text(delimiter:' ', skipFirstNRows: 4);

@statuscount = SELECT COUNT(*) AS TheCount,
           sc_status
           FROM @results
           GROUP BY sc_status;

OUTPUT @statuscount
  TO  @"/output/statuscount_results.tsv"
USING Outputters.Tsv();

As you can see, in the EXTRACT statement, I'm skipping over the IIS log file header using the skipFirstNRows attribute. The problem I'm running into is that many of the log files I have as input contain headers in the middle of the file, presumably because the IIS app pool restarted at some point during the day. When I try to include these files in my query, I get the following error:

Unexpected number of columns in input record at line 14. Expected 14 columns, processed 6 columns out of 6.

The error references a location somewhere in the file where it's encountered the header text.

My question is, using the Text extractor, is there a way to direct it to skip processing a line based on the starting character of the line or something similar? Or, will I need to write a custom extractor to accomplish this?

1

There are 1 answers

0
chris.house.00 On BEST ANSWER

Based on the documentation for the Text extractor, using the slient parameter will cause any lines that do not have the correct number of columns to silently fail, allowing processing to continue on to the next line. Since the IIS log header doesn't have the same number of columns as the log data, setting this attribute to true solved my problem.

So, my revised code looks like:

@results = 
     EXTRACT
     s_date DateTime,
     s_time string,
     s_ip string,
     cs_method string,
     cs_uristem string,
     cs_uriquery string,
     s_port int,
     cs_username string,
     c_ip string,
     cs_useragent string,
     sc_status int,
     sc_substatus int,
     sc_win32status int,
     s_timetaken int
FROM @"/input/u_ex140521.log"
USING Extractors.Text(delimiter:' ', silent: true);
@statuscount = SELECT COUNT(*) AS TheCount,
           sc_status
           FROM @results
           GROUP BY sc_status;

OUTPUT @statuscount
  TO  @"/output/statuscount_results.tsv"
  USING Outputters.Tsv();