I've been trying to use Azure Data Lake Analytics to do some analysis over a large group of IIS log files. So far I can get this to work for a single, best-case file using something like this:
@results =
    EXTRACT s_date DateTime,
            s_time string,
            s_ip string,
            cs_method string,
            cs_uristem string,
            cs_uriquery string,
            s_port int,
            cs_username string,
            c_ip string,
            cs_useragent string,
            sc_status int,
            sc_substatus int,
            sc_win32status int,
            s_timetaken int
    FROM @"/input/u_ex151115.log"
    USING Extractors.Text(delimiter: ' ', skipFirstNRows: 4);
@statuscount = SELECT COUNT(*) AS TheCount,
sc_status
FROM @results
GROUP BY sc_status;
OUTPUT @statuscount
TO @"/output/statuscount_results.tsv"
USING Outputters.Tsv();
As you can see, in the EXTRACT statement I'm skipping over the IIS log file header using the skipFirstNRows attribute. The problem I'm running into is that many of the log files I have as input contain headers in the middle of the file, presumably because the IIS app pool restarted at some point during the day. When I try to include these files in my query, I get the following error:
Unexpected number of columns in input record at line 14. Expected 14 columns, processed 6 columns out of 6.
The line number the error references is a point in the file where the extractor has run into one of these mid-file headers.
My question is, using the Text extractor, is there a way to direct it to skip processing a line based on the starting character of the line or something similar? Or, will I need to write a custom extractor to accomplish this?
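For reference, my understanding is that a custom extractor would be a C# class in the script's code-behind that implements IExtractor, roughly like the sketch below. The class name and the column handling are my own illustration, not tested code; a real version would need to populate every column from the EXTRACT schema:

using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

// Rough sketch: drop IIS header lines (they all start with '#')
// wherever they occur in the file, not just at the top.
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class IisLogExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Skip header lines no matter where they appear.
                if (line.StartsWith("#")) continue;

                var fields = line.Split(' ');
                if (fields.Length != 14) continue; // guard against partial lines

                // Set a couple of columns as an example; the remaining
                // columns from the EXTRACT schema follow the same pattern.
                output.Set<DateTime>("s_date", DateTime.Parse(fields[0]));
                output.Set<int>("sc_status", int.Parse(fields[10]));
                yield return output.AsReadOnly();
            }
        }
    }
}

The script would then reference it with USING new IisLogExtractor(); in place of Extractors.Text(...).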
Based on the documentation for the Text extractor, setting the silent parameter to true causes any line that doesn't have the expected number of columns to be silently skipped, allowing processing to continue with the next line. Since the IIS log header lines don't have the same number of columns as the log data, setting this attribute to true solved my problem. So, my revised code looks like:
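@results =
    EXTRACT s_date DateTime,
            s_time string,
            s_ip string,
            cs_method string,
            cs_uristem string,
            cs_uriquery string,
            s_port int,
            cs_username string,
            c_ip string,
            cs_useragent string,
            sc_status int,
            sc_substatus int,
            sc_win32status int,
            s_timetaken int
    FROM @"/input/u_ex151115.log"
    USING Extractors.Text(delimiter: ' ', skipFirstNRows: 4, silent: true);
The SELECT and OUTPUT statements are unchanged.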