Hive with regexserde property doesn't work properly

47 views Asked by At

I used regex101 website to validate my regex:

([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)" "(.*?)" "(.*?)"

It works fine for the log below

66.240.70.141 - - [01/Mar/2018:06:16:46 +0000] "GET /example.download.handler.com/products/01/00/item/116314/8/002394857_2BB.jpg HTTP/1.1" 200 41710 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB30P) AppleWebKit/536.37 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-" "C0T1_19610|3881001|"

But the same expression doesn't work on hive:

CREATE EXTERNAL TABLE `web_logs_test`(   
`ip_address`  string COMMENT '',   
`date_string` string COMMENT '',   
`request`     string COMMENT '', 
`status`      string COMMENT '',   
`bytes`       string COMMENT '',   
`referer`     string COMMENT '',   
`user_agent`  string COMMENT '',   
`cookie`      string COMMENT ''
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (  
'input.regex'='([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)" "(.*?)" "(.*?)"'
)
STORED AS 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/weblogs/data'

If anyone knows, kindly help me out.

Thanks in advance.

1

There are 1 answers

1
user3698991 On
CREATE EXTERNAL TABLE web_logs (
  ip_address STRING,
  date_string STRING,
  request STRING,
  status STRING,
  bytes STRING,
  referer STRING,
  user_agent STRING,
  cookie STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
   "input.regex" = "^([\\d.]+) \\S+ \\S+ \\[(.+?)\\] \\\"(.+?)\\\" (\\d{3}) (\\d+) \\\"(.+?)\\\" \\\"(.+?)\\\" \\\"SESSIONID=(\\d+)\\\"\\s*"
)
LOCATION '/file_location/web_logs';