Insert data in hive using multidelimeter

306 views Asked by At

how to insert data in hive using multidelimeter and between the column the delimiter is not specified.

Below is my data :

25380 20130101  2.514 -135.69   58.43     8.3     1.1     4.7     4.9     5.6     0.01 C     1.0    -0.1     0.4    97.3    36.0    69.4 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0
25380 20130102  2.514 -135.69   58.43     3.0    -0.3     1.4     1.2     0.0     0.35 C     1.3    -1.0    -0.1   100.0    89.5    98.2 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0

I want to insert only first 7 columns in my hive table and I have a txt file for above data.

create table script :

CREATE TABLE hotcold (a int,b int,c float,d float,e float,f float,g float,h string,i string,j string,k string,l string,m string,n string,o string,p string,q string,
r string,s string,t string,u string,v string,w string,x string,y string,z string,aa string,bb string,cc string,dd string,ee string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='\\s+'); 

My data insert script is below :

LOAD DATA LOCAL INPATH '/home/cloudera/WeatherData.txt' into table hotcold;

Below is my select statement and error:

select * from hotcold;

Error :

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns
1

There are 1 answers

0
leftjoin On BEST ANSWER

Each column should have corresponding capturing group () in the regexp.

In the example below first group in the beginning of the string ^(\\d+)-corresponds positive integer,

\\s+ - one or more whitespace delimiters,

second capturing group corresponds positive integer (\\d+),

again \\s+ - one or more whitespace delimiters,

third capturing group - ([+-]?[0-9.]+) - a float, not very strict format, allowing +- sign and any combination of didits and dots

and .* at the end allowing any characters at the end of the string, not captured, add all other columns to the regex, my example contains regex for three columns (three capturing groups):

WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\s+(\\d+)\\s+([+-]?[0-9.]+).*') 

Read this for more details: Using Regular Expressions to Extract Fields for Hive Tables