How to use non-numeric independent variables when training a linear regression model with MADlib on PostgreSQL?

My table contains a character field and two numeric fields:

CREATE TABLE lr_source (
    Char01      varchar(250),
    PLNumeric01 numeric,
    PLNumeric02 numeric
);

I want to train the linear regression model with Char01 and PLNumeric01 as independent variables and PLNumeric02 as the dependent variable.

SELECT madlib.linregr_train( 'lr_source',    --source table
                             'lr_model',--model table
                             'PLNumeric02',  --dependent variable
                             'ARRAY[PLNumeric01, Char01 ]' --independent variables
                           );

When I run the query above, it fails with the following error:

ERROR:  spiexceptions.DatatypeMismatch: ARRAY types numeric and character varying cannot be matched

How can I use non-numeric fields as independent variables?

There is 1 answer

Answer by Frank McQuillan (best answer)

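The error occurs because every element of an ARRAY must resolve to one common type, and a varchar column cannot be combined with numeric ones. I would suggest encoding your categorical variables as described at http://madlib.apache.org/docs/master/group__grp__encode__categorical.html. That makes them numeric, and you can then pass them to the linear regression.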
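As a rough sketch of that suggestion, the call for the table in the question might look like the following. It assumes the encode_categorical_variables function described on the linked page is available in your MADlib version. The output table name lr_source_encoded is made up for illustration, and the names of the generated indicator columns depend on the values actually stored in Char01, so inspect the output before using it. This has not been run against a real installation.

```sql
-- One-hot encode Char01 into 0/1 indicator columns.
-- PostgreSQL folds unquoted identifiers to lowercase, so the column is
-- stored as char01.
SELECT madlib.encode_categorical_variables(
    'lr_source',          -- source table
    'lr_source_encoded',  -- output table (hypothetical name)
    'char01'              -- categorical column(s) to encode
);

-- Check which indicator columns were generated before building the model.
SELECT * FROM lr_source_encoded LIMIT 5;
```

The indicator columns can then be listed in the independent-variable array alongside PLNumeric01. When the model also has an intercept, one level of each categorical variable is usually left out as a reference level; the linked documentation describes the options for that.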

Also, you will likely want to add an explicit intercept term, as in the examples in the user documentation:

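SELECT madlib.linregr_train( 'houses',                    -- source table
                             'houses_linregr_bedroom',    -- model table
                             'price',                     -- dependent variable
                             'ARRAY[1, tax, bath, size]', -- independent variables; the leading 1 is the intercept
                             'bedroom'                    -- grouping column
                           );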
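Applied to the table in the question, the intercept and the encoding idea combine roughly as below. This is an untested sketch that assumes, purely for illustration, that Char01 takes the three values 'A', 'B' and 'C'. It builds the indicators inline with plain SQL CASE expressions instead of reading them from a pre-encoded table, and it leaves 'C' out as the reference level so the indicators are not redundant with the intercept.

```sql
-- Array elements, in order: the explicit intercept (1), the numeric
-- predictor, then one 0/1 indicator per non-reference value of Char01.
-- 'A' and 'B' are hypothetical values; replace them with the real ones.
-- Dollar quoting ($$...$$) avoids escaping the single quotes inside.
SELECT madlib.linregr_train(
    'lr_source',      -- source table
    'lr_model',       -- model table
    'PLNumeric02',    -- dependent variable
    $$ARRAY[1, PLNumeric01, CASE WHEN Char01 = 'A' THEN 1 ELSE 0 END, CASE WHEN Char01 = 'B' THEN 1 ELSE 0 END]$$
);
```

Every element of the array is now numeric, so the type mismatch from the original query should no longer arise. With many distinct values in Char01, generating the indicators with the encoding function is likely more practical than writing the expressions by hand.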