Data types in Julia and MLJ


I am new to Julia and trying to fit a simple classification tree.

Package imports and environment activation:

using Pkg 
Pkg.activate(".")

using CSV
using DataFrames
using Random
using Downloads
using ARFFFiles
using ScientificTypes
using DataFramesMeta
using DynamicPipe
using MLJ
using MLJDecisionTreeInterface

Data:

titanic_reader  = CSV.File("/home/andrea/dev/julia/titanic.csv"; header = 1);
titanic = DataFrame(titanic_reader);

# remove missing values
titanic =  dropmissing(titanic);


titanic = @transform(titanic, 
    :class=categorical(:class), 
    :sex=categorical(:sex),  
    :survived=categorical(:survived)
    );

Check the data:

first(titanic, 3)

3×4 DataFrame
 Row │ class  sex     age      survived 
     │ Cat…   Cat…    Float64  Cat…     
─────┼──────────────────────────────────
   1 │ 3      male       22.0  N
   2 │ 1      female     38.0  Y
   3 │ 3      female     26.0  Y

Check the data schema:

schema(titanic)


┌──────────┬───────────────┬───────────────────────────────────┐
│ names    │ scitypes      │ types                             │
├──────────┼───────────────┼───────────────────────────────────┤
│ class    │ Multiclass{3} │ CategoricalValue{Int64, UInt32}   │
│ sex      │ Multiclass{2} │ CategoricalValue{String7, UInt32} │
│ age      │ Continuous    │ Float64                           │
│ survived │ Multiclass{2} │ CategoricalValue{String1, UInt32} │
└──────────┴───────────────┴───────────────────────────────────┘

The schema looks fine to me.

Prepare data for modelling:

# target and features
y, X = unpack(titanic, ==(:survived), rng = 123);

# partition into training & test sets
(X_trn, X_tst), (y_trn, y_tst)  = partition((X, y), 0.75, multi=true,  rng=123);

Fit the model:

# model
mod = @load DecisionTreeClassifier pkg = "DecisionTree" ;
fm = mod() ;
fm_mach = machine(fm, X_trn, y_trn);

and here is the problem:

Warning: The number and/or types of data arguments do not match what the specified model
│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc DecisionTree.DecisionTreeClassifier` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{3}}, AbstractVector{Multiclass{2}}}}, AbstractVector{Multiclass{2}}}
│ 
│ fit_data_scitype(model) = Tuple{Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Count}, AbstractVector{<:OrderedFactor}}}, AbstractVector{<:Finite}}
└ @ MLJBase ~/.julia/packages/MLJBase/eCnWm/src/machines.jl:231
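The warning spells out the exact check that fails: `scitype(data)` must be a subtype of `fit_data_scitype(model)`, but the unordered `Multiclass` feature columns are not among the accepted `Continuous`, `Count`, or `OrderedFactor` scitypes. A minimal sketch to reproduce the check yourself (assuming `X_trn`, `y_trn`, and `fm` from above are in scope):

```julia
# What the data provides vs. what the model accepts
scitype((X_trn, y_trn))        # shows the unordered Multiclass columns
MLJ.fit_data_scitype(fm)       # the model's accepted input/target scitypes

# The subtype test the machine constructor performs:
scitype((X_trn, y_trn)) <: MLJ.fit_data_scitype(fm)   # false, so the warning fires
```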

And indeed, when fitting the model:

fit!(fm_mach)

I get an error:

[ Info: It seems an upstream node in a learning network is providing data of incompatible scitype. See above. 
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this
Stacktrace:

I am almost certain the error stems from the data type specification, but I cannot work out the solution.

Answer by Antonello:

I can replicate your issue by loading the Titanic dataset through MLJ's OpenML module:

using MLJ
import DataFrames as DF
import DataFramesMeta as DFM

table   = OpenML.load(42638)
titanic = DF.DataFrame(table)   # convert the returned table to a DataFrame

and then cleaning it a bit to get exactly the same dataset you are using:

titanic =  DF.dropmissing(titanic);
DF.rename!(titanic, "pclass"=>"class")
titanic = titanic[:,[:class,:sex,:survived,:age]] # select only the fields you are using

titanic = DFM.@transform(titanic, 
    :class=categorical(:class), 
    :sex=categorical(:sex),  
    :survived=categorical(:survived)
    );

Now the issue is that the DecisionTreeClassifier model from the DecisionTree package is very efficient (fast!), but it only accepts ordered features: its input scitypes are Continuous, Count, or OrderedFactor, while your class and sex columns are unordered Multiclass.
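One way out is to coerce the offending columns so their scitypes match the model's requirements. A sketch, assuming the `X_trn`/`X_tst`/`y_trn` from the question are in scope (note that `sex` is also unordered Multiclass, so it needs the same treatment as `class`):

```julia
# Coerce the unordered Multiclass features to OrderedFactor so they satisfy
# the DecisionTree model's fit_data_scitype
X_trn = coerce(X_trn, :class => OrderedFactor, :sex => OrderedFactor)
X_tst = coerce(X_tst, :class => OrderedFactor, :sex => OrderedFactor)

Tree    = @load DecisionTreeClassifier pkg = "DecisionTree" verbosity = 0
fm_mach = machine(Tree(), X_trn, y_trn)
fit!(fm_mach)
```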

In this case you could coerce class (and sex) to OrderedFactor. An alternative is the DecisionTreeClassifier model from BetaML which, at the cost of being a bit slower, accepts any kind of input, including missing values (so there is no need to drop them or keep only those few fields; the original Titanic dataset has many more):

mod = @load DecisionTreeClassifier pkg = "BetaML" ;
fm = mod() ;
fm_mach = machine(fm, X_trn, y_trn);
fit!(fm_mach)

yhat_trn = mode.(predict(fm_mach , X_trn))
accuracy(y_trn,yhat_trn) # 0.91

yhat_tst = mode.(predict(fm_mach , X_tst))
accuracy(y_tst,yhat_tst) # 0.78
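Instead of taking `mode.` of the predictions and computing accuracy by hand, MLJ's `evaluate` can handle the holdout split and the scoring in one call. A sketch, assuming the model `fm` and the full `X`, `y` from the question are in scope (`operation = predict_mode` collapses the probabilistic predictions before scoring):

```julia
# Let MLJ handle resampling and scoring in one call
evaluate(fm, X, y;
         resampling = Holdout(fraction_train = 0.75, rng = 123),
         measure    = accuracy,
         operation  = predict_mode)
```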

Note that there is a nice tutorial on fitting exactly this Titanic dataset with decision trees and MLJ here: https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8.