Using GLM.jl

Download the notebook, the raw script, or the annotated script for this tutorial (right-click on the link and save). Main author: Clarman Cruz.

This Jupyter lab showcases MLJ, in particular using the popular GLM Julia package. We use two datasets: one created manually for testing purposes, and the CollegeDistance dataset from the AER package in R.

We can quickly define our models in MLJ and study their results. It is very easy and consistent.

using MLJ, CategoricalArrays, PrettyPrinting
import DataFrames: DataFrame, nrow
using UrlDownload
@load LinearRegressor pkg=GLM
@load LinearBinaryClassifier pkg=GLM
LinearBinaryClassifier(
    fit_intercept = true,
    link = GLM.LogitLink()) @403

Reading the data

The CollegeDistance dataset and the manually created dataset are stored in CSV files. Here, we read the input files.

baseurl = "https://raw.githubusercontent.com/tlienart/DataScienceTutorialsData.jl/master/data/glm/"

dfX = DataFrame(urldownload(baseurl * "X3.csv"))
dfYbinary = DataFrame(urldownload(baseurl * "Y3.csv"))
dfX1 = DataFrame(urldownload(baseurl * "X1.csv"))
dfY1 = DataFrame(urldownload(baseurl * "Y1.csv"));

You can have a look at those using first:

first(dfX, 3)
3×12 DataFrame
│ Row │ gender │ ethnicity │ score   │ fcollege │ mcollege │ home   │ urban  │ unemp   │ wage    │ tuition │ income │ region │
│     │ String │ String    │ Float64 │ String   │ String   │ String │ String │ Float64 │ Float64 │ Float64 │ String │ String │
├─────┼────────┼───────────┼─────────┼──────────┼──────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼────────┤
│ 1   │ male   │ other     │ 39.15   │ yes      │ no       │ yes    │ yes    │ 6.2     │ 8.09    │ 0.88915 │ high   │ other  │
│ 2   │ female │ other     │ 48.87   │ no       │ no       │ yes    │ yes    │ 6.2     │ 8.09    │ 0.88915 │ low    │ other  │
│ 3   │ male   │ other     │ 48.74   │ no       │ no       │ yes    │ yes    │ 6.2     │ 8.09    │ 0.88915 │ low    │ other  │

same for Y:

first(dfY1, 3)
3×1 DataFrame
│ Row │ Y         │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ -2.04463  │
│ 2   │ -0.461529 │
│ 3   │ 0.458262  │
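
Before coercing anything, it can help to inspect the scientific types MLJ infers for each column. Below is a minimal sketch (not part of the original script) using the schema function re-exported by MLJ:

# Show the machine type and scientific type of each column of the feature table.
schema(dfX)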

Defining the Linear Model

Let's see how many MLJ models handle our kind of target, the y variable.

ms = models() do m
    AbstractVector{Count} <: m.target_scitype
end
foreach(m -> println(m.name), ms)
EvoTreeCount
LinearCountRegressor
XGBoostCount

How about if the target type is Continuous:

ms = models() do m
    Vector{Continuous} <: m.target_scitype
end
foreach(m -> println(m.name), ms)
ARDRegressor
AdaBoostRegressor
BaggingRegressor
BayesianRidgeRegressor
ConstantRegressor
DecisionTreeRegressor
DeterministicConstantRegressor
DummyRegressor
ElasticNetCVRegressor
ElasticNetRegressor
ElasticNetRegressor
EpsilonSVR
EvoTreeGaussian
EvoTreeRegressor
ExtraTreesRegressor
GaussianProcessRegressor
GradientBoostingRegressor
HuberRegressor
HuberRegressor
KNNRegressor
KNeighborsRegressor
LADRegressor
LGBMRegressor
LarsCVRegressor
LarsRegressor
LassoCVRegressor
LassoLarsCVRegressor
LassoLarsICRegressor
LassoLarsRegressor
LassoRegressor
LassoRegressor
LinearRegressor
LinearRegressor
LinearRegressor
NeuralNetworkRegressor
NuSVR
OrthogonalMatchingPursuitCVRegressor
OrthogonalMatchingPursuitRegressor
PassiveAggressiveRegressor
QuantileRegressor
RANSACRegressor
RandomForestRegressor
RandomForestRegressor
RidgeCVRegressor
RidgeRegressor
RidgeRegressor
RidgeRegressor
RobustRegressor
SGDRegressor
SVMLinearRegressor
SVMNuRegressor
SVMRegressor
TheilSenRegressor
XGBoostRegressor
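
As an alternative, hedged sketch (not in the original tutorial), MLJ can also match models against the data directly, which avoids writing the scitype filter by hand. This assumes the target column of dfY1 is named Y, as printed above:

# List supervised models whose input and target scitypes match the actual data.
ms = models(matching(dfX1, dfY1.Y))
foreach(m -> println(m.name), ms)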

Defining the model in MLJ is quick and consistent. Here we build a pipeline which standardizes the continuous features, one-hot encodes any categorical ones, and fits a linear regressor.

X = copy(dfX1)
y = copy(dfY1)

coerce!(X, autotype(X, :string_to_multiclass))
yv = Vector(y[:, 1])

LinearRegressorPipe = @pipeline(Standardizer(),
                                OneHotEncoder(drop_last = true),
                                LinearRegressor())

LinearModel = machine(LinearRegressorPipe, X, yv)
fit!(LinearModel)
fp = fitted_params(LinearModel)
(standardizer = (mean_and_std_given_feature = Dict(:V1 => (0.0024456300706479973, 1.1309193246154066),:V2 => (-0.015561621122145304, 1.1238897897565245),:V5 => (0.0077036209704558975, 1.1421493464876622),:V3 => (0.02442889884313839, 2.332713568319154),:V4 => (0.15168404285157286, 6.806065861835239)),),
 one_hot_encoder = (fitresult = OneHotEncoderResult @631,),
 linear_regressor = (coef = [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225],
                     intercept = 0.015893883995789806,),
 machines = Machine[Machine{Standardizer} @577, Machine{OneHotEncoder} @580, Machine{LinearRegressor} @579],
 fitted_params_given_machine = OrderedCollections.LittleDict{Any,Any,Array{Any,1},Array{Any,1}}(Machine{Standardizer} @577 => (mean_and_std_given_feature = Dict(:V1 => (0.0024456300706479973, 1.1309193246154066),:V2 => (-0.015561621122145304, 1.1238897897565245),:V5 => (0.0077036209704558975, 1.1421493464876622),:V3 => (0.02442889884313839, 2.332713568319154),:V4 => (0.15168404285157286, 6.806065861835239)),),Machine{OneHotEncoder} @580 => (fitresult = OneHotEncoderResult @631,),Machine{LinearRegressor} @579 => (coef = [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225], intercept = 0.015893883995789806)),)

Reading the Output of Fitting the Linear Model

We can quickly read the results of our models in MLJ. Below we extract the coefficients, predictions and residuals; we then compute the accuracy (RMS error) of the linear model.

ŷ = MLJ.predict(LinearModel, X)
yhatResponse = [ŷ[i,1].μ for i in 1:nrow(y)]
residuals = y .- yhatResponse
r = report(LinearModel)

k = collect(keys(fp.fitted_params_given_machine))[3]
println("\n Coefficients:  ", fp.fitted_params_given_machine[k].coef)
println("\n y \n ", y[1:5,1])
println("\n ŷ \n ", ŷ[1:5])
println("\n yhatResponse \n ", yhatResponse[1:5])
println("\n Residuals \n ", y[1:5,1] .- yhatResponse[1:5])
println("\n Standard Error per Coefficient \n", r.report_given_machine[k].stderror)

 Coefficients:  [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225]

 y 
 [-2.0446341129015, -0.461528671336098, 0.458261960749596, 2.2746223981481, -0.969887403007307]

 ŷ 
 Distributions.Normal{Float64}[Distributions.Normal{Float64}(μ=-1.6915415373374758, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=1.412005563203644, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.47362968068623923, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.7266938985590492, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=-1.8396459459760566, σ=0.9580569656804974)]

 yhatResponse 
 [-1.6915415373374758, 1.412005563203644, 0.47362968068623923, 0.7266938985590492, -1.8396459459760566]

 Residuals 
 [-0.3530925755640242, -1.8735342345397419, -0.01536771993664321, 1.547928499589051, 0.8697585429687495]

 Standard Error per Coefficient 
[0.01587640310780568, 0.015862782503144917, 0.01515900587321476, 0.01515667698600387, 0.016546721612329368, 0.015148210698700702]

and get the accuracy (root mean squared error):

round(rms(yhatResponse, y[:,1]), sigdigits=4)
0.9573
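
The RMS above is computed in-sample. As a hedged sketch (not part of the original tutorial), the same pipeline could also be evaluated out-of-sample with MLJ's evaluate! and cross-validation:

# 6-fold cross-validation of the wrapped pipeline, reporting RMS and MAE.
evaluate!(LinearModel, resampling = CV(nfolds = 6, shuffle = true), measure = [rms, mae])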

Defining the Logistic Model

X = copy(dfX)
y = copy(dfYbinary)

coerce!(X, autotype(X, :string_to_multiclass))
yc = CategoricalArray(y[:, 1])
yc = coerce(yc, OrderedFactor)

LinearBinaryClassifierPipe = @pipeline(Standardizer(),
                                       OneHotEncoder(drop_last = true),
                                       LinearBinaryClassifier())

LogisticModel = machine(LinearBinaryClassifierPipe, X, yc)
fit!(LogisticModel)
fp = fitted_params(LogisticModel)
(standardizer = (mean_and_std_given_feature = Dict(:wage => (9.500506478338009, 1.3430670761078416),:unemp => (7.597214581091511, 2.763580873344848),:tuition => (0.8146082493518824, 0.33950381985971717),:score => (50.88902933684601, 8.701909614072397)),),
 one_hot_encoder = (fitresult = OneHotEncoderResult @199,),
 linear_binary_classifier = Any[],
 machines = Machine[Machine{Standardizer} @204, Machine{OneHotEncoder} @981, Machine{LinearBinaryClassifier{LogitLink}} @865],
 fitted_params_given_machine = OrderedCollections.LittleDict{Any,Any,Array{Any,1},Array{Any,1}}(Machine{Standardizer} @204 => (mean_and_std_given_feature = Dict(:wage => (9.500506478338009, 1.3430670761078416),:unemp => (7.597214581091511, 2.763580873344848),:tuition => (0.8146082493518824, 0.33950381985971717),:score => (50.88902933684601, 8.701909614072397)),),Machine{OneHotEncoder} @981 => (fitresult = OneHotEncoderResult @199,),Machine{LinearBinaryClassifier{LogitLink}} @865 => (coef = [0.20250729378868743, 0.130752939109129, 0.344951624939835, 0.9977565847160846, -0.502231510298459, -0.4785005626021652, -0.20440507809955, -0.06922751403500076, 0.05892864973017095, -0.0834474982820323, -0.002315143333859721, 0.4617765395578658, 0.38432629581007743], intercept = -1.0766338905793655)),)

Reading the Output from the Prediction of the Logistic Model

The output of the MLJ model contains essentially the same information as the R version of the model.

ŷ = MLJ.predict(LogisticModel, X)
residuals = [1 - pdf(ŷ[i], y[i,1]) for i in 1:nrow(y)]
r = report(LogisticModel)

k = collect(keys(fp.fitted_params_given_machine))[3]
println("\n Coefficients:  ", fp.fitted_params_given_machine[k].coef)
println("\n y \n ", y[1:5,1])
println("\n ŷ \n ", ŷ[1:5])
println("\n residuals \n ", residuals[1:5])
println("\n Standard Error per Coefficient \n", r.report_given_machine[k].stderror)

 Coefficients:  [0.20250729378868743, 0.130752939109129, 0.344951624939835, 0.9977565847160846, -0.502231510298459, -0.4785005626021652, -0.20440507809955, -0.06922751403500076, 0.05892864973017095, -0.0834474982820323, -0.002315143333859721, 0.4617765395578658, 0.38432629581007743]

 y 
 [0, 0, 0, 0, 0]

 ŷ 
 UnivariateFinite{OrderedFactor{2},Int64,UInt32,Float64}[UnivariateFinite{OrderedFactor{2}}(0=>0.881, 1=>0.119), UnivariateFinite{OrderedFactor{2}}(0=>0.838, 1=>0.162), UnivariateFinite{OrderedFactor{2}}(0=>0.866, 1=>0.134), UnivariateFinite{OrderedFactor{2}}(0=>0.936, 1=>0.0637), UnivariateFinite{OrderedFactor{2}}(0=>0.944, 1=>0.056)]

 residuals 
 [0.119446033467422, 0.16182691493524637, 0.13445730373831222, 0.06370799769022906, 0.05604680411361729]

 Standard Error per Coefficient 
[0.07542967234030673, 0.12260004202741961, 0.10934317995152518, 0.04661437250372939, 0.09609243724815358, 0.10743620672240184, 0.10642223545563925, 0.09190778860389334, 0.03922724536508866, 0.04118915117919154, 0.05115399636339275, 0.08454431256127862, 0.12281455657940024, 0.17884724866298302]

No logistic analysis is complete without the confusion matrix:

yMode = [mode(ŷ[i]) for i in 1:length(ŷ)]
y = coerce(y[:,1], OrderedFactor)
yMode = coerce(yMode, OrderedFactor)
confusion_matrix(yMode, y)
              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │      0      │      1      │
├─────────────┼─────────────┼─────────────┤
│      0      │    3283     │     831     │
├─────────────┼─────────────┼─────────────┤
│      1      │     236     │     389     │
└─────────────┴─────────────┴─────────────┘
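
As a final hedged sketch (not part of the original analysis), standard classification measures can be computed from the same predictions; this assumes the ŷ, yMode and y defined just above:

# Share of correct and incorrect hard predictions, and the area under the ROC curve
# for the probabilistic predictions.
accuracy(yMode, y)
misclassification_rate(yMode, y)
auc(ŷ, y)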