# Horse colic data

*Download the*

*notebook*,

*the*

*raw script*,

*or the*

*annotated script*

*for this tutorial (right-click on the link and save).*

## Initial data processing

In this example, we consider the UCI "horse colic" dataset

This is a reasonably messy classification problem with missing values etc and so some work should be expected in the feature processing.

### Getting the data

The data is pre-split in training and testing and we will keep it as such

```
using MLJ
using HTTP
using CSV
import DataFrames: DataFrame, select!, Not
req1 = HTTP.get("http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data")
req2 = HTTP.get("http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.test")
header = ["surgery", "age", "hospital_number",
"rectal_temperature", "pulse",
"respiratory_rate", "temperature_extremities",
"peripheral_pulse", "mucous_membranes",
"capillary_refill_time", "pain",
"peristalsis", "abdominal_distension",
"nasogastric_tube", "nasogastric_reflux",
"nasogastric_reflux_ph", "feces", "abdomen",
"packed_cell_volume", "total_protein",
"abdomcentesis_appearance", "abdomcentesis_total_protein",
"outcome", "surgical_lesion", "lesion_1", "lesion_2", "lesion_3",
"cp_data"]
csv_opts = (header=header, delim=' ', missingstring="?",
ignorerepeated=true)
data_train = CSV.read(req1.body; csv_opts...)
data_test = CSV.read(req2.body; csv_opts...)
@show size(data_train)
@show size(data_test)
```

```
size(data_train) = (300, 28)
size(data_test) = (68, 28)
```

### Inspecting columns

To simplify the analysis, we will drop the columns `Lesion *`

as they would need specific re-encoding which would distract us a bit.

```
unwanted = [:lesion_1, :lesion_2, :lesion_3]
data = vcat(data_train, data_test)
select!(data, Not(unwanted));
```

Let's also keep track of the initial train-test split

```
train = 1:nrows(data_train)
test = last(train) .+ (1:nrows(data_test));
```

We know from reading the description that some of these features represent multiclass data; to facilitate the interpretation, we can use `autotype`

from `ScientificTypes`

. By default, `autotype`

will check all columns and suggest a Finite type assuming there are relatively few distinct values in the column. More sophisticated rules can be passed, see ScientificTypes.jl:

`datac = coerce(data, autotype(data));`

Let's see column by column whether it looks ok now

```
sch = schema(datac)
for (name, scitype) in zip(sch.names, sch.scitypes)
println(rpad("$name", 30), scitype)
end
```

```
surgery Union{Missing, OrderedFactor{2}}
age OrderedFactor{2}
hospital_number Count
rectal_temperature Union{Missing, Continuous}
pulse Union{Missing, Count}
respiratory_rate Union{Missing, Count}
temperature_extremities Union{Missing, OrderedFactor{4}}
peripheral_pulse Union{Missing, OrderedFactor{4}}
mucous_membranes Union{Missing, OrderedFactor{6}}
capillary_refill_time Union{Missing, OrderedFactor{3}}
pain Union{Missing, OrderedFactor{5}}
peristalsis Union{Missing, OrderedFactor{4}}
abdominal_distension Union{Missing, OrderedFactor{4}}
nasogastric_tube Union{Missing, OrderedFactor{3}}
nasogastric_reflux Union{Missing, OrderedFactor{3}}
nasogastric_reflux_ph Union{Missing, OrderedFactor{24}}
feces Union{Missing, OrderedFactor{4}}
abdomen Union{Missing, OrderedFactor{5}}
packed_cell_volume Union{Missing, Continuous}
total_protein Union{Missing, Continuous}
abdomcentesis_appearance Union{Missing, OrderedFactor{3}}
abdomcentesis_total_protein Union{Missing, Continuous}
outcome Union{Missing, OrderedFactor{3}}
surgical_lesion OrderedFactor{2}
cp_data OrderedFactor{2}
```

Most columns are now treated as either Multiclass or Ordered, this corresponds to the description of the data. For instance:

`Surgery`

is described as`1=yes / 2=no`

`Age`

is described as`1=adult / 2=young`

Inspecting the rest of the descriptions and the current scientific type, there are a few more things that can be observed:

hospital number is still a count, this means that there are relatively many hospitals and so that's probably not very useful,

pulse and respiratory rate are still as count but the data description suggests that they can be considered as continuous

`length(unique(datac.hospital_number))`

`346`

yeah let's drop that

`datac = select!(datac, Not(:hospital_number));`

let's also coerce the pulse and respiratory rate, in fact we can do that with `autotype`

specifying as rule the `discrete_to_continuous`

`datac = coerce(datac, autotype(datac, rules=(:discrete_to_continuous,)));`

### Dealing with missing values

There's quite a lot of missing values, in this tutorial we'll be a bit rough in how we deal with them applying the following rules of thumb:

drop the rows where the outcome is unknown

drop columns with more than 20% missing values

simple imputation of whatever's left

```
missing_outcome = ismissing.(datac.outcome)
idx_missing_outcome = missing_outcome |> findall
```

```
2-element Array{Int64,1}:
133
309
```

Ok there's only two row which is nice, let's remove them from the train/test indices and drop the rows

```
train = setdiff!(train |> collect, idx_missing_outcome)
test = setdiff!(test |> collect, idx_missing_outcome)
datac = datac[.!missing_outcome, :];
```

Now let's look at how many missings there are per features

```
for name in names(datac)
col = datac[:, name]
ratio_missing = sum(ismissing.(col)) / nrows(datac) * 100
println(rpad(name, 30), round(ratio_missing, sigdigits=3))
end
```

```
surgery 0.0
age 0.0
rectal_temperature 18.9
pulse 7.1
respiratory_rate 19.4
temperature_extremities 17.5
peripheral_pulse 22.7
mucous_membranes 13.1
capillary_refill_time 10.4
pain 17.2
peristalsis 13.9
abdominal_distension 17.5
nasogastric_tube 35.5
nasogastric_reflux 36.1
nasogastric_reflux_ph 81.1
feces 34.7
abdomen 39.1
packed_cell_volume 9.84
total_protein 11.5
abdomcentesis_appearance 52.7
abdomcentesis_total_protein 63.9
outcome 0.0
surgical_lesion 0.0
cp_data 0.0
```

Let's drop the ones with more than 20% (quite a few!)

```
unwanted = [:peripheral_pulse, :nasogastric_tube, :nasogastric_reflux,
:nasogastric_reflux_ph, :feces, :abdomen, :abdomcentesis_appearance, :abdomcentesis_total_protein]
select!(datac, Not(unwanted));
```

Note that we could have done this better and investigated the nature of the features for which there's a lot of missing values but don't forget that our goal is to showcase MLJ!

Let's conclude by filling all missing values and separating the feature matrix from the target

```
@load FillImputer
filler = machine(FillImputer(), datac)
fit!(filler)
datac = transform(filler, datac)
y, X = unpack(datac, ==(:outcome), name->true);
X = coerce(X, autotype(X, :discrete_to_continuous));
```

## A baseline model

Let's define a first sensible model and get a baseline, basic steps are:

one-hot-encode the categoricals

feed all this into a classifier

```
@load OneHotEncoder
@load MultinomialClassifier pkg="MLJLinearModels"
```

```
MultinomialClassifier(
lambda = 1.0,
gamma = 0.0,
penalty = :l2,
fit_intercept = true,
penalize_intercept = false,
solver = nothing) @923
```

Let's have convenient handles over the training data

```
Xtrain = X[train,:]
ytrain = y[train];
```

And let's define a pipeline corresponding to the operations above

```
SimplePipe = @pipeline(OneHotEncoder(),
MultinomialClassifier(), prediction_type=:probabilistic)
mach = machine(SimplePipe, Xtrain, ytrain)
res = evaluate!(mach; resampling=Holdout(fraction_train=0.9),
measure=cross_entropy)
round(res.measurement[1], sigdigits=3)
```

`0.704`

This is the cross entropy on some held-out 10% of the training set. We can also just for the sake of getting a baseline, see the misclassification on the whole training data:

```
mcr = misclassification_rate(predict_mode(mach, Xtrain), ytrain)
println(rpad("MNC mcr:", 10), round(mcr, sigdigits=3))
```

```
MNC mcr: 0.234
```

That's not bad at all actually. Let's tune it a bit and see if we can get a bit better than that, not much point in going crazy, we might get a few percents but not much more.

```
model = SimplePipe
lambdas = range(model, :(multinomial_classifier.lambda), lower=1e-3, upper=100, scale=:log10)
tm = TunedModel(model=SimplePipe, ranges=lambdas, measure=cross_entropy)
mtm = machine(tm, Xtrain, ytrain)
fit!(mtm)
best_pipe = fitted_params(mtm).best_model
```

```
Pipeline1889(
one_hot_encoder = OneHotEncoder(
features = Symbol[],
drop_last = false,
ordered_factor = true,
ignore = false),
multinomial_classifier = MultinomialClassifier(
lambda = 27.825594022071243,
gamma = 0.0,
penalty = :l2,
fit_intercept = true,
penalize_intercept = false,
solver = nothing)) @801
```

So it looks like it's useful to regularise a fair bit to get a lower cross entropy

```
ŷ = predict(mtm, Xtrain)
cross_entropy(ŷ, ytrain) |> mean
```

`0.6129047398190444`

Interestingly this does not improve our missclassification rate

```
mcr = misclassification_rate(mode.(ŷ), ytrain)
println(rpad("MNC mcr:", 10), round(mcr, sigdigits=3))
```

```
MNC mcr: 0.278
```

We've probably reached the limit of a simple linear model.

## Trying another model

There are lots of categoricals, so maybe it's just better to use something that deals well with that like a tree-based classifier.

```
@load XGBoostClassifier
dtc = machine(XGBoostClassifier(), Xtrain, ytrain)
fit!(dtc)
ŷ = predict(dtc, Xtrain)
cross_entropy(ŷ, ytrain) |> mean
```

`0.024815097f0`

So we get a worse cross entropy but...

`misclassification_rate(mode.(ŷ), ytrain)`

`0.0033444816053511705`

a significantly better misclassification rate.

We could investigate more, do more tuning etc, but the key points of this tutorial was to show how to handle data with missing values.