Airfoil

Download the notebook, the raw script, or the annotated script for this tutorial (right-click on the link and save). Main author: Ashrya Agrawal.

Getting started

Here we use the UCI "Airfoil Self-Noise" dataset, a NASA dataset of scaled sound pressure levels measured for airfoil blade sections in an anechoic wind tunnel.

Loading and preparing the data

using MLJ
using PrettyPrinting
import DataFrames
import Statistics
using CSV
using PyPlot
using HTTP
using StableRNGs


req = HTTP.get("https://raw.githubusercontent.com/rupakc/UCI-Data-Analysis/master/Airfoil%20Dataset/airfoil_self_noise.dat");

df = CSV.read(req.body, DataFrames.DataFrame; header=[
                  "Frequency","Attack_Angle","Chord_Length",
                  "Free_Velocity","Suction_Side","Scaled_Sound"
              ]
       );
df[1:5, :] |> pretty
┌───────────┬──────────────┬──────────────┬───────────────┬──────────────┬──────────────┐
│ Frequency │ Attack_Angle │ Chord_Length │ Free_Velocity │ Suction_Side │ Scaled_Sound │
│ Int64     │ Float64      │ Float64      │ Float64       │ Float64      │ Float64      │
│ Count     │ Continuous   │ Continuous   │ Continuous    │ Continuous   │ Continuous   │
├───────────┼──────────────┼──────────────┼───────────────┼──────────────┼──────────────┤
│ 800       │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 126.201      │
│ 1000      │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 125.201      │
│ 1250      │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 125.951      │
│ 1600      │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 127.591      │
│ 2000      │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 127.461      │
└───────────┴──────────────┴──────────────┴───────────────┴──────────────┴──────────────┘
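
Before modelling, an optional sanity check (a sketch, not part of the original tutorial, using the DataFrames import above) confirms the expected shape and gives per-column summary statistics:

size(df)                 # expect (1503, 6)
DataFrames.describe(df)  # per-column mean, min, median, max, etc.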

Inspect the schema:

schema(df)
┌───────────────┬─────────┬────────────┐
│ _.names       │ _.types │ _.scitypes │
├───────────────┼─────────┼────────────┤
│ Frequency     │ Int64   │ Count      │
│ Attack_Angle  │ Float64 │ Continuous │
│ Chord_Length  │ Float64 │ Continuous │
│ Free_Velocity │ Float64 │ Continuous │
│ Suction_Side  │ Float64 │ Continuous │
│ Scaled_Sound  │ Float64 │ Continuous │
└───────────────┴─────────┴────────────┘
_.nrows = 1503

Unpack into the target y and the features X:

y, X = unpack(df, ==(:Scaled_Sound), col -> true);
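
As a quick check (a sketch, not in the original tutorial), we can confirm that the target carries the Continuous scitype that regression models expect, while X still contains one Count column:

scitype(y)   # AbstractVector{Continuous}
scitype(X)   # a Table scitype: one Count column (Frequency), four Continuous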

Now we standardize the features using the Standardizer transformer:

X = transform(fit!(machine(Standardizer(), X)), X);
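
Note that Standardizer only rescales Continuous features, so the Count column Frequency is left untouched at this stage. As an optional check (a sketch using the Statistics import above), each Continuous feature should now have mean ≈ 0 and standard deviation ≈ 1:

for col in [:Attack_Angle, :Chord_Length, :Free_Velocity, :Suction_Side]
    println(col, ": mean ≈ ", round(Statistics.mean(X[!, col]), digits=3),
            ", std ≈ ", round(Statistics.std(X[!, col]), digits=3))
end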

Partition the indices into train and test sets:

train, test = partition(eachindex(y), 0.7, shuffle=true, rng=StableRNG(612));
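
The StableRNG seed makes the shuffle reproducible. As a quick check, about 70% of the 1503 rows should land in the training set:

length(train), length(test)   # roughly (1052, 451)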

Let's first see which models are compatible with the scientific types and machine types of our data:

for model in models(matching(X, y))
    println("Model Name: ", model.name, " , Package: ", model.package_name)
end
Model Name: ConstantRegressor , Package: MLJModels
Model Name: DecisionTreeRegressor , Package: DecisionTree
Model Name: DeterministicConstantRegressor , Package: MLJModels
Model Name: RandomForestRegressor , Package: DecisionTree
Model Name: RandomForestRegressor , Package: ScikitLearn

Note that if we coerce X.Frequency to Continuous, many more models are available:

coerce!(X, :Frequency=>Continuous)

for model in models(matching(X, y))
    println("Model Name: ", model.name, " , Package: ", model.package_name)
end
Model Name: ARDRegressor , Package: ScikitLearn
Model Name: AdaBoostRegressor , Package: ScikitLearn
Model Name: BaggingRegressor , Package: ScikitLearn
Model Name: BayesianRidgeRegressor , Package: ScikitLearn
Model Name: ConstantRegressor , Package: MLJModels
Model Name: DecisionTreeRegressor , Package: DecisionTree
Model Name: DeterministicConstantRegressor , Package: MLJModels
Model Name: DummyRegressor , Package: ScikitLearn
Model Name: ElasticNetCVRegressor , Package: ScikitLearn
Model Name: ElasticNetRegressor , Package: MLJLinearModels
Model Name: ElasticNetRegressor , Package: ScikitLearn
Model Name: EpsilonSVR , Package: LIBSVM
Model Name: EvoTreeGaussian , Package: EvoTrees
Model Name: EvoTreeRegressor , Package: EvoTrees
Model Name: ExtraTreesRegressor , Package: ScikitLearn
Model Name: GaussianProcessRegressor , Package: ScikitLearn
Model Name: GradientBoostingRegressor , Package: ScikitLearn
Model Name: HuberRegressor , Package: MLJLinearModels
Model Name: HuberRegressor , Package: ScikitLearn
Model Name: KNNRegressor , Package: NearestNeighbors
Model Name: KNeighborsRegressor , Package: ScikitLearn
Model Name: LADRegressor , Package: MLJLinearModels
Model Name: LGBMRegressor , Package: LightGBM
Model Name: LarsCVRegressor , Package: ScikitLearn
Model Name: LarsRegressor , Package: ScikitLearn
Model Name: LassoCVRegressor , Package: ScikitLearn
Model Name: LassoLarsCVRegressor , Package: ScikitLearn
Model Name: LassoLarsICRegressor , Package: ScikitLearn
Model Name: LassoLarsRegressor , Package: ScikitLearn
Model Name: LassoRegressor , Package: MLJLinearModels
Model Name: LassoRegressor , Package: ScikitLearn
Model Name: LinearRegressor , Package: GLM
Model Name: LinearRegressor , Package: MLJLinearModels
Model Name: LinearRegressor , Package: ScikitLearn
Model Name: NeuralNetworkRegressor , Package: MLJFlux
Model Name: NuSVR , Package: LIBSVM
Model Name: OrthogonalMatchingPursuitCVRegressor , Package: ScikitLearn
Model Name: OrthogonalMatchingPursuitRegressor , Package: ScikitLearn
Model Name: PassiveAggressiveRegressor , Package: ScikitLearn
Model Name: QuantileRegressor , Package: MLJLinearModels
Model Name: RANSACRegressor , Package: ScikitLearn
Model Name: RandomForestRegressor , Package: DecisionTree
Model Name: RandomForestRegressor , Package: ScikitLearn
Model Name: RidgeCVRegressor , Package: ScikitLearn
Model Name: RidgeRegressor , Package: MLJLinearModels
Model Name: RidgeRegressor , Package: MultivariateStats
Model Name: RidgeRegressor , Package: ScikitLearn
Model Name: RobustRegressor , Package: MLJLinearModels
Model Name: SGDRegressor , Package: ScikitLearn
Model Name: SVMLinearRegressor , Package: ScikitLearn
Model Name: SVMNuRegressor , Package: ScikitLearn
Model Name: SVMRegressor , Package: ScikitLearn
Model Name: TheilSenRegressor , Package: ScikitLearn
Model Name: XGBoostRegressor , Package: XGBoost
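
To learn more about any listed model without loading its package, MLJ can look it up in its model registry by name (a quick sketch):

# Retrieve the registry metadata (docstring, hyperparameters, traits) for one model:
info("DecisionTreeRegressor", pkg="DecisionTree")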

DecisionTreeRegressor

We will first try out DecisionTreeRegressor:

DTR = @load DecisionTreeRegressor pkg=DecisionTree
dcr = DTR()

dcrm = machine(dcr, X, y)

fit!(dcrm, rows=train)
pred_dcrm = MLJ.predict(dcrm, rows=test);

Now we can call a loss function to assess the performance on the test set.

rms(pred_dcrm, y[test])
2.9034811027815546
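
Rather than managing the rows manually, we could get a comparable holdout estimate with MLJ's evaluate! (an optional sketch, not in the original tutorial; the internal split need not match ours exactly):

evaluate!(dcrm,
          resampling=Holdout(fraction_train=0.7, shuffle=true, rng=StableRNG(612)),
          measure=rms)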

RandomForestRegressor

Now let's try out RandomForestRegressor:

RFR = @load RandomForestRegressor pkg=DecisionTree
rfr = RFR()

rfr_m = machine(rfr, X, y);

Train on the rows corresponding to train:

fit!(rfr_m, rows=train);

Predict on the rows corresponding to test:

pred_rfr = MLJ.predict(rfr_m, rows=test);
rms(pred_rfr, y[test])
2.3976941004643058

Unsurprisingly, the RandomForestRegressor does a better job.

Can we do even better? Yes, we can, by tuning the model's hyperparameters.

Tuning

If you are new to model tuning with MLJ, refer to the lab5 and model-tuning tutorials.

To tune hyperparameters, we first specify ranges of candidate values for the parameters of interest:

r_nt = range(rfr, :n_trees, lower=9, upper=15)
r_sf = range(rfr, :sampling_fraction, lower=0.6, upper=0.8)
r = [r_nt, r_sf];
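
To preview the candidate values a grid search of a given resolution will visit, MLJ provides iterator (an optional sketch; resolution 7 is what we use below):

MLJ.iterator(r_nt, 7)   # [9, 10, 11, 12, 13, 14, 15]
MLJ.iterator(r_sf, 7)   # 7 evenly spaced values between 0.6 and 0.8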

Now we specify how the tuning should be done. Let's use a coarse grid search with cross-validation, and instantiate a tuned model:

tuning = Grid(resolution=7)
resampling = CV(nfolds=6)

tm = TunedModel(model=rfr, tuning=tuning,
                resampling=resampling, ranges=r, measure=rms)

rfr_tm = machine(tm, X, y);

Train on the rows corresponding to train:

fit!(rfr_tm, rows=train);

Predict on the rows corresponding to test:

pred_rfr_tm = MLJ.predict(rfr_tm, rows=test);
rms(pred_rfr_tm, y[test])
2.28281927705362

That was great! Tuning has further reduced the RMS error on the test set.

To retrieve the best model, you can use:

fitted_params(rfr_tm).best_model
RandomForestRegressor(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = -1,
    n_trees = 15,
    sampling_fraction = 0.7666666666666667,
    pdf_smoothing = 0.0) @649
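
The cross-validated measurement associated with the best model is also recorded; for example (field names as in current MLJTuning, a sketch):

report(rfr_tm).best_history_entry.measurement   # CV estimates for the best model (one per measure)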

Now we can investigate the tuning by using report. Let's plot a heatmap of the measurements:

rep = report(rfr_tm)
res = rep.plotting

nt = res.parameter_values[:, 1]
sf = res.parameter_values[:, 2]

figure(figsize=(8, 6))
tricontourf(nt, sf, res.measurements)

xlabel("Number of trees", fontsize=14)
ylabel("Sampling fraction", fontsize=14)
xticks(9:1:15, fontsize=12)
yticks(fontsize=12)
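
To keep the figure, PyPlot's savefig writes it to disk (the filename here is just an example):

savefig("airfoil_tuning_heatmap.svg")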
Hyperparameter heatmap