{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Minimising computation time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`AutoEmulate` can be slow if the input data has many observations (rows) or many output variables. By default, `AutoEmulate` cross-validates each model, so it computes 5 fits per model. The computation time will be relatively short for datasets of up to a few thousand data points, but some models (e.g. Gaussian Processes) don't scale well, so computation time might quickly become an issue. \n", "\n", "In this tutorial we walk through four strategies to speed up `AutoEmulate`:\n", "\n", "1) parallelise model fits using `n_jobs` \n", "2) restrict the range of models using the `models` argument \n", "3) run fewer cross-validation folds using `cross_validator` \n", "4) for hyperparameter search:\n", "   - all of the above\n", "   - run fewer iterations using `param_search_iters`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/mstoffel/turing/projects/autoemulate/autoemulate/compare.py:8: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n", " from tqdm.autonotebook import tqdm\n" ] } ], "source": [ "from sklearn.datasets import make_regression\n", "from autoemulate.compare import AutoEmulate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((500, 10), (500, 5))" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y = make_regression(n_samples=500, n_features=10, n_targets=5)\n", "X.shape, y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see how long `AutoEmulate` takes to run (without hyperparameter search)." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

AutoEmulate is set up with the following settings:

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                  Values
Simulation input shape (X)                        (500, 10)
Simulation output shape (y)                       (500, 5)
Proportion of data for testing (test_set_size)    0.2
Scale input data (scale)                          True
Scaler (scaler)                                   StandardScaler
Do hyperparameter search (param_search)           False
Reduce dimensionality (reduce_dim)                False
Cross validator (cross_validator)                 KFold
Parallel jobs (n_jobs)                            1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "87557f74fe6c4e46b17065237f9a6397", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Initializing: 0%| | 0/8 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   model                     short  rmse        r2
0  RadialBasisFunctions      rbf    0.000011    1.000000
1  SecondOrderPolynomial     sop    0.000011    1.000000
2  GaussianProcess           gp     5.686234    0.999072
3  ConditionalNeuralProcess  cnp    12.448831   0.995273
4  SupportVectorMachines     svm    66.633470   0.880780
5  LightGBM                  lgbm   71.021354   0.869580
6  GradientBoosting          gb     78.972674   0.836586
7  RandomForest              rf     112.618563  0.666424
\n", "" ], "text/plain": [ " model short rmse r2\n", "0 RadialBasisFunctions rbf 0.000011 1.000000\n", "1 SecondOrderPolynomial sop 0.000011 1.000000\n", "2 GaussianProcess gp 5.686234 0.999072\n", "3 ConditionalNeuralProcess cnp 12.448831 0.995273\n", "4 SupportVectorMachines svm 66.633470 0.880780\n", "5 LightGBM lgbm 71.021354 0.869580\n", "6 GradientBoosting gb 78.972674 0.836586\n", "7 RandomForest rf 112.618563 0.666424" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "em.summarise_cv()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we might want to check whether the Support Vector Machines model does better with tuned hyperparameters. To keep computation time down, we restrict the search to this one model and run it in parallel using 5 jobs, with 50 iterations (instead of the default 20)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f733d145401f4685aae6c2658f22a379", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Initializing: 0%| | 0/1 [00:00