Minimising computation time

Minimising computation time#

AutoEmulate can be slow if the input data has many observations (rows) or many output variables. By default, AutoEmulate cross-validates each model, so we’re computing 5 fits per models. The computation time will be relatively short for datasets up to a few thousands of datapoints, but some models (e.g. Gaussian Processes) don’t scale well, so computation time might quickly become an issue.

In this tutorial we walk through four strategies to speed up AutoEmulate:

parallelise model fits using n_jobs
restrict the range of models using the models argument
run fewer cross validation folds using cross_validator
for hyperparameter search:
- all of the above
- run fewer iterations using param_search_iters

from sklearn.datasets import make_regression
from autoemulate.compare import AutoEmulate

Let’s make a dataset.

X, y = make_regression(n_samples=500, n_features=10, n_targets=5)
X.shape, y.shape

((500, 10), (500, 5))

And see how long AutoEmulate takes to run (without hyperparameter search).

import time

start = time.time()

em = AutoEmulate()
em.setup(X, y)
em.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

AutoEmulate is set up with the following settings:

	Values
Simulation input shape (X)	(500, 10)
Simulation output shape (y)	(500, 5)
Proportion of data for testing (test_set_size)	0.2
Scale input data (scale)	True
Scaler (scaler)	StandardScaler
Scale output data (scale_output)	True
Scaler output (scaler_output)	StandardScaler
Do hyperparameter search (param_search)	False
Reduce input dimensionality (reduce_dim)	False
Reduce output dimensionality (reduce_dim_output)	False
Cross validator (cross_validator)	KFold
Parallel jobs (n_jobs)	1

Time taken: 25.927297353744507 seconds

1) parallise model fits using `n_jobs`#

The n_jobs parameter allows you to specify the number of CPU cores to use for parallel processing. Setting n_jobs = -1 uses all available cores, speeding up computations when working with large datasets.

Note: Maxing out all available cores might not always lead to faster computation times. Due to overhead from parallelization, memory bandwidth limitations, and potential load imbalances, using more cores can sometimes result in diminishing returns or even slower performance.

Here we accomplish a speed-up by setting n_jobs to 5.

start = time.time()

em = AutoEmulate()
em.setup(X, y, n_jobs=5, print_setup=False)
em.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

Time taken: 38.761271238327026 seconds

2) restrict the range of models#

Another approach is to limit the range of models by selecting a subset of relevant types based on your domain and problem expertise. This selection process typically considers factors such as the nature of the problem, data characteristics or the need for interpretability. By narrowing down the types of models, you can reduce the computational burden and focus on the most promising architectures for your specific task.

em = AutoEmulate()
em.setup(X, y, print_setup=False)

# let's see all models, which we can refer to by short or full name
em.model_registry.get_model_names()

# setup with fewer models
start = time.time()

em.setup(X, y, models=["sop", "rbf", "gb"], print_setup=False)
em.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

Time taken: 3.4650332927703857 seconds

3) reduce the number of folds in cross validation using `cross_validator`#

With larger datasets, you might initially want to set the number of folds for the cross validation to 3 instead of 5 (the default), so that there are fewer models to fit. AutoEmulate takes a cross_validator argument, which takes an scklearn cross validator or splitter. Let’s use kfold with 3 splits, which saves 2 model fits per model.

from sklearn.model_selection import KFold

start = time.time()

em = AutoEmulate()
em.setup(X, y, cross_validator=KFold(n_splits=3), print_setup=False)
em.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

Time taken: 13.10082483291626 seconds

4) modify hyperparameter search#

If we want to use hyperparameter search, we suddenly have to fit many more models. For each model, we might have 20 different parameter combinations, and because we cross validate each combination, we are running 20 * 5 = 100 model fits per model. It’s therefore recommended to focus on a few models of interest when using hyperparameter search.

To get a ballpark figure for how long hyperparameter search might take, we can run AutoEmulate without hyperparameter search, and then multiply the time taken by the number of parameter combinations we want to try.

start = time.time()

em = AutoEmulate()
em.setup(X, y, print_setup=False)
em.compare()

end = time.time()
run_time = end - start
print(f"Time taken: {run_time} seconds")

Time taken: 25.50056791305542 seconds

The default number parameter combinations to search over (seeparam_search_iters) is 20, so we can expect hyperparameter search to take 20 * run_time seconds. Although this can be sped up by running in parallel or running fewer cross-validation folds, we are usually interested only in some emulator models anyway, which will speed up computation time. To figure out which models to optimise, let’s inspect the cv results from the training data.

em.summarise_cv()

	preprocessing	model	short	fold	rmse	r2
0	None	RadialBasisFunctions	rbf	0	0.000009	1.000000
1	None	SecondOrderPolynomial	sop	0	0.000009	1.000000
2	None	SecondOrderPolynomial	sop	1	0.000009	1.000000
3	None	RadialBasisFunctions	rbf	4	0.000009	1.000000
4	None	SecondOrderPolynomial	sop	4	0.000009	1.000000
5	None	RadialBasisFunctions	rbf	1	0.000010	1.000000
6	None	RadialBasisFunctions	rbf	2	0.000009	1.000000
7	None	SecondOrderPolynomial	sop	2	0.000010	1.000000
8	None	SecondOrderPolynomial	sop	3	0.000010	1.000000
9	None	RadialBasisFunctions	rbf	3	0.000011	1.000000
10	None	GaussianProcess	gp	1	3.034883	0.999651
11	None	GaussianProcess	gp	2	3.425897	0.999507
12	None	GaussianProcess	gp	3	4.468504	0.999220
13	None	GaussianProcess	gp	0	4.591460	0.999217
14	None	GaussianProcess	gp	4	5.246312	0.998802
15	None	SupportVectorMachines	svm	1	45.832126	0.924114
16	None	SupportVectorMachines	svm	2	44.604827	0.920177
17	None	LightGBM	lgbm	3	47.092560	0.913900
18	None	GradientBoosting	gb	3	49.472768	0.905364
19	None	SupportVectorMachines	svm	3	50.629281	0.903580
20	None	LightGBM	lgbm	0	51.450443	0.902617
21	None	LightGBM	lgbm	1	53.374189	0.895335
22	None	LightGBM	lgbm	2	49.642329	0.894976
23	None	LightGBM	lgbm	4	49.987825	0.890181
24	None	GradientBoosting	gb	2	51.906384	0.887541
25	None	GradientBoosting	gb	0	56.426908	0.881789
26	None	GradientBoosting	gb	4	51.798454	0.880268
27	None	GradientBoosting	gb	1	60.257392	0.865384
28	None	SupportVectorMachines	svm	0	62.389989	0.856006
29	None	SupportVectorMachines	svm	4	58.115395	0.853417
30	None	RandomForest	rf	3	85.298890	0.719513
31	None	RandomForest	rf	2	83.610338	0.703613
32	None	RandomForest	rf	4	82.357920	0.698003
33	None	RandomForest	rf	1	93.935286	0.669747
34	None	RandomForest	rf	0	95.912987	0.654376

Now, we might like to see whether the Support Vector Machines model could do better with better hyperparameters. To minimise computation time, we only take those this one model, and run 50 iterations (instead of the default 20), we run them in parallel using 5 jobs.

start = time.time()

em_svm = AutoEmulate()
em_svm.setup(X, y, models=["svm"], param_search=True, param_search_iters=50, 
         n_jobs=5, print_setup=False)
em_svm.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

em_svm.summarise_cv()

Time taken: 7.681802272796631 seconds

	preprocessing	model	short	fold	rmse	r2
0	None	SupportVectorMachines	svm	1	7.739944	0.997790
1	None	SupportVectorMachines	svm	3	8.439675	0.997328
2	None	SupportVectorMachines	svm	2	7.981588	0.997319
3	None	SupportVectorMachines	svm	4	7.893376	0.997296
4	None	SupportVectorMachines	svm	0	8.928658	0.996972

And indeed, the SVM model does perform better with better hyperparameters! Let’s have a look at them. SVM is a pipeline, so we first extract the ‘model’ step, and then get the parameters.

svm = em_svm.get_model(name="svm")
svm.named_steps['model'].get_params()

{'estimator__C': 3.597672564640772,
 'estimator__cache_size': 200,
 'estimator__coef0': 0.8870573198191127,
 'estimator__degree': 4,
 'estimator__epsilon': 0.1107724016285712,
 'estimator__gamma': 'auto',
 'estimator__kernel': 'linear',
 'estimator__max_iter': -1,
 'estimator__normalise_y': True,
 'estimator__shrinking': True,
 'estimator__tol': 0.0003992862238854556,
 'estimator__verbose': False,
 'estimator': SupportVectorMachines(C=3.597672564640772, coef0=0.8870573198191127, degree=4,
                       epsilon=0.1107724016285712, gamma='auto', kernel='linear',
                       tol=0.0003992862238854556),
 'n_jobs': None}

Minimising computation time

Contents

Minimising computation time#

1) parallise model fits using n_jobs#

2) restrict the range of models#

3) reduce the number of folds in cross validation using cross_validator#

4) modify hyperparameter search#

1) parallise model fits using `n_jobs`#

3) reduce the number of folds in cross validation using `cross_validator`#