Minimising computation time
AutoEmulate can be slow if the input data has many observations (rows) or many output variables. By default, AutoEmulate cross-validates each model with 5 folds, so we’re computing 5 fits per model. The computation time will be relatively short for datasets of up to a few thousand datapoints, but some models (e.g. Gaussian Processes) don’t scale well, so computation time can quickly become an issue.
In this tutorial we walk through four strategies to speed up AutoEmulate:
1) parallelise model fits using n_jobs
2) restrict the range of models using the models argument
3) run fewer cross-validation folds using cross_validator
4) for hyperparameter search: all of the above, plus run fewer iterations using param_search_iters
from sklearn.datasets import make_regression
from autoemulate.compare import AutoEmulate
Let’s make a dataset.
X, y = make_regression(n_samples=500, n_features=10, n_targets=5)
X.shape, y.shape
((500, 10), (500, 5))
And see how long AutoEmulate takes to run (without hyperparameter search).
import time
start = time.time()
em = AutoEmulate()
em.setup(X, y)
em.compare()
end = time.time()
print(f"Time taken: {end - start} seconds")
AutoEmulate is set up with the following settings:
Setting | Value |
---|---|
Simulation input shape (X) | (500, 10) |
Simulation output shape (y) | (500, 5) |
Proportion of data for testing (test_set_size) | 0.2 |
Scale input data (scale) | True |
Scaler (scaler) | StandardScaler |
Do hyperparameter search (param_search) | False |
Reduce dimensionality (reduce_dim) | False |
Cross validator (cross_validator) | KFold |
Parallel jobs (n_jobs) | 1 |
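The start/end timing pattern used above recurs in every section of this tutorial. As a convenience, it could be wrapped in a small context manager. The sketch below is not part of AutoEmulate; the names timed and timings are hypothetical, and the workload is a stand-in for the em.setup(...) and em.compare() calls. It uses time.perf_counter, a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the duration, and optionally store it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} seconds")


timings = {}
with timed("baseline", timings):
    # Stand-in workload; in this tutorial it would be em.setup(X, y) and em.compare()
    sum(i * i for i in range(100_000))
```

Collecting the durations in one dictionary makes it easier to compare the strategies side by side at the end.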
1) parallelise model fits using n_jobs
The n_jobs parameter allows you to specify the number of CPU cores to use for parallel processing. Setting n_jobs = -1 uses all available cores, speeding up computations when working with large datasets.
Note: Maxing out all available cores might not always lead to faster computation times. Due to overhead from parallelization, memory bandwidth limitations, and potential load imbalances, using more cores can sometimes result in diminishing returns or even slower performance.
Here we accomplish a speed-up by setting n_jobs to 5.
start = time.time()
em = AutoEmulate()
em.setup(X, y, n_jobs=5, print_setup=False)
em.compare()
end = time.time()
print(f"Time taken: {end - start} seconds")
Time taken: 19.856790781021118 seconds
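Because more cores do not always mean faster runs, it can help to cap the requested number of jobs at what the machine actually offers. The helper below is a hypothetical sketch, not an AutoEmulate function; capped_n_jobs and its reserve parameter (cores left free for the rest of the system) are assumptions made for illustration.

```python
import os


def capped_n_jobs(requested, reserve=1):
    """Return a worker count that never exceeds the available cores minus `reserve`."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    usable = max(1, available - reserve)
    if requested == -1:
        # Follow the "-1 means all cores" convention, minus the reserved cores
        return usable
    return max(1, min(requested, usable))


n_jobs = capped_n_jobs(5)
print(f"Using n_jobs={n_jobs} of {os.cpu_count()} detected cores")
```

The result could then be passed on as em.setup(X, y, n_jobs=n_jobs). Whether a given value is actually faster is best checked by timing a run, as above.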
2) restrict the range of models
Another approach is to limit the range of models by selecting a subset of relevant types based on your domain and problem expertise. This selection process typically considers factors such as the nature of the problem, data characteristics or the need for interpretability. By narrowing down the types of models, you can reduce the computational burden and focus on the most promising architectures for your specific task.
em = AutoEmulate()
em.setup(X, y, print_setup=False)
# let's see all models, which we can refer to by short or full name
em.model_registry.get_model_names()
# setup with fewer models
start = time.time()
em.setup(X, y, models=["sop", "rbf", "gb"], print_setup=False)
em.compare()
end = time.time()
print(f"Time taken: {end - start} seconds")
Time taken: 2.7409257888793945 seconds
3) reduce the number of folds in cross validation using cross_validator
With larger datasets, you might initially want to set the number of cross-validation folds to 3 instead of 5 (the default), so that there are fewer models to fit. AutoEmulate takes a cross_validator argument, which accepts a scikit-learn cross-validator or splitter. Let’s use KFold with 3 splits, which saves 2 fits per model.
from sklearn.model_selection import KFold
start = time.time()
em = AutoEmulate()
em.setup(X, y, cross_validator=KFold(n_splits=3), print_setup=False)
em.compare()
end = time.time()
print(f"Time taken: {end - start} seconds")
Time taken: 15.135623931884766 seconds
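Fewer folds mean fewer fits, but each fit also trains on a smaller share of the data. The pure-Python sketch below illustrates this trade-off. It mirrors the fold-size convention of scikit-learn's unshuffled KFold (the first n_samples % n_splits folds get one extra sample) and assumes that cross-validation runs on the 400 training rows left after the 20% test split shown in the setup table. The function name kfold_sizes is hypothetical.

```python
def kfold_sizes(n_samples, n_splits):
    """Return (train_size, validation_size) for each fold of an unshuffled k-fold split."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional validation sample
    val_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [(n_samples - v, v) for v in val_sizes]


n_train = 400  # assumption: 500 rows minus the 20% held-out test set
for k in (5, 3):
    sizes = kfold_sizes(n_train, k)
    print(f"{k} folds -> {len(sizes)} fits per model, (train, validation) sizes: {sizes}")
```

Under these assumptions, each fit trains on 320 rows with 5 folds and on 266 or 267 rows with 3 folds, so the cross-validation scores from the faster setting may be slightly more pessimistic.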
4) modify hyperparameter search
If we want to use hyperparameter search, we suddenly have to fit many more models. For each model, we might have 20 different parameter combinations, and because we cross validate each combination, we are running 20 * 5 = 100 model fits per model. It’s therefore recommended to focus on a few models of interest when using hyperparameter search.
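The arithmetic above generalises to any number of models and folds. A minimal sketch, assuming the cost is simply models × folds × parameter combinations and ignoring any final refit of the best configuration; total_fits is a hypothetical helper, not an AutoEmulate function.

```python
def total_fits(n_models, n_folds=5, n_param_combinations=1):
    """Count the model fits in a cross-validated comparison."""
    return n_models * n_folds * n_param_combinations


# One model, default 5 folds, no hyperparameter search
print(total_fits(n_models=1))
# One model with 20 parameter combinations: 20 * 5 fits
print(total_fits(n_models=1, n_param_combinations=20))
# Three models, 3 folds, 20 parameter combinations each
print(total_fits(n_models=3, n_folds=3, n_param_combinations=20))
```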
To get a ballpark figure for how long hyperparameter search might take, we can run AutoEmulate without hyperparameter search, and then multiply the time taken by the number of parameter combinations we want to try.
start = time.time()
em = AutoEmulate()
em.setup(X, y, print_setup=False)
em.compare()
end = time.time()
run_time = end - start
print(f"Time taken: {run_time} seconds")
Time taken: 26.507158994674683 seconds
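Given a baseline like the one just measured, the ballpark estimate is a single multiplication. The sketch below hard-codes the time printed above and assumes that run time scales linearly with the number of parameter combinations, which is only a rough approximation since some hyperparameter settings fit faster or slower than the defaults. The function name estimate_search_seconds is hypothetical.

```python
def estimate_search_seconds(baseline_seconds, param_search_iters=20):
    """Rough estimate: one baseline run per parameter combination."""
    return baseline_seconds * param_search_iters


baseline = 26.507158994674683  # run_time printed above
estimate = estimate_search_seconds(baseline)
print(f"Estimated search time: {estimate:.0f} seconds (~{estimate / 60:.1f} minutes)")
# prints: Estimated search time: 530 seconds (~8.8 minutes)
```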
The default number of parameter combinations to search over (see param_search_iters) is 20, so we can expect hyperparameter search to take roughly 20 * run_time seconds. This can be sped up by running in parallel or using fewer cross-validation folds, but we are usually interested in only a few emulator models anyway, and restricting the search to those reduces computation time further. To figure out which models to optimise, let’s inspect the cv results from the training data.
em.summarise_cv()
 | model | short | rmse | r2 |
---|---|---|---|---|
0 | RadialBasisFunctions | rbf | 0.000011 | 1.000000 |
1 | SecondOrderPolynomial | sop | 0.000011 | 1.000000 |
2 | GaussianProcess | gp | 5.686234 | 0.999072 |
3 | ConditionalNeuralProcess | cnp | 12.448831 | 0.995273 |
4 | SupportVectorMachines | svm | 66.633470 | 0.880780 |
5 | LightGBM | lgbm | 71.021354 | 0.869580 |
6 | GradientBoosting | gb | 78.972674 | 0.836586 |
7 | RandomForest | rf | 112.618563 | 0.666424 |
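One hypothetical way to shortlist models for tuning is to keep those whose cross-validated r2 still leaves room for improvement. The sketch below copies the short names and r2 values from the table above; the 0.99 cut-off is an arbitrary assumption, not an AutoEmulate default.

```python
# (short name, r2) pairs copied from the summarise_cv() table
cv_results = [
    ("rbf", 1.000000), ("sop", 1.000000), ("gp", 0.999072), ("cnp", 0.995273),
    ("svm", 0.880780), ("lgbm", 0.869580), ("gb", 0.836586), ("rf", 0.666424),
]

R2_GOOD_ENOUGH = 0.99  # hypothetical cut-off; models above it are left untuned

candidates = [name for name, r2 in cv_results if r2 < R2_GOOD_ENOUGH]
print(candidates)  # ['svm', 'lgbm', 'gb', 'rf']
```

Any subset of these short names could be passed to the models argument of em.setup. Fewer names mean fewer fits.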
Now, we might like to see whether the Support Vector Machines model could do better with tuned hyperparameters. To minimise computation time, we select only this one model and run 50 iterations (instead of the default 20) in parallel using 5 jobs.
start = time.time()
em_svm = AutoEmulate()
em_svm.setup(X, y, models=["svm"], param_search=True, param_search_iters=50,
             n_jobs=5, print_setup=False)
em_svm.compare()
em_svm.summarise_cv()
end = time.time()
print(f"Time taken: {end - start} seconds")
Time taken: 0.8288900852203369 seconds
And indeed, the SVM model does perform better with better hyperparameters! Let’s have a look at them. SVM is a pipeline, so we first extract the ‘model’ step, and then get the parameters.
svm = em_svm.get_model(name="svm")
svm.named_steps['model'].get_params()
{'estimator__C': 2.7416066357879485,
'estimator__cache_size': 200,
'estimator__coef0': 0.28240475754651484,
'estimator__degree': 3,
'estimator__epsilon': 0.1023893458339658,
'estimator__gamma': 'auto',
'estimator__kernel': 'linear',
'estimator__max_iter': -1,
'estimator__normalise_y': True,
'estimator__shrinking': False,
'estimator__tol': 0.000143419387906774,
'estimator__verbose': False,
'estimator': SupportVectorMachines(C=2.7416066357879485, coef0=0.28240475754651484,
epsilon=0.1023893458339658, gamma='auto', kernel='linear',
shrinking=False, tol=0.000143419387906774),
'n_jobs': None}