autoemulate.compare#
- class AutoEmulate[source]#
Bases:
object
The AutoEmulate class is the main class of the AutoEmulate package. It is used to set up and compare different emulator models on a given dataset. It can also be used to summarise and visualise results, to save and load models, and to run sensitivity analysis.
- setup(X, y, param_search=False, param_search_type='random', param_search_iters=20, test_set_size=0.2, scale=True, scaler=StandardScaler(), reduce_dim=False, dim_reducer=PCA(), scale_output=True, scaler_output=StandardScaler(), reduce_dim_output=False, preprocessing_methods=None, cross_validator=KFold(n_splits=5, random_state=79267, shuffle=True), n_jobs=None, models=None, verbose=0, log_to_file=False, print_setup=True)[source]#
Sets up the automatic emulation; see the example after the parameter list below.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Simulation input.
y (array-like, shape (n_samples, n_outputs)) – Simulation output.
param_search (bool) – Whether to perform hyperparameter search over predefined parameter grids.
param_search_type (str) – Type of hyperparameter search to perform. Currently only “random” is supported, which samples param_search_iters random parameter settings from a predefined grid.
param_search_iters (int) – Number of parameter settings that are sampled. Only used if param_search=True.
scale (bool, default=True) – Whether to scale features/parameters in X before fitting the models using a scaler.
scaler (sklearn.preprocessing.StandardScaler) – Scaler to use. Defaults to StandardScaler. Can be any sklearn scaler.
reduce_dim (bool, default=False) – Whether to reduce the dimensionality of the data before fitting the models.
dim_reducer (sklearn.decomposition object) – Dimensionality reduction method to use. Can be any method in sklearn.decomposition with an n_components parameter. Defaults to PCA. Specify n_components like so: dim_reducer=PCA(n_components=2) for choosing 2 principal components, or dim_reducer=PCA(n_components=0.95) for choosing the number of components that explain 95% of the variance. Other methods can have slightly different n_components parameter inputs, see the sklearn documentation for more details. Dimension reduction is always performed after scaling.
scale_output (bool) – Whether to scale the output data.
scaler_output (sklearn.preprocessing.StandardScaler) – Scaler to use. Defaults to StandardScaler. Can be any sklearn scaler.
reduce_dim_output (bool) – Whether to reduce the dimensionality of the output data.
preprocessing_methods (list of dict) – List of dictionaries specifying the dimensionality reduction methods to use for the outputs, together with their parameters. Can be PCA or a variational autoencoder.
cross_validator (sklearn.model_selection object) – Cross-validation strategy to use. Defaults to KFold with 5 splits and shuffle=True. Can be any object in sklearn.model_selection that generates train/test indices.
n_jobs (int) – Number of jobs to run in parallel. None means 1, -1 means using all processors.
models (str or list of str) – Model name or list of model names to compare. If None, uses a set of core models.
verbose (int) – Verbosity level. Can be 0, 1, or 2.
log_to_file (bool) – Whether to log to file.
print_setup (bool) – Whether to print the setup of the AutoEmulate object.
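A minimal usage sketch of setup(), assuming a small synthetic dataset built with numpy; the toy simulation and variable names below are illustrative and not part of the package:

```python
import numpy as np
from autoemulate.compare import AutoEmulate

# Toy "simulation": 2 inputs, 2 outputs (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
y = np.column_stack([X[:, 0] ** 2, np.sin(X[:, 1])])

em = AutoEmulate()
em.setup(X, y, scale=True, reduce_dim=False, print_setup=True)
```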
- compare()[source]#
Compares models using cross-validation, with the option to perform hyperparameter search. self.setup() must be run first.
- Returns:
self.best_model – Emulator with the highest cross-validation R2 score.
- Return type:
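Assuming setup() has been run as in the sketch above, a comparison run is a single call; the returned object is the emulator with the highest cross-validation R2 score:

```python
# Requires em.setup(X, y) to have been called first (see the setup sketch above)
best_model = em.compare()
```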
- get_model(name=None, rank=None, preprocessing=None, metric='r2')[source]#
Get a fitted model by name or rank, optionally from specific preprocessing.
- Parameters:
- Return type:
Fitted model instance
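A sketch of retrieving fitted models after compare(); the model name below is assumed to match the names reported by summarise_cv():

```python
# Best model overall (rank 1 by cross-validation R2)
top_model = em.get_model(rank=1, metric="r2")

# A specific model by full name (name assumed to match summarise_cv() output)
gp_model = em.get_model(name="GaussianProcess")
```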
- get_best_model_for_prep(prep_results, metric='r2')[source]#
Get the best model for a specific preprocessing method.
- refit(model=None)[source]#
Refits the model on the full data. This is useful because compare() fits models only on the training data.
- Parameters:
model – Model to refit.
- Returns:
model – Refitted model.
- Return type:
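Since compare() fits models only on the training split, the selected emulator can be refitted on the full data before downstream use; a minimal sketch:

```python
# Refit the chosen emulator on the full dataset
final_model = em.refit(best_model)
```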
- save(model=None, path=None)[source]#
Saves model to disk.
- Parameters:
model (sklearn model, optional) – Model to save. If None, saves the model with the best average cv score.
path (str) – Path to save the model.
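A minimal sketch of saving a refitted model; the output path is illustrative:

```python
# Save the refitted emulator to disk (path is an example)
em.save(final_model, path="outputs/best_emulator")
```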
- summarise_cv(model=None, preprocessing=None, sort_by='r2')[source]#
Summarise cross-validation results across models and preprocessing methods.
- Parameters:
model (str, optional) – Name of the model to get cv results for (can be full name like “GaussianProcess” or short name like “gp”). If None, summarises results for all models.
preprocessing (str, optional) – Name of preprocessing method to filter by. If None, includes all methods.
sort_by (str, optional) – The metric to sort by. Default is “r2”, can also be “rmse”.
- Returns:
DataFrame summarising cv results with preprocessing information.
- Return type:
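A sketch of summarising cross-validation results, both across all models and filtered to a single model (the short name “gp” follows the example above):

```python
# All models and preprocessing methods, sorted by R2
cv_summary = em.summarise_cv(sort_by="r2")

# A single model, selected by its short name
gp_summary = em.summarise_cv(model="gp")
```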
- summarize_cv(model=None, preprocessing=None, sort_by='r2')#
Alias of summarise_cv(); see above for parameters and return values.
- plot_cv(model=None, preprocessing=None, style='Xy', n_cols=3, figsize=None, output_index=0, input_index=0)[source]#
Plots the results of the cross-validation for a specific preprocessing method.
- Parameters:
model (str) – Name of the model to plot. If None, plots best folds of each model.
preprocessing (str) – Name of preprocessing method to plot. If None, uses the best preprocessing method. Use ‘None’ (string) for no preprocessing.
style (str, optional) – The type of plot to draw. Default is “Xy”.
n_cols (int) – Maximum number of columns in the plot grid.
figsize (tuple, optional) – Overrides the default figure size.
output_index (int) – Index of the output to plot.
input_index (int) – Index of the input to plot.
- Returns:
The figure object containing the plot.
- Return type:
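A sketch of plotting cross-validation results for one model; the model name is assumed to match the names reported by summarise_cv():

```python
# Observed and predicted values vs. the first input, for one model
fig = em.plot_cv(model="GaussianProcess", style="Xy", input_index=0, output_index=0)
```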
- evaluate(model=None, preprocessing=None, multioutput='uniform_average')[source]#
Evaluates the model on the test set, handling preprocessing transformations if any.
- Parameters:
model (object, optional) – Fitted model to evaluate. If None, uses the best model from comparison.
preprocessing (str, optional) – Name of preprocessing method used for the model. If None, uses the best preprocessing method from comparison or assumes no preprocessing if none was specified.
multioutput (str, optional) – Defines how scores for multiple outputs are aggregated: ‘raw_values’ returns scores for each target; ‘uniform_average’ averages scores uniformly; ‘variance_weighted’ averages scores weighted by their individual variances.
- Returns:
DataFrame containing the model scores on the test set, with columns:
- model: model name
- short: short model name
- preprocessing: preprocessing method used (if any)
- target: target/output name (if multioutput=’raw_values’)
- metric scores (e.g., r2, rmse)
- Return type:
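A sketch of evaluating the refitted model on the held-out test set, returning per-target scores:

```python
# Per-target scores (e.g. r2, rmse) on the test set
test_scores = em.evaluate(final_model, multioutput="raw_values")
print(test_scores)
```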
- plot_eval(model=None, preprocessing=None, style='Xy', n_cols=3, figsize=None, output_index=0, input_index=0)[source]#
Visualise model predictive performance on the test set.
- Parameters:
model (object) – Fitted model.
style (str, optional) – The type of plot to draw: “Xy” (default) plots observed and predicted values vs. features, including 2σ error bands where available; “actual_vs_predicted” draws the observed values (y-axis) vs. the predicted values (x-axis); “residual_vs_predicted” draws the residuals, i.e. the difference between observed and predicted values (y-axis), vs. the predicted values (x-axis).
n_cols (int, optional) – Maximum number of columns in the plot grid for multi-output. Default is 3.
output_index (int or list of int) – Index or indices of the output(s) to plot.
input_index (int or list of int) – Index or indices of the input(s) to plot. Only used if style=”Xy”.
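A sketch of visualising test-set performance for the refitted model:

```python
# Observed vs. predicted values on the test set
em.plot_eval(final_model, style="actual_vs_predicted")
```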
- sensitivity_analysis(model=None, problem=None, N=1024, conf_level=0.95, as_df=True)[source]#
Perform Sobol sensitivity analysis on a fitted emulator.
Sobol sensitivity analysis is a variance-based method that decomposes the variance of the model output into contributions from individual input parameters and their interactions. It calculates:
- First-order indices (S1): direct contribution of each input parameter
- Second-order indices (S2): contribution from pairwise interactions between parameters
- Total-order indices (ST): total contribution of a parameter, including all its interactions
- Parameters:
model (object, optional) – Fitted model. If None, uses the best model from cross-validation.
problem (dict, optional) –
The problem definition dictionary. If None, the problem is generated from X using the minimum and maximum values of the features as bounds. The dictionary should contain:
- 'num_vars': number of input variables (int)
- 'names': list of variable names (list of str)
- 'bounds': list of [min, max] bounds for each variable (list of lists)
- 'output_names': optional list of output names (list of str)
Example:
    problem = {
        "num_vars": 2,
        "names": ["x1", "x2"],
        "bounds": [[0, 1], [0, 1]],
        "output_names": ["y1", "y2"],  # optional
    }
N (int, optional) – Number of samples to generate for the analysis. Higher values give more accurate results but increase computation time. Default is 1024.
conf_level (float, optional) – Confidence level (between 0 and 1) for calculating confidence intervals of the sensitivity indices. Default is 0.95 (95% confidence).
as_df (bool, optional) – If True, returns results as a long-format pandas DataFrame with columns for parameters, sensitivity indices, and confidence intervals. If False, returns the raw SALib results dictionary. Default is True.
- Returns:
If as_df=True (default), returns a DataFrame with columns:
- 'parameter': input parameter name
- 'output': output variable name
- 'S1', 'S2', 'ST': first, second, and total order sensitivity indices
- 'S1_conf', 'S2_conf', 'ST_conf': confidence intervals for each index
If as_df=False, returns the raw SALib results dictionary.
- Return type:
Notes
The analysis requires N * (2D + 2) model evaluations, where D is the number of input parameters. For example, with N=1024 and 5 parameters, this requires 12,288 evaluations.
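A sketch of running the Sobol analysis with an explicit problem definition; the variable names and bounds are illustrative. With N=256 and 2 inputs this costs 256 * (2*2 + 2) = 1,536 model evaluations:

```python
problem = {
    "num_vars": 2,
    "names": ["x1", "x2"],
    "bounds": [[0, 1], [0, 1]],
}

# Long-format DataFrame of S1/S2/ST indices with confidence intervals
si = em.sensitivity_analysis(model=final_model, problem=problem, N=256)
```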
- plot_sensitivity_analysis(results, index='S1', n_cols=None, figsize=None)[source]#
Plot the sensitivity analysis results.
- Parameters:
results (pd.DataFrame) – The results from sobol_results_to_df.
index (str, default “S1”) – The type of sensitivity index to plot: “S1” first-order indices; “S2” second-order/interaction indices; “ST” total-order indices.
n_cols (int, optional) – The number of columns in the plot. Defaults to 3 if there are 3 or more outputs, otherwise the number of outputs.
figsize (tuple, optional) – Figure size as (width, height) in inches. If None, automatically calculated.
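A sketch of plotting the first-order indices from the analysis above:

```python
# Plot first-order Sobol indices (S1) for each output
em.plot_sensitivity_analysis(si, index="S1")
```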