First-Time Users’ Frequently Asked Questions

General Questions

What is AutoEmulate?
- A Python package that makes it easy to create emulators for complex simulations. It takes a set of simulation inputs X and outputs y, then automatically fits, optimises and evaluates various machine learning models to find the best emulator. The emulator can then be used as a drop-in replacement for the simulation, and is much faster and computationally cheaper to evaluate. We have also implemented global sensitivity analysis, a common emulator application, and are working towards making AutoEmulate a true end-to-end package for building emulators.
How do I know whether AutoEmulate is the right tool for me?
- You need to build an emulator for a simulation.
- You want to do global sensitivity analysis.
- Your inputs X and outputs y are numeric and complete (we don’t support missing data yet).
- You have one or more input parameters and one or more output variables.
- You have a small-ish dataset, on the order of hundreds to a few thousand samples. All default emulator parameters and search spaces are optimised for smaller datasets.
Does AutoEmulate support multi-output data?
- Yes, all models support multi-output data. Some do so natively, others are wrapped in a MultiOutputRegressor, which fits one model per target variable.
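The "one model per target variable" idea can be sketched in plain Python. This is only an illustration of the wrapping pattern, not scikit-learn's implementation of MultiOutputRegressor (which lives in `sklearn.multioutput`); the names `MeanRegressor` and `OneModelPerTarget` are hypothetical and exist only for this example.

```python
from statistics import mean


class MeanRegressor:
    """Toy single-output model (hypothetical): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class OneModelPerTarget:
    """Hypothetical wrapper: fit an independent single-output model per column of Y."""

    def __init__(self, make_model):
        self.make_model = make_model

    def fit(self, X, Y):
        n_outputs = len(Y[0])
        # One fresh model for each target column.
        self.models_ = [
            self.make_model().fit(X, [row[j] for row in Y])
            for j in range(n_outputs)
        ]
        return self

    def predict(self, X):
        # Collect per-target predictions, then transpose to one row per sample.
        per_target = [model.predict(X) for model in self.models_]
        return [list(row) for row in zip(*per_target)]


X = [[0.0], [1.0], [2.0]]
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

wrapper = OneModelPerTarget(MeanRegressor).fit(X, Y)
print(len(wrapper.models_))      # 2
print(wrapper.predict([[5.0]]))  # [[2.0, 20.0]]
```

Because each target gets its own model, a wrapper like this cannot exploit correlations between outputs, which is one reason natively multi-output models can behave differently.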
Does AutoEmulate support temporal or spatial data?
- Not explicitly. Both the train-test split and KFold cross-validation select samples at random, so any temporal or spatial ordering in the data is ignored.
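To see why random splitting matters for ordered data, here is a minimal sketch of shuffled k-fold assignment in plain Python. The function `random_kfold_indices` is hypothetical and is not AutoEmulate's or scikit-learn's code; it only illustrates that, if rows are ordered in time, each test fold ends up interleaved with training rows.

```python
import random


def random_kfold_indices(n_samples, n_splits, seed=0):
    """Shuffle row indices and deal them into n_splits test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every n_splits-th shuffled index goes to the same fold.
    return [sorted(indices[i::n_splits]) for i in range(n_splits)]


# Imagine rows 0..9 are consecutive time steps.
folds = random_kfold_indices(n_samples=10, n_splits=5)
for k, test_rows in enumerate(folds):
    print(f"fold {k}: test rows {test_rows}")
```

With splits like these, the emulator is scored on filling gaps between observed time steps rather than on predicting beyond them, so the reported scores may be optimistic for forecasting tasks.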
AutoEmulate takes a long time to run on my dataset, why?
- The package fits a lot of models, particularly when hyperparameters are optimised. With, say, 8 default models and 5-fold cross-validation, this amounts to 40 model fits. Adding hyperparameter optimisation (n_iter=20) brings this to 800 model fits. Some models, such as Gaussian Processes and Neural Processes, take a long time to run on a CPU. However, don’t despair! There is a guide on speeding up AutoEmulate. As a rule of thumb, if your dataset is smaller than 1000 samples, you should be fine. If it’s larger and you want to optimise hyperparameters, you might want to read the guide.
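The fit counts quoted above follow from a simple product. The helper below is a hypothetical back-of-the-envelope calculator, assuming one fit per model, per fold, per hyperparameter candidate, and ignoring any extra refits a particular search strategy might perform.

```python
def total_model_fits(n_models: int, n_folds: int, n_iter: int = 1) -> int:
    """Estimate the number of fits: models x CV folds x hyperparameter candidates."""
    return n_models * n_folds * n_iter


print(total_model_fits(n_models=8, n_folds=5))             # 40
print(total_model_fits(n_models=8, n_folds=5, n_iter=20))  # 800
```

Reducing any one of the three factors (fewer models, fewer folds, or a smaller n_iter) cuts the total proportionally.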

Usage Questions

What data do I need to provide to AutoEmulate to build an emulator?
- You’ll need two input objects: X and y. X is an ndarray / Pandas DataFrame of shape (n_samples, n_parameters) and y is an ndarray / Pandas DataFrame of shape (n_samples, n_outputs). Each sample is one simulation run, so each row of X holds a set of input parameters and the same row of y holds the resulting simulation output. You’ll usually have created X using Latin hypercube sampling or a similar method, and y by running the simulation on these X inputs.
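As a rough sketch of how such a dataset comes about, the snippet below draws a Latin hypercube design in plain Python and runs a toy function on it. Both `latin_hypercube` and `toy_simulation` are hypothetical stand-ins written for this example; in practice you would likely use a library sampler (for instance `scipy.stats.qmc.LatinHypercube`) and your real simulator.

```python
import random


def latin_hypercube(n_samples, n_parameters, seed=0):
    """Return n_samples points in [0, 1)^n_parameters, one per stratum in each dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_parameters):
        # Split [0, 1) into n_samples equal strata and visit them in random order.
        strata = list(range(n_samples))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_samples for s in strata])
    # Transpose so that each row is one simulation run.
    return [list(row) for row in zip(*columns)]


def toy_simulation(params):
    """Hypothetical stand-in for an expensive simulator with two outputs."""
    a, b, c = params
    return [a + b * c, a * a - c]


X = latin_hypercube(n_samples=50, n_parameters=3)
y = [toy_simulation(row) for row in X]

print(len(X), len(X[0]))  # 50 3  -> (n_samples, n_parameters)
print(len(y), len(y[0]))  # 50 2  -> (n_samples, n_outputs)
```

The nested lists can be converted with `numpy.asarray(X)` and `numpy.asarray(y)` to obtain ndarrays of the shapes described above.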
Can I use AutoEmulate for commercial purposes?
- Yes. It’s licensed under the MIT license, which allows for commercial use. See the license for more information.

Advanced Usage

Does AutoEmulate support parallel processing or high-performance computing (HPC) environments?
- Yes, AutoEmulate.setup() has an n_jobs parameter which lets you parallelise cross-validation and hyperparameter optimisation. We are also working on GPU support for some models.
Can AutoEmulate be integrated with other data analysis or simulation tools?
- Yes. AutoEmulate takes simple X and y ndarrays as input and returns emulators that are scikit-learn estimators, so they can be saved, loaded and used like any other scikit-learn model.

Data Handling

What are the best practices for data preprocessing before using AutoEmulate?
- You will typically run your simulation on a selected set of input parameters (the experimental design), chosen with a Latin hypercube or another sampling method. AutoEmulate currently needs all inputs to be numeric and does not support missing data. By default, AutoEmulate scales the input data to zero mean and unit variance, and for some models it also scales the output data. There’s also an option to do dimensionality reduction in setup().
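For readers unfamiliar with scaling to zero mean and unit variance, here is a minimal plain-Python sketch of the transformation. The helper `standardise_columns` is hypothetical; how AutoEmulate performs the scaling internally is not shown here, and the use of the population standard deviation is an assumption of this example.

```python
from statistics import fmean, pstdev


def standardise_columns(X):
    """Scale each column of X to zero mean and unit variance."""
    columns = list(zip(*X))
    means = [fmean(col) for col in columns]
    # Assumption: population std; fall back to 1.0 for constant columns.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in X
    ]


# Two parameters on very different scales.
X = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
X_scaled = standardise_columns(X)
for row in X_scaled:
    print(row)
```

After scaling, both columns have comparable magnitudes, which helps models that are sensitive to the relative scale of their inputs.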

Troubleshooting

What common issues might I encounter when using AutoEmulate, and how can I solve them?
- AutoEmulate.setup() has a log_to_file option to log all warnings/errors to a file. It also has a verbose option to print more information to the console. If you encounter an error, please open an issue (see below).
- One common issue is that the Jupyter notebook kernel crashes when running compare() in parallel, often due to LightGBM. In this case, we recommend either specifying n_jobs=1 or selecting specific (non-LightGBM) models in setup() with the models parameter.
How can I report a bug or request a feature in AutoEmulate?
- You can report a bug or request a new feature through the issue templates in our GitHub repository. Choose the template that fits your purpose and fill it in.

Community and Learning Resources

Where can I find tutorials or case studies on using AutoEmulate?
- See the tutorial for a comprehensive guide on using the package. Case studies are coming soon.
How can I stay updated on new releases or updates to AutoEmulate?
- Watch the AutoEmulate repository.
What support options are available if I need help with AutoEmulate?
- Please open an issue or start a discussion on GitHub.
