{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Minimising computation time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`AutoEmulate` can be slow if the input data has many observations (rows) or many output variables. By default, `AutoEmulate` cross-validates each model, so we're computing 5 fits per models. The computation time will be relatively short for datasets up to a few thousands of datapoints, but some models (e.g. Gaussian Processes) don't scale well, so computation time might quickly become an issue. \n",
    "\n",
    "In this tutorial we walk through four strategies to speed up `AutoEmulate`:\n",
    "\n",
    "1) parallise model fits using `n_jobs` \n",
    "2) restrict the range of models using the `models` argument \n",
    "3) run fewer cross validation folds using `cross_validator` \n",
    "4) for hyperparameter search:\n",
    "    - all of the above\n",
    "    - run fewer iterations using `param_search_iters`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/mstoffel/turing/projects/autoemulate/autoemulate/compare.py:8: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n",
      "  from tqdm.autonotebook import tqdm\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import make_regression\n",
    "from autoemulate.compare import AutoEmulate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make a dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((500, 10), (500, 5))"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X, y = make_regression(n_samples=500, n_features=10, n_targets=5)\n",
    "X.shape, y.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And see how long `AutoEmulate` takes to run (without hyperparameter search)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<p>AutoEmulate is set up with the following settings:</p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Values</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Simulation input shape (X)</th>\n",
       "      <td>(500, 10)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Simulation output shape (y)</th>\n",
       "      <td>(500, 5)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Proportion of data for testing (test_set_size)</th>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Scale input data (scale)</th>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Scaler (scaler)</th>\n",
       "      <td>StandardScaler</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Do hyperparameter search (param_search)</th>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Reduce dimensionality (reduce_dim)</th>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cross validator (cross_validator)</th>\n",
       "      <td>KFold</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Parallel jobs (n_jobs)</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "87557f74fe6c4e46b17065237f9a6397",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Initializing:   0%|          | 0/8 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken: 33.17510414123535 seconds\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "start = time.time()\n",
    "\n",
    "em = AutoEmulate()\n",
    "em.setup(X, y)\n",
    "em.compare()\n",
    "\n",
    "end = time.time()\n",
    "print(f\"Time taken: {end - start} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1) parallise model fits using `n_jobs`\n",
    "The n_jobs parameter allows you to specify the number of CPU cores to use for parallel processing. Setting n_jobs = -1  uses all available cores, speeding up computations when working with large datasets.\n",
    "\n",
    "Note: Maxing out all available cores might not always lead to faster computation times. Due to overhead from parallelization, memory bandwidth limitations, and potential load imbalances, using more cores can sometimes result in diminishing returns or even slower performance.\n",
    "\n",
    "Here we accomplish a speed-up by setting n_jobs to 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "930891bede6648d59f8376def60235ca",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Initializing:   0%|          | 0/8 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken: 19.856790781021118 seconds\n"
     ]
    }
   ],
   "source": [
    "start = time.time()\n",
    "\n",
    "em = AutoEmulate()\n",
    "em.setup(X, y, n_jobs=5, print_setup=False)\n",
    "em.compare()\n",
    "\n",
    "end = time.time()\n",
    "print(f\"Time taken: {end - start} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2) restrict the range of models\n",
    "\n",
    "Another approach is to limit the range of models by selecting a subset of relevant types based on your domain and problem expertise. This selection process typically considers factors such as the nature of the problem, data characteristics or the need for interpretability. By narrowing down the types of models, you can reduce the computational burden and focus on the most promising architectures for your specific task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8db0616a71fd46fb8b544b6992466cfe",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Initializing:   0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken: 2.7409257888793945 seconds\n"
     ]
    }
   ],
   "source": [
    "em = AutoEmulate()\n",
    "em.setup(X, y, print_setup=False)\n",
    "\n",
    "# let's see all models, which we can refer to by short or full name\n",
    "em.model_registry.get_model_names()\n",
    "\n",
    "# setup with fewer models\n",
    "start = time.time()\n",
    "\n",
    "em.setup(X, y, models=[\"sop\", \"rbf\", \"gb\"], print_setup=False)\n",
    "em.compare()\n",
    "\n",
    "end = time.time()\n",
    "print(f\"Time taken: {end - start} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3) reduce the number of folds in cross validation using `cross_validator` "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With larger datasets, you might initially want to set the number of folds for the cross validation to 3 instead of 5 (the default), so that there are fewer models to fit. `AutoEmulate` takes a `cross_validator` argument, which takes an scklearn cross validator or [splitter](https://scikit-learn.org/stable/api/sklearn.model_selection.html). Let's use kfold with 3 splits, which saves 2 model fits per model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0b1bdebde6bd44b099be93cde2ee0e38",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Initializing:   0%|          | 0/8 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken: 15.135623931884766 seconds\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import KFold\n",
    "\n",
    "start = time.time()\n",
    "\n",
    "em = AutoEmulate()\n",
    "em.setup(X, y, cross_validator=KFold(n_splits=3), print_setup=False)\n",
    "em.compare()\n",
    "\n",
    "end = time.time()\n",
    "print(f\"Time taken: {end - start} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4) modify hyperparameter search\n",
    "\n",
    "If we want to use hyperparameter search, we suddenly have to fit many more models. For each model, we might have 20 different parameter combinations, and because we cross validate each combination, we are running 20 * 5 = 100 model fits per model. It's therefore recommended to focus on a few models of interest when using hyperparameter search.\n",
    "\n",
    "To get a ballpark figure for how long hyperparameter search might take, we can run `AutoEmulate` without hyperparameter search, and then multiply the time taken by the number of parameter combinations we want to try."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ee1e27685cb74000b4fb19f311e0a0e0",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Initializing:   0%|          | 0/8 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken: 26.507158994674683 seconds\n"
     ]
    }
   ],
   "source": [
    "start = time.time()\n",
    "\n",
    "em = AutoEmulate()\n",
    "em.setup(X, y, print_setup=False)\n",
    "em.compare()\n",
    "\n",
    "end = time.time()\n",
    "run_time = end - start\n",
    "print(f\"Time taken: {run_time} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default number parameter combinations to search over (see`param_search_iters`) is 20, so we can expect hyperparameter search to take `20 * run_time` seconds. Although this can be sped up by running in parallel or running fewer cross-validation folds, we are usually interested only in some emulator models anyway, which will speed up computation time. To figure out which models to optimise, let's inspect the cv results from the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>model</th>\n",
       "      <th>short</th>\n",
       "      <th>rmse</th>\n",
       "      <th>r2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RadialBasisFunctions</td>\n",
       "      <td>rbf</td>\n",
       "      <td>0.000011</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>SecondOrderPolynomial</td>\n",
       "      <td>sop</td>\n",
       "      <td>0.000011</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GaussianProcess</td>\n",
       "      <td>gp</td>\n",
       "      <td>5.686234</td>\n",
       "      <td>0.999072</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ConditionalNeuralProcess</td>\n",
       "      <td>cnp</td>\n",
       "      <td>12.448831</td>\n",
       "      <td>0.995273</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>SupportVectorMachines</td>\n",
       "      <td>svm</td>\n",
       "      <td>66.633470</td>\n",
       "      <td>0.880780</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>LightGBM</td>\n",
       "      <td>lgbm</td>\n",
       "      <td>71.021354</td>\n",
       "      <td>0.869580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>GradientBoosting</td>\n",
       "      <td>gb</td>\n",
       "      <td>78.972674</td>\n",
       "      <td>0.836586</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>RandomForest</td>\n",
       "      <td>rf</td>\n",
       "      <td>112.618563</td>\n",
       "      <td>0.666424</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      model short        rmse        r2\n",
       "0      RadialBasisFunctions   rbf    0.000011  1.000000\n",
       "1     SecondOrderPolynomial   sop    0.000011  1.000000\n",
       "2           GaussianProcess    gp    5.686234  0.999072\n",
       "3  ConditionalNeuralProcess   cnp   12.448831  0.995273\n",
       "4     SupportVectorMachines   svm   66.633470  0.880780\n",
       "5                  LightGBM  lgbm   71.021354  0.869580\n",
       "6          GradientBoosting    gb   78.972674  0.836586\n",
       "7              RandomForest    rf  112.618563  0.666424"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "em.summarise_cv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we might like to see whether the Support Vector Machines model could do better with better hyperparameters. To minimise computation time, we only take those this one model, and run 50 iterations (instead of the default 20), we run them in parallel using 5 jobs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f733d145401f4685aae6c2658f22a379",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Initializing:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken: 0.8288900852203369 seconds\n"
     ]
    }
   ],
   "source": [
    "start = time.time()\n",
    "\n",
    "em_svm = AutoEmulate()\n",
    "em_svm.setup(X, y, models=[\"svm\"], param_search=True, param_search_iters=50, \n",
    "         n_jobs=5, print_setup=False)\n",
    "em_svm.compare()\n",
    "em_svm.summarise_cv()\n",
    "\n",
    "end = time.time()\n",
    "print(f\"Time taken: {end - start} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And indeed, the SVM model does perform better with better hyperparameters! Let's have a look at them. SVM is a pipeline, so we first extract the 'model' step, and then get the parameters. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'estimator__C': 2.7416066357879485,\n",
       " 'estimator__cache_size': 200,\n",
       " 'estimator__coef0': 0.28240475754651484,\n",
       " 'estimator__degree': 3,\n",
       " 'estimator__epsilon': 0.1023893458339658,\n",
       " 'estimator__gamma': 'auto',\n",
       " 'estimator__kernel': 'linear',\n",
       " 'estimator__max_iter': -1,\n",
       " 'estimator__normalise_y': True,\n",
       " 'estimator__shrinking': False,\n",
       " 'estimator__tol': 0.000143419387906774,\n",
       " 'estimator__verbose': False,\n",
       " 'estimator': SupportVectorMachines(C=2.7416066357879485, coef0=0.28240475754651484,\n",
       "                       epsilon=0.1023893458339658, gamma='auto', kernel='linear',\n",
       "                       shrinking=False, tol=0.000143419387906774),\n",
       " 'n_jobs': None}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svm = em_svm.get_model(name=\"svm\")\n",
    "svm.named_steps['model'].get_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "autoemulate-_SyXUh_0-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}