Stage 2: Model Development¶

This section covers the second set of tasks in the project lifecycle model: those concerned with the model development stage. For each task, we give a description, explain why the typical activities associated with it matter, and highlight issues of ethical significance.

Preprocessing and Feature Engineering¶

Task Description¶

Data analysis can give rise to valuable insights (e.g. business intelligence), but not all the data types that have been collected will be appropriate to train ML algorithms. Therefore, preprocessing and feature engineering involves transforming the data into a form that is suitable for the next stage of the project lifecycle. This typically involves the cleaning, normalising, or otherwise refactoring of data into the features that will be used in model training and testing, as well as the features that may be used in the final system.
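As a minimal sketch of the cleaning and normalising described above, the following example drops missing entries and rescales the remaining values to zero mean and unit variance (z-score standardisation). The function name `clean_and_standardise` and the heart-rate readings are hypothetical and chosen only for illustration; real projects would typically rely on an established data-processing library and a documented strategy for missing data.

```python
from statistics import mean, pstdev


def clean_and_standardise(values):
    """Drop missing entries, then rescale to zero mean and unit variance."""
    # Cleaning step: discard missing values (assumption: dropping is acceptable here).
    cleaned = [v for v in values if v is not None]
    mu = mean(cleaned)
    sigma = pstdev(cleaned)  # population standard deviation
    if sigma == 0:
        # All values are identical, so there is no spread to normalise.
        return [0.0 for _ in cleaned]
    # Normalising step: z-score each remaining value.
    return [(v - mu) / sigma for v in cleaned]


# Hypothetical raw readings with one missing entry.
raw_heart_rates = [60, None, 70, 80]
standardised = clean_and_standardise(raw_heart_rates)
print(standardised)
```

Note that dropping records is itself a design choice with ethical weight: if values are missing more often for some groups than others, removing them can skew the resulting features.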

Features, therefore, may not be the same as the raw data that are collected in the prior stages. Rather, they may represent a combination of multiple data types, and as such may not always be interpretable to the end user.
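To illustrate how a feature can combine several raw data types, here is a small sketch that derives body mass index from weight and height. The field names (`weight_kg`, `height_m`) and the records are hypothetical; the point is only that the engineered value is not something that was collected directly.

```python
def body_mass_index(weight_kg, height_m):
    """Combine two raw measurements into a single engineered feature."""
    return weight_kg / height_m ** 2


# Hypothetical raw records, as they might arrive from data collection.
records = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 90.0, "height_m": 1.80},
]

# The feature set passed to training holds the derived value, not the raw columns.
features = [
    {"bmi": round(body_mass_index(r["weight_kg"], r["height_m"]), 1)}
    for r in records
]
print(features)
```

A feature like this one remains fairly easy to explain. Features produced by algorithmic techniques, combining many inputs at once, may be much harder for an end user to interpret.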

Importance of Task¶

Features are dependent upon, but separate from, the raw data that are collected in the prior stages. They can be engineered by hand or through the use of algorithmic techniques to improve the performance of subsequent ML processes.

However, the features used in model training do not only affect the model's accuracy or predictive power; they also shape the ethical consequences of the project (e.g. by reducing the explanatory potential of the system or creating discriminatory outcomes). Selecting the best features is therefore a vital, albeit often time-consuming and complicated, task that can involve trade-offs about which property to optimise for (e.g. predictive power versus interpretability).

Illustrative Example¶

Health care | Environmental Sciences

Illustrative examples coming soon!

Model Selection and Training¶

Task Description¶

This task involves the selection of a particular process (or algorithm) for training a model, and the training of the model itself.

There are many factors that feed into the decision of which model to select, including (but not limited to):

Access to computational resources (some learning algorithms require vast levels of computational power)
Predictive performance of model (as compared to other models)
Properties of the underlying data (e.g. whether the dataset is large enough)

Model training is the process of fitting a statistical model to training data. Training is typically iterative and proceeds by adjusting the model's parameters (e.g. weights) to progressively reduce the error between the model's predictions and the true values in the training data. The earlier problem formulation task is important here because the target variable determined there guides the choice of model and the training process.
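The iterative character of training can be sketched with gradient descent on a one-variable linear model. This is an illustrative example under stated assumptions, not a recommended training procedure: the function name `train_linear_model`, the learning rate, the number of epochs, and the toy data are all hypothetical.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by repeatedly reducing the mean squared error."""
    w, b = 0.0, 0.0  # parameters (weights) start at arbitrary values
    n = len(xs)
    for _ in range(epochs):
        # Error between the current predictions and the true training values.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step each parameter in the direction that lowers the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy training data generated from y = 2x + 1 (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear_model(xs, ys)
print(round(w, 3), round(b, 3))
```

The target variable (`ys`) is fixed before training begins, which is why the problem formulation task constrains what the model can learn.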

In addition, this task requires the project team to split their data into a training set and a testing set to prepare for the next task.
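A simple random split might look like the sketch below. The function name `train_test_split`, the 20% test fraction, and the fixed seed are assumptions made for illustration; the appropriate proportion and splitting strategy (for example, splitting by patient or by site rather than by row) depend on the project.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    # Return (training set, testing set); the two never overlap.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten data records
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Recording the seed alongside the split makes it possible for others to reproduce the same partition later.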

Importance of Task¶

There are, of course, many technical and logistical reasons for the responsible selection and training of a model (e.g. ensuring parsimony, optimising performance).

However, a key concept in the responsible development of a model is the inherent interpretability and post hoc explainability of both the model and the behaviour of the system in which it is implemented. Although there are nuances and exceptions, more complex models are generally harder to interpret and explain (e.g. linear regression versus convolutional neural networks). Selecting the right technique, therefore, depends on the ultimate use case of the model and system.

In addition, the training process can be computationally intensive and may require the use of specialised hardware (e.g. GPUs). This gives rise to issues of sustainability and fairness, as the training process may be inaccessible to some individuals or groups.

Illustrative Example¶

Health care | Environmental Sciences

Illustrative examples coming soon!

Model Testing and Validation¶

Task Description¶

Model testing and validation involves using the testing set from the previous task to evaluate the performance of the trained model. The evaluation of the model can be carried out against a variety of metrics, but typically includes the evaluation of the model's accuracy as applied to novel data (held out from the original training data).
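As a minimal sketch of evaluating accuracy on held-out data, the example below scores a stand-in model against test labels it has not seen. The `threshold_model` function is a hypothetical placeholder for a trained model, and the inputs and labels are invented for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def threshold_model(x):
    """Hypothetical stand-in for a trained classifier."""
    return 1 if x >= 0.5 else 0


# Held-out test data (invented values, not real results).
test_inputs = [0.1, 0.4, 0.6, 0.9, 0.7]
test_labels = [0, 1, 1, 1, 0]

test_predictions = [threshold_model(x) for x in test_inputs]
test_accuracy = accuracy(test_labels, test_predictions)
print(test_accuracy)
```

Accuracy is only one of the possible metrics. Where classes are imbalanced or errors carry different costs, other measures may be more informative.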

This form of testing is sometimes known as internal validation, as it is carried out using a held-out subset of the same dataset from which the training data were drawn. In addition, the project team may also wish to evaluate the model's performance against entirely new data (external validation), which may be collected from a separate trial or even evaluated by a separate team.

Importance of Task¶

Where a dataset is split into testing and training data, or where a model's performance is evaluated against wholly new data (e.g. external validation from a separate trial or project team), there are options to assess more than just the model's performance.

For instance, testing the generalisability of a model to a new domain or context can also help ensure the model is both sustainable and fair (e.g. that it has similar levels of accuracy or performance when validated externally). In addition, the model can be evaluated for its interpretability and explainability (e.g. how well the model's predictions can be explained by the features that were used to train it). If the model has low interpretability or explainability, then the project team may wish to consider retraining the model with a different set of features or employing post hoc explanation techniques (e.g. LIME or SHAP).
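One simple way to check whether performance is similar across contexts is to compute accuracy separately for each group and compare the results. The sketch below assumes each test record carries a group label; the names `site_a` and `site_b`, the function `accuracy_by_group`, and all values are hypothetical. It is a rough first check rather than a complete fairness assessment.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Compute accuracy separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}


# Invented labels and predictions from two hypothetical collection sites.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
groups = ["site_a", "site_a", "site_a", "site_b", "site_b", "site_b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
# A large gap between groups suggests the model may not generalise evenly.
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

What counts as an acceptable gap is a judgement for the project team and affected stakeholders, not something the code can decide.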

Illustrative Example¶

Health care | Environmental Sciences

Illustrative examples coming soon!

Model Documentation¶

Task Description¶

This task involves the documentation of both formal and non-formal properties of the model and the processes by which it was developed. This includes (but is not limited to):

Data sources and summary statistics
Model used (e.g. proprietary model purchased from vendor)
Model parameters (e.g. weights)
Evaluation metrics (e.g. model performance)
Model performance (e.g. accuracy)
Model limitations (e.g. bias)
Model assumptions (e.g. normality of data)

The categories of information that are documented will depend on the project's requirements. For example, if the project is part of an academic research project, then the team may wish to follow the standard format of a scientific paper or a pre-specified protocol. In contrast, if the project is part of a commercial development project, then the team may have additional requirements based on a procurement process or other contractual obligations.

Model Cards for Model Reporting

See Mitchell et al. (2018) Model Cards for Model Reporting for one proposed template that project teams can use as a starting point for model documentation. Teams may wish to adapt this template to suit their own needs.
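Documentation of this kind can also be kept in a machine-readable form alongside the model. The sketch below is a hypothetical record structure loosely based on the categories listed above; it is not the Model Cards template itself, and the class name `ModelRecord`, its field names, and all values are placeholders to be replaced with a project's own information.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ModelRecord:
    """Hypothetical documentation record; fields mirror the categories above."""
    model_used: str
    data_sources: list
    evaluation_metrics: dict
    limitations: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)


record = ModelRecord(
    model_used="placeholder: name and version of the model",
    data_sources=["placeholder: dataset name and collection period"],
    # Metric values are left empty here; fill them in from testing and validation.
    evaluation_metrics={"accuracy": None},
    limitations=["placeholder: known sources of bias"],
    assumptions=["placeholder: e.g. normality of data"],
)

# Serialise to JSON so the record can be versioned with the model artefacts.
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```

Keeping the record in version control next to the model makes it easier to see which documentation corresponds to which trained model.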

Importance of Task¶

Clear and accessible documentation is an important form of responsible project governance for the following reasons:

In research projects it ensures reproducibility and replicability of results, as well as other values associated with open research, such as public accessibility.
In commercial projects it ensures accountability and transparency of decision-making processes, which may be required by law or by contractual obligations.
It can help affected individuals seek redress for any harms that may arise from the design, development, or deployment of data-driven technologies.
It allows project team members, who may be responsible for downstream tasks, to review and reflect on the project's progress and outcomes so far.

Thinking further

What other reasons are there for clear and accessible documentation?

Illustrative Example¶

Health care | Environmental Sciences

Illustrative examples coming soon!