Project Transparency¶
A team of lawyers are building a case and are at the stage of undertaking discovery—a process of gathering information about a case, which typically requires a legal team to request information from another company or organisation (e.g. the legal team representing the other party in a dispute). One area of concern is access to information regarding the use of an algorithmic decision-making system, which was used to make a decision that is now subject to dispute in this case. They make a request to see information about the algorithm, including how it was designed, who was involved, and how it works.
When the team of lawyers receive the requested information, there are two problems:
- The other team have sent across mountains of documents and files in an attempt to bury any incriminating information (e.g. thousands of hours of meeting transcripts, emails, and other records).
- Within this documentation there is information about the structure of the algorithm, but it is written in technical jargon that is hard for the lawyers to understand.
This hypothetical scenario introduces two concepts of relevance to both explainability in general, and project transparency in particular:
- Transparency is not the same as accessibility. In the example above, the second team of lawyers have made information "accessible" to the first team in principle, but have done so in a manner that few would consider meaningfully transparent.[^1]
- Transparency is necessary for explainability. The first team of lawyers will need to be able to explain the algorithm's structure in court in a way that is understandable to the judge and jury. Developing this explanation requires a certain degree of transparency in the documentation about how the algorithm was designed, developed, and deployed.
As with many of the SAFE-D principles, explainability is closely related to and overlaps with other neighbouring concepts, such as accessibility and transparency.
In the course of this section we will look at how these related concepts intersect, some practical choices and mechanisms by which project transparency and accessibility can be achieved, and how transparency and accessibility can improve and enhance explainability. Before we do this, however, let's consider a scenario that raises two further questions: what are we trying to explain, and where do we need transparency?
What are we trying to explain? Where do we need greater transparency?¶
Consider the following scenario.[^2]
A team of data analysts who work for a travel booking website are asked to explain why a model has altered its predictions about customer purchasing behaviour.
This time, the model is used to drive a recommender system, which shows holiday packages to customers based on its predictions about which are most likely to be purchased.
Perhaps the system is recommending significantly more trips to beach resorts, whereas previously it was recommending ski trips.
Here, if the features used by the model were investigated, it would be easy enough to identify that `season` is a feature with high importance for the model.
It is well known that customers alter their purchasing behaviour between seasons (e.g. Winter, Summer).[^3]
From this we could explain the change in the model's predictions as a result of a significant change in the data distribution, which is itself a representation of a change in the underlying phenomena (i.e. changing seasons).
Simple enough.
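To make this concrete, here is a sketch of how the analysts might confirm that `season` dominates the model's behaviour. The data, feature names, and model below are invented for illustration (the scenario does not specify them), and permutation importance is just one of several techniques that could be used.

```python
# Illustrative only: synthetic booking data in which the purchased package type
# depends mostly on the (made-up) `season` feature.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 5_000

X = pd.DataFrame({
    "season": rng.integers(0, 4, n),        # 0=Winter, 1=Spring, 2=Summer, 3=Autumn
    "days_browsing": rng.poisson(3, n),
    "past_bookings": rng.integers(0, 10, n),
})
# Target: 1 = beach resort purchased, 0 = ski trip purchased (driven by season here)
y = (X["season"] == 2).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: how much does shuffling each feature degrade accuracy?
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Run on data like this, `season` would unsurprisingly dominate the importance scores, which is exactly the kind of evidence the analysts could use to ground their explanation of the shift in recommendations.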
But now let's assume there is another change, which results in a significant drop in conversion rate (i.e. the ratio of the number of people who view, say, a holiday deal to the number who actually purchase it).
That is, customers are not just booking different holidays; they are booking far fewer holidays overall.
Again, this may seem like another case where there is a need to explain the model's behaviour in terms of an underlying shift in the data distribution, which is in turn representative of some change in the underlying phenomena.
However, this time, let's pretend that the problem turns out to be a fault with a third-party piece of software, used as a dependency in the team's data pipeline, which is now causing the data about a user's `location` to be incorrectly recorded.
As it turns out, the company's model has learned that those who live in affluent neighbourhoods are more likely to purchase more expensive packages, and the company's recommendation system uses this to show customers holidays that are in their predicted price range, or to dynamically alter the price of holiday packages based on their estimated "willingness-to-pay"—two ethically dubious practices known as personalised and dynamic pricing.[^4]
However, due to the aforementioned fault in the data pipeline, all customers are now being shown the same, more expensive, holiday deals because their `postcodes` are all being recorded as located in affluent neighbourhoods.
As such, fewer customers are purchasing their packages, because they cannot afford them, and the conversion rate has dropped.
Again, there is no fault with the model (or its parameters). Rather, the target of any explanation lies in the data and the generative mechanisms responsible for producing that data. The model is still making the same predictions, but those predictions are now incorrect, and the recommender system is unable to recommend suitable holiday packages to customers.
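In a case like this, the useful explanation (and the fix) comes from inspecting the data pipeline rather than the model's parameters. As a purely illustrative sketch, a simple check on the diversity of a hypothetical `postcode_area` field, run before data reach the model, might have flagged the fault early; the column name, threshold, and example values below are assumptions rather than details from the scenario.

```python
# Illustrative upstream data check: warn if an incoming batch has collapsed to a
# suspiciously small number of distinct location values (as in the faulty pipeline above).
import pandas as pd

def check_postcode_diversity(batch: pd.DataFrame,
                             reference: pd.DataFrame,
                             column: str = "postcode_area",
                             min_ratio: float = 0.5) -> None:
    """Warn if the batch contains far fewer distinct values than the reference snapshot."""
    observed = batch[column].nunique()
    expected = reference[column].nunique()
    if expected and observed / expected < min_ratio:
        print(f"WARNING: only {observed} distinct '{column}' values in this batch "
              f"(reference snapshot has {expected}); check the upstream location dependency.")

# Hypothetical example: every customer suddenly appears to live in the same area.
reference = pd.DataFrame({"postcode_area": ["EC1", "M1", "LS2", "G1", "CF10", "B1"]})
batch = pd.DataFrame({"postcode_area": ["EC1"] * 100})
check_postcode_diversity(batch, reference)
```

A check like this explains nothing about the model itself; its value lies in making the data pipeline transparent enough that faults of this kind can be located and explained quickly.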
This hypothetical example exposes an important point about transparency and explainability: the locus of our explanation will not always be the model. Instead, the focus of an explanation may be the data pipeline used to drive the model's predictions or other parts of the system, such as the user interface. As such, the sort of transparency that we are interested in is not merely the transparency of the model itself, but rather the transparency of the project (and system) as a whole. We will look at model interpretability techniques in the next section, some of which can also help project teams debug or identify the source of issues at other points in a project's lifecycle.
What about the transparency of the learning algorithm?
Related to the previous example, it is also possible that when trying to diagnose or triage issues to explain a model's (perhaps unexpected) behaviour, the algorithm by which a model was trained can be an informative source when constructing an explanation. For instance, understanding how a convolutional neural network learns to classify images (e.g. the initial layers that extract predictive features) may help to diagnose (and subsequently explain) why a model is misclassifying images.
A famous example here is an algorithm that learned to classify pictures of huskies as wolves because of the presence of snow in the background of the images. The algorithm learned that snow in the background was a good predictor of an image being of a wolf—as most wolves in the training set were shown against a snowy backdrop.[^5] However, relying on this feature meant the model exploited a spurious correlation that did not hold true for the actual subjects of the images (e.g. huskies photographed in the snow).
The type of transparency that is required here pertains to the algorithmic process by which the predictive model is trained and developed.
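Model interpretability techniques are covered in more detail in the next section, but as a brief illustration of how a spurious feature like the snowy background might be surfaced, the sketch below uses the `lime` package's image explainer (the approach introduced in the paper cited above) to highlight the regions of an image a classifier relies on. The `model` and `image` objects are placeholders that are assumed to exist; they are not taken from the original study.

```python
# Sketch only: assumes `model` is an image classifier whose `predict` method maps a
# batch of (H, W, 3) arrays to class probabilities, and `image` is a single such array.
from lime import lime_image
from skimage.segmentation import mark_boundaries

explainer = lime_image.LimeImageExplainer()

explanation = explainer.explain_instance(
    image.astype("double"),   # the husky/wolf image under investigation (placeholder)
    model.predict,            # classifier function: images -> class probabilities
    top_labels=2,
    hide_color=0,
    num_samples=1000,
)

# Show the superpixels that most strongly support the top predicted label; in the
# husky/wolf case, this is where the snowy background would light up.
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0],
    positive_only=True,
    num_features=5,
    hide_rest=False,
)
highlighted = mark_boundaries(temp / 255.0, mask)
```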
As you may tell from the above two examples, the nature of our problem will determine the location of the desired transparency and the locus of our explanation. However, this will largely remain a context-dependent issue, and so in this section we will take a simpler approach by advocating for global project transparency by default. Where there are overriding factors, such as intellectual property concerns or sensitive information disclosure issues, these can be evaluated by the project team on an ad hoc basis.
What does responsible project transparency look like?¶
In a previous module (required for this module) we introduced the project lifecycle model. This model is a useful framework for thinking about the different stages of a project, and the different types of transparency that may be required at each stage. Rather than going through each stage separately, we will instead group the stages (and their corresponding tasks) into several categories that can help us understand the practical mechanisms and processes that are likely to be required to achieve transparency in a project.
- Tasks that involve choices about how a project should be governed (e.g. defining the nature of the problem that a data-driven technology is designed to address and the algorithmic procedure by which it is implemented)
- Tasks that involve what we can term 'data stewardship' (i.e. the management of data and the data pipeline)
- Tasks that involve the engagement of stakeholders (e.g. members of the public, or other professionals in the relevant domain).
Starting with the first category, the following tasks are significant (although not exhaustive) examples of decisions or actions that can have an impact on explainability, and, therefore, require certain levels of transparency:
- Determining the Problem the System is Designed to Address: this task includes information about why the problem is important and why the technical description (e.g. the translation of a set of input variables into target variables) is adequate for the problem at hand. For instance, why a set of features about a candidate are adequate measures for assessing their `suitability for a job role`. Aside from the technical "solution" to the problem, there is also a social dimension that needs to be justified, such as why an automated system is appropriate for use in hiring decisions (e.g. demonstrating that the system is not biased against protected groups).
- Identification and Mitigation of Biases: decisions about which biases may be relevant for the project and why any chosen mitigation strategies are sufficient to address them.
Tasks such as these may pose challenging questions during the process of deliberation, but when it comes to project transparency, clear and accessible documentation of the decisions made is usually sufficient to ensure good levels of transparency.
The second category of tasks, data stewardship, is also important for explainability, and the following are illustrative examples of tasks that would require appropriate levels of transparency:
- Data Provenance: the source of the data used throughout the project lifecycle has myriad implications for explainability, including explanations about how the quality of the data was evaluated, or how the legal basis for its use was established. Clear and accessible documentation on data sources will therefore be crucial for project transparency.
- Data Pipeline Engineering: many stakeholders will be interested in assessing the safety and security of data, especially where it contains sensitive or personal information. Being able to provide explanations about how the data pipeline was constructed, therefore, will be important for explainability.
- Exploratory and Confirmatory Data Analysis: although data analysis is highly iterative, a clear record of the analysis techniques employed and the rationale for their use can help ensure a high level of transparency.
Many data scientists, analysts, and engineers are already familiar with the use of notebooks for documenting data processing and analysis (e.g. Jupyter). Similarly, the use of open access repositories for storing and sharing code and data is becoming more common—guides and templates are available to support those who may be unfamiliar with transparency best practices of this kind.
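By way of illustration, a lightweight, machine-readable provenance record of the kind that might sit alongside a dataset in such a repository is sketched below. The field names and values are invented assumptions rather than a formal standard; established templates (such as datasheets for datasets) offer a fuller structure.

```python
# Illustrative only: a minimal provenance record written alongside a dataset.
# Field names and values are hypothetical, not drawn from any standard or from this module.
import json
from datetime import date

provenance = {
    "dataset": "customer_bookings_2023",
    "source": "internal booking platform export",
    "collected_on": str(date(2023, 1, 15)),
    "legal_basis": "legitimate interest (see DPIA ref. 2023-04)",
    "quality_checks": ["duplicate removal", "postcode validation", "missingness report"],
    "known_limitations": ["location field depends on a third-party geocoding library"],
    "steward": "data-engineering@example.org",
}

with open("customer_bookings_2023.provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

Even a record this small answers several of the questions a stakeholder or auditor is likely to ask about where the data came from, how it was checked, and who is responsible for it.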
Finally, the third category emphasises the importance of accessibility for those who are directly or indirectly involved with a project:
- Stakeholder Identification: choices made about which stakeholders are relevant to the project, and, by extension, which are treated as irrelevant or non-significant, will affect the types of explanations that are prioritised (e.g. for members of the public or professionals in the relevant domain). This is not limited to those who are involved with, say, participatory design workshops early on to establish high-level constraints. Rather, it can extend across the whole project lifecycle to include, say, users of the system being developed, to ensure they are able to access the information needed to carry out their associated responsibilities.
- External Auditing: some projects will require auditing as a prerequisite for the deployment of a system, whether by a separate regulatory body or by a partner organisation. The degree of transparency required will depend on the nature of the audit, so early engagement and assessment of the transparency needs and the basis for any requested explanations can help ensure that the project team is able to meet the audit's requirements.
Questions
- What other tasks can you think of, which may occur during one of the project lifecycle stages, that would require transparency?
- How would this transparency be achieved and how would it contribute to explaining any decisions or actions taken?
Limits of Transparency¶
Like explainability, transparency is a nuanced concept, and it is important to recognise that there are limits to what can be achieved and what may be desirable.
Firstly, while this section has proposed examples of mechanisms and processes for documenting and sharing information about a project and its associated tasks, it should be recognised that these tasks add additional resource demands to the project. This can create (sometimes insurmountable) barriers for smaller teams, ranging from the obvious ones—not having sufficient time to accommodate these tasks while also meeting deadlines—to more subtle (but nonetheless important) ones, such as the potential for increased stress and burnout, especially for more junior members of the team who may be expected to take on these types of tasks.[^6] Therefore, as with some of the other SAFE-D principles, it is important to adopt a meta-principle of proportionality when determining the extent to which a principle should be applied. While doing the bare minimum is rarely an ethically praiseworthy approach, spending too much time on a task that is not proportionate to the potential impact of the decision or action in question can also be a waste of resources.
Secondly, for some projects, transparency may not be possible for a variety of reasons: intellectual property restrictions, legal restrictions on the disclosure of sensitive information and the need to protect the privacy of individuals, or security concerns about the potential for malicious actors to exploit the information being shared. In cases such as these, it is important to remember that the SAFE-D principles are not absolute rules—they are guides and starting points for further reflection. Providing overriding reasons for why data have not been made accessible, or why some decisions about a project's governance have not been disclosed publicly, is therefore acceptable—and may even be the appropriate form of transparency (i.e. transparency of reasons).
[^1]: Put aside the question of whether the second team of lawyers are acting in a responsible manner here, as it would be easy to argue that they are upholding their professional duties to their client by sending over mountains of evidence.

[^2]: This scenario is inspired by and adapted from this article: Lazaridis, D. (2021). Explainable AI (XAI) and interpretable machine learning (IML) models. Ambiata. Accessed: January 22, 2023.

[^3]: Admittedly, this could probably have been inferred without the expertise of a data analyst.

[^4]: This example refers to a practice known as 'personalised pricing', or sometimes 'price discrimination'. Neither is a new practice (see here), but the widespread use of algorithmic techniques is enabling more dynamic and hyper-personalised forms of both personalised pricing and price discrimination (see this article).

[^5]: Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). https://dl.acm.org/doi/abs/10.1145/2939672.2939778

[^6]: See this article on "glue work" for an illuminating discussion of the unfair distribution of tasks within projects, and the professional development barriers this can create for specific groups (e.g. women, PhD students) in disciplines such as data science or software engineering.