8.1. Generalised linear models
This section focuses on Poisson regression, a generalised linear model (GLM) for count data.
8.1.1. Count data
In previous chapters, we learnt about linear regression for predicting quantitative (continuous) variables and logistic regression for predicting qualitative (categorical) variables. In this section, we will learn about regression for predicting count data. Count data is a special type of discrete data, where the values are non-negative integers. Examples of count data include the number of cars passing a toll booth, the number of people in a room, the number of times a word appears in a document, the number of times a user clicks on a link, and the number of times a user purchases a product.
To study the problem of predicting count data, we will use the Bikeshare dataset. The response is `bikers`, the number of hourly users of a bike sharing program in Washington, DC. Thus, this response variable is neither qualitative nor quantitative: it takes on non-negative integer values, or counts. We will consider predicting `bikers` using the features `mnth` (month of the year), `hr` (hour of the day, from 0 to 23), `workingday` (an indicator variable that equals 1 if it is neither a weekend nor a holiday), `temp` (the normalised temperature, in Celsius), and `weathersit` (a qualitative variable that takes on one of four possible values: clear; misty or cloudy; light rain or light snow; or heavy rain or heavy snow).
Get ready by importing the APIs needed from the respective libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols, poisson
%matplotlib inline
Load the data, declare the `mnth` and `hr` features as categorical, and display the first few rows.
data_url = "https://github.com/pykale/transparentML/raw/main/data/Bikeshare.csv"
data_df = pd.read_csv(data_url, header=0, index_col=0)
data_df["mnth"] = data_df["mnth"].astype("category")
data_df["hr"] = data_df["hr"].astype("category")
data_df.head(3)
season | mnth | day | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | bikers
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | Jan | 1 | 0 | 0 | 6 | 0 | clear | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16
1 | Jan | 1 | 1 | 0 | 6 | 0 | clear | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40
1 | Jan | 1 | 2 | 0 | 6 | 0 | clear | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32
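Before modelling, it is worth confirming that the response really is count-valued; here is a quick sanity check (a small addition on our part, using the `data_df` just loaded):

```python
# Confirm that the response is count-valued: a non-negative integer dtype
print(data_df["bikers"].dtype)  # expected: an integer dtype
print(data_df["bikers"].min())  # expected: >= 0
```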
8.1.2. Linear regression for predicting the number of bikers
As in the case of motivating logistic regression, let us first predict `bikers` using linear regression, via the `ols` function from the `statsmodels` library. We will use the `mnth`, `hr`, `workingday`, `temp`, and `weathersit` features as predictors.
est = ols("bikers ~ mnth + hr + temp + workingday + weathersit", data_df).fit()
est.summary()
Dep. Variable: | bikers | R-squared: | 0.675 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.673 |
Method: | Least Squares | F-statistic: | 457.3 |
Date: | Tue, 06 Feb 2024 | Prob (F-statistic): | 0.00 |
Time: | 21:48:58 | Log-Likelihood: | -49743. |
No. Observations: | 8645 | AIC: | 9.957e+04 |
Df Residuals: | 8605 | BIC: | 9.985e+04 |
Df Model: | 39 | |
Covariance Type: | nonrobust | |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | -27.2068 | 6.715 | -4.052 | 0.000 | -40.370 | -14.044 |
mnth[T.Aug] | 11.8181 | 4.698 | 2.515 | 0.012 | 2.609 | 21.028 |
mnth[T.Dec] | 5.0328 | 4.280 | 1.176 | 0.240 | -3.357 | 13.423 |
mnth[T.Feb] | -34.5797 | 4.575 | -7.558 | 0.000 | -43.548 | -25.611 |
mnth[T.Jan] | -41.4249 | 4.972 | -8.331 | 0.000 | -51.172 | -31.678 |
mnth[T.July] | 3.8996 | 5.003 | 0.779 | 0.436 | -5.907 | 13.706 |
mnth[T.June] | 26.3938 | 4.642 | 5.686 | 0.000 | 17.294 | 35.493 |
mnth[T.March] | -24.8735 | 4.277 | -5.815 | 0.000 | -33.258 | -16.489 |
mnth[T.May] | 31.1322 | 4.150 | 7.501 | 0.000 | 22.997 | 39.268 |
mnth[T.Nov] | 18.8851 | 4.099 | 4.607 | 0.000 | 10.850 | 26.920 |
mnth[T.Oct] | 34.4093 | 4.006 | 8.589 | 0.000 | 26.556 | 42.262 |
mnth[T.Sept] | 25.2534 | 4.293 | 5.883 | 0.000 | 16.839 | 33.668 |
hr[T.1] | -14.5793 | 5.699 | -2.558 | 0.011 | -25.750 | -3.408 |
hr[T.2] | -21.5791 | 5.733 | -3.764 | 0.000 | -32.817 | -10.341 |
hr[T.3] | -31.1408 | 5.778 | -5.389 | 0.000 | -42.468 | -19.814 |
hr[T.4] | -36.9075 | 5.802 | -6.361 | 0.000 | -48.281 | -25.534 |
hr[T.5] | -24.1355 | 5.737 | -4.207 | 0.000 | -35.381 | -12.890 |
hr[T.6] | 20.5997 | 5.704 | 3.612 | 0.000 | 9.419 | 31.781 |
hr[T.7] | 120.0931 | 5.693 | 21.095 | 0.000 | 108.934 | 131.253 |
hr[T.8] | 223.6619 | 5.690 | 39.310 | 0.000 | 212.509 | 234.815 |
hr[T.9] | 120.5819 | 5.693 | 21.182 | 0.000 | 109.423 | 131.741 |
hr[T.10] | 83.8013 | 5.705 | 14.689 | 0.000 | 72.618 | 94.985 |
hr[T.11] | 105.4234 | 5.722 | 18.424 | 0.000 | 94.207 | 116.640 |
hr[T.12] | 137.2837 | 5.740 | 23.916 | 0.000 | 126.032 | 148.536 |
hr[T.13] | 136.0359 | 5.760 | 23.617 | 0.000 | 124.745 | 147.327 |
hr[T.14] | 126.6361 | 5.776 | 21.923 | 0.000 | 115.313 | 137.959 |
hr[T.15] | 132.0865 | 5.780 | 22.852 | 0.000 | 120.756 | 143.417 |
hr[T.16] | 178.5206 | 5.772 | 30.927 | 0.000 | 167.206 | 189.836 |
hr[T.17] | 296.2670 | 5.749 | 51.537 | 0.000 | 284.998 | 307.536 |
hr[T.18] | 269.4409 | 5.736 | 46.976 | 0.000 | 258.198 | 280.684 |
hr[T.19] | 186.2558 | 5.714 | 32.596 | 0.000 | 175.055 | 197.457 |
hr[T.20] | 125.5492 | 5.704 | 22.012 | 0.000 | 114.369 | 136.730 |
hr[T.21] | 87.5537 | 5.693 | 15.378 | 0.000 | 76.393 | 98.714 |
hr[T.22] | 59.1226 | 5.689 | 10.392 | 0.000 | 47.970 | 70.275 |
hr[T.23] | 26.8376 | 5.688 | 4.719 | 0.000 | 15.688 | 37.987 |
weathersit[T.cloudy/misty] | -12.8903 | 1.964 | -6.562 | 0.000 | -16.741 | -9.040 |
weathersit[T.heavy rain/snow] | -109.7446 | 76.667 | -1.431 | 0.152 | -260.031 | 40.542 |
weathersit[T.light rain/snow] | -66.4944 | 2.965 | -22.425 | 0.000 | -72.307 | -60.682 |
temp | 157.2094 | 10.261 | 15.321 | 0.000 | 137.095 | 177.324 |
workingday | 1.2696 | 1.784 | 0.711 | 0.477 | -2.228 | 4.768 |
Omnibus: | 288.526 | Durbin-Watson: | 0.519 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 518.512 |
Skew: | 0.273 | Prob(JB): | 2.55e-113 |
Kurtosis: | 4.068 | Cond. No. | 131. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the above, we can see that a progression of weather from clear to cloudy is associated with 12.89 fewer bikers per hour on average. If the weather progresses further from cloudy to light rain or snow, then this results in a further change of \(-66.49 - (-12.89) = -53.60\), i.e. another 53.60 fewer bikers per hour.
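We can recover these numbers directly from the fitted model's parameters, using the term names shown in the summary above:

```python
# Effect of clear -> cloudy, and the further change from cloudy -> light rain/snow
cloudy = est.params["weathersit[T.cloudy/misty]"]
light_rain = est.params["weathersit[T.light rain/snow]"]
print(cloudy)               # about -12.89
print(light_rain - cloudy)  # about -53.60
```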
Let us study the coefficients of the `mnth` and `hr` features to understand the effect of these features on the response.
# The full list of months, for reference:
# months = ["Jan", "Feb", "March", "April", "May", "June", "July", "Aug", "Sept", "Oct", "Nov", "Dec"]
# April is the baseline (reference) level of mnth under treatment coding,
# so it has no fitted coefficient and is omitted from the list below.
months = [
    "Jan",
    "Feb",
    "March",
    "May",
    "June",
    "July",
    "Aug",
    "Sept",
    "Oct",
    "Nov",
    "Dec",
]
coef_mnth = [est.params["mnth[T.%s]" % _m] for _m in months]
# Similarly, hour 0 is the baseline level of hr, so coefficients exist for hours 1-23 only.
coef_hr = [est.params["hr[T.%d]" % _h] for _h in range(1, 24)]
Visualise the coefficients of the `mnth` and `hr` features using plots.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(months, coef_mnth, "bo-")
ax1.grid()
ax1.set_xlabel("Month")
ax2.plot(np.arange(1, 24), coef_hr, "bo-")
ax2.grid()
ax2.set_xlabel("Hour")
for ax in fig.axes:
ax.set_ylabel("Coefficient")
plt.show()
From the above figures, we can see that bike usage is highest in the spring and fall, and lowest during the winter months. Furthermore, bike usage is greatest around rush hour (the coefficients peak at 8 AM and 5 PM), and lowest overnight. Thus, at first glance, fitting a linear regression model to the `Bikeshare` dataset seems to provide reasonable and intuitive results.
Let us examine the distribution of the estimated response values. We will use the `predict` function from the `statsmodels` library to obtain the predicted values.
fig = plt.figure(figsize=(12, 7))
y_pred = est.predict(data_df.loc[:, ["mnth", "hr", "temp", "workingday", "weathersit"]])
sns.histplot(y_pred, kde=True)
plt.show()
By careful inspection, we can identify several issues from the figure above, e.g. 9.6% of the fitted values are negative! Thus, the linear regression model predicts a negative number of users during 9.6% of the hours in the dataset, which does not make sense. Furthermore, the fitted values are not integers, which is also not desirable.
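We can verify this proportion directly from the `y_pred` computed above:

```python
# Fraction of hours with a negative fitted value
print((y_pred < 0).mean())  # roughly 0.096, i.e. about 9.6% of hours
```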
This casts doubt on the suitability of linear regression for performing meaningful predictions on the `Bikeshare` data. It also raises concerns about the accuracy of the coefficient estimates, confidence intervals, and other outputs of the regression model.
Let us further examine the distribution of the response variable `bikers` against the `hr` feature, using the original scale and the log scale.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(25, 7))
sns.stripplot(data=data_df, x="hr", y="bikers", ax=ax1, alpha=0.1, color=".2")
sns.boxplot(
    data=data_df,
    x="hr",
    y="bikers",
    ax=ax1,
    width=0.9,
    hue="hr",  # assign hue explicitly to avoid seaborn's palette deprecation warning
    legend=False,
    palette="vlag",
    meanline=True,
    showmeans=True,
)
ax1.set_ylabel("Number of Bikers")
sns.stripplot(data=data_df, x="hr", y="bikers", ax=ax2, alpha=0.1, color=".2")
sns.boxplot(
    data=data_df,
    x="hr",
    y="bikers",
    ax=ax2,
    width=0.9,
    hue="hr",
    legend=False,
    palette="vlag",
    meanline=True,
    showmeans=True,
)
ax2.set_yscale("log")
ax2.set_ylabel("log(Number of Bikers)")
for ax in fig.axes:
    ax.set_xlabel("Hour")
plt.show()
This figure corresponds to Figure 4.14 in the textbook. Intuitively, when the expected number of bikers is small, its variance should be small as well. For example, at 2 AM during a heavy December snow storm, we expect that extremely few people will use a bike, and moreover that there should be little variance associated with the number of users during those conditions. This is borne out in the data: between 1 AM and 4 AM, in December, January, and February, when it is raining, there are 5.05 users, on average, with a standard deviation of 3.73. By contrast, between 7 AM and 10 AM, in April, May, and June, when skies are clear, there are 243.59 users, on average, with a standard deviation of 131.7. The mean-variance relationship is displayed in the left-hand panel of the figure above. This is a major violation of the assumptions of a linear model, which state that \(y = \beta_0 + \sum_{d=1}^{D}x_d\beta_d + \epsilon\), where \(\epsilon\) is a mean-zero error term with variance \(\sigma^2\) that is constant, and not a function of the features (covariates). Therefore, the heteroscedasticity of the data calls into question the suitability of a linear regression model.
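The group statistics quoted above can be reproduced along the following lines; the exact hour/month/weather filter is our assumption (the text does not spell it out), so the numbers may differ slightly:

```python
# Low-demand conditions: 1-4 AM, winter months, raining
low = data_df[
    data_df["hr"].isin([1, 2, 3])
    & data_df["mnth"].isin(["Dec", "Jan", "Feb"])
    & data_df["weathersit"].str.contains("rain")
]["bikers"]
# High-demand conditions: 7-10 AM, late spring, clear skies
high = data_df[
    data_df["hr"].isin([7, 8, 9])
    & data_df["mnth"].isin(["April", "May", "June"])
    & (data_df["weathersit"] == "clear")
]["bikers"]
print(low.mean(), low.std())    # small mean, small variance
print(high.mean(), high.std())  # large mean, large variance
```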
Finally, the response `bikers` is integer-valued. But under a linear model, \(y = \beta_0 + \sum_{d=1}^{D}x_d\beta_d + \epsilon\), where \(\epsilon\) is a continuous-valued error term. This means that in a linear model, the response \(y\) is necessarily continuous-valued (quantitative). Thus, the integer nature of the response `bikers` suggests that a linear regression model is not entirely satisfactory for this dataset.
Some of the problems that arise when fitting a linear regression model to the Bikeshare data can be overcome by transforming the response; for instance, we can fit the model

\[\log(y) = \beta_0 + \sum_{d=1}^{D}x_d\beta_d + \epsilon.\]
Transforming the response avoids the possibility of negative predictions, and it overcomes much of the heteroscedasticity in the untransformed data, as is shown in the right-hand panel of the figure above.
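Here is a minimal sketch of this transformed fit, assuming every `bikers` count is strictly positive (otherwise `np.log` would produce `-inf`, which is exactly the limitation discussed below); patsy evaluates `np.log` inside the formula:

```python
# Fit a linear model to the log of the response
est_log = ols(
    "np.log(bikers) ~ mnth + hr + temp + workingday + weathersit", data_df
).fit()
print(est_log.params["temp"])  # interpreted on the log(bikers) scale
```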
However, it is not quite a satisfactory solution, since predictions and inference are made in terms of the log of the response, rather than the response. This leads to challenges in interpretation, e.g. a one-unit increase in \(x_d\) is associated with an increase in the mean of the log of \(y\) by an amount \(\beta_d\). Furthermore, a log transformation of the response cannot be applied in settings where the response can take on a value of 0. Thus, while fitting a linear model to a transformation of the response may be an adequate approach for some count-valued datasets, it often leaves something to be desired. This motivates the development of a Poisson regression model that provides a much more natural and elegant approach for this task.
8.1.3. Poisson regression for predicting the number of bikers

To overcome the inadequacies of linear regression for analysing the `Bikeshare` dataset, we will study a GLM approach called Poisson regression.
8.1.3.1. The Poisson distribution
Let us first introduce the Poisson distribution. Suppose that a random variable \(y\) takes on nonnegative integer values, i.e. \(y \in \{0, 1, 2, \dots\}\). If \(y\) follows the Poisson distribution, then

(8.1) \[\mathbb{P}(y = t) = \frac{e^{-\lambda}\lambda^t}{t!}, \quad \text{for } t = 0, 1, 2, \dots\]
Here, \(\lambda>0\) is a positive real number, called the rate parameter, which is the expected value of \(y\), i.e. \(\mathbb{E}(y)\). It turns out that \(\lambda\) also equals the variance of \(y\), i.e. \(\lambda = \mathbb{E}(y) = \text{Var}(y)\). This means that if \(y\) follows the Poisson distribution, then the larger the mean of \(y\), the larger its variance. The notation \( t! \) in Equation (8.1) denotes the factorial of \(t\), i.e. \( t! = t \times (t - 1) \times (t - 2) \times \cdots \times 3 \times 2 \times 1\). The Poisson distribution is typically used to model counts, a natural choice since counts, like the Poisson distribution, take on non-negative integer values.
How to interpret the Poisson distribution?
To see how we might use the Poisson distribution in practice, let \(y\) denote the number of users of the bike sharing programme during a particular hour of the day, under a particular set of weather conditions, and during a particular month of the year. We can model \(y\) as a Poisson distribution with mean \(\mathbb{E}(y) = \lambda = 5\). This means that the probability of no users during this particular hour is \(\mathbb{P}(y = 0) = \frac{e^{-5}5^0}{0!} = e^{-5} = 0.0067 \), where 0! = 1 by convention. The probability that there is exactly one user is \( \mathbb{P}(y = 1) = \frac{e^{-5}5^1}{1!} = 0.034 \), the probability of two users is \( \mathbb{P}(y = 2) = \frac{e^{-5}5^2}{2!} = 0.084 \) and so on.
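These probabilities can be reproduced with `scipy` (an extra dependency, not used elsewhere in this section):

```python
from scipy.stats import poisson as poisson_dist  # aliased to avoid clashing with statsmodels' poisson

lam = 5  # rate parameter lambda = E(y)
for t in range(3):
    print(t, poisson_dist.pmf(t, lam))  # ~0.0067, ~0.0337, ~0.0842
```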
8.1.3.2. The Poisson regression model
In reality, we expect the mean number of users of the bike sharing program, \( \lambda = \mathbb{E}(y) \), to vary as a function of the hour of the day, the month of the year, the weather conditions, and so forth. So rather than modelling the number of bikers, \(y\), as a Poisson distribution with a fixed mean value like \(\lambda = 5\), we would like to allow the mean to vary as a function of the features (covariates). In particular, we consider the following model for the mean \( \lambda = \mathbb{E}(y) \), which we now write as \( \lambda(x_1, \cdots, x_D) \) to emphasise that it is a function of the features \( x_1, \cdots, x_D \):

\[\log(\lambda(x_1, \cdots, x_D)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D,\]

or equivalently,

(8.2) \[\lambda(x_1, \cdots, x_D) = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D}.\]

Here, \( \beta_0, \beta_1, \cdots, \beta_D \) are parameters to be estimated. Together, the above equations define the Poisson regression model.
To estimate the coefficients (parameters) \( \beta_0, \beta_1, \cdots, \beta_D \), we use the same maximum likelihood approach that we adopted for logistic regression. The likelihood function for the Poisson regression model is

\[\ell(\beta_0, \beta_1, \cdots, \beta_D) = \prod_{n=1}^{N} \frac{e^{-\lambda(\mathbf{x}_n)}\lambda(\mathbf{x}_n)^{y_n}}{y_n!},\]

where \(\lambda(\mathbf{x}_n) = e^{\beta_0 + \sum_{d=1}^{D}x_{nd}\beta_d}\), and \(y_n\) is the response for the \(n\text{th}\) sample (observation). We obtain the coefficient estimates as those that maximise the likelihood \(\ell(\beta_0, \beta_1, \cdots, \beta_D)\), i.e. those that make the observed data as likely as possible.
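In practice `statsmodels` performs this maximisation for us, but to see what is being optimised, here is a minimal sketch of the (negative) log-likelihood; `X` (a design matrix with an intercept column) and `y` are illustrative names, not variables defined elsewhere in this section:

```python
import numpy as np


def poisson_nll(beta, X, y):
    """Negative log-likelihood of the Poisson regression model,
    dropping the constant sum of log(y_n!) terms."""
    eta = X @ beta     # linear predictor beta_0 + sum_d x_d beta_d
    lam = np.exp(eta)  # rate lambda(x_n)
    return np.sum(lam - y * eta)
```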
We now fit a Poisson regression model to the `Bikeshare` dataset using the `statsmodels` package.
est = poisson("bikers ~ mnth + hr + temp + workingday + weathersit", data_df).fit()
est.summary()
Optimization terminated successfully.
Current function value: 16.256725
Iterations 9
Dep. Variable: | bikers | No. Observations: | 8645 |
---|---|---|---|
Model: | Poisson | Df Residuals: | 8605 |
Method: | MLE | Df Model: | 39 |
Date: | Tue, 06 Feb 2024 | Pseudo R-squ.: | 0.7459 |
Time: | 21:49:00 | Log-Likelihood: | -1.4054e+05 |
converged: | True | LL-Null: | -5.5298e+05 |
Covariance Type: | nonrobust | LLR p-value: | 0.000 |
 | coef | std err | z | P>|z| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | 3.3854 | 0.010 | 333.686 | 0.000 | 3.365 | 3.405 |
mnth[T.Aug] | 0.1296 | 0.005 | 26.303 | 0.000 | 0.120 | 0.139 |
mnth[T.Dec] | -0.0048 | 0.005 | -0.951 | 0.342 | -0.015 | 0.005 |
mnth[T.Feb] | -0.4656 | 0.006 | -76.909 | 0.000 | -0.478 | -0.454 |
mnth[T.Jan] | -0.6917 | 0.007 | -98.996 | 0.000 | -0.705 | -0.678 |
mnth[T.July] | 0.0821 | 0.005 | 15.545 | 0.000 | 0.072 | 0.092 |
mnth[T.June] | 0.2017 | 0.005 | 41.673 | 0.000 | 0.192 | 0.211 |
mnth[T.March] | -0.3153 | 0.005 | -58.225 | 0.000 | -0.326 | -0.305 |
mnth[T.May] | 0.2189 | 0.004 | 49.792 | 0.000 | 0.210 | 0.228 |
mnth[T.Nov] | 0.1287 | 0.005 | 27.880 | 0.000 | 0.120 | 0.138 |
mnth[T.Oct] | 0.2460 | 0.004 | 56.949 | 0.000 | 0.238 | 0.255 |
mnth[T.Sept] | 0.2120 | 0.005 | 46.873 | 0.000 | 0.203 | 0.221 |
hr[T.1] | -0.4716 | 0.013 | -36.278 | 0.000 | -0.497 | -0.446 |
hr[T.2] | -0.8088 | 0.015 | -55.220 | 0.000 | -0.837 | -0.780 |
hr[T.3] | -1.4439 | 0.019 | -76.631 | 0.000 | -1.481 | -1.407 |
hr[T.4] | -2.0761 | 0.025 | -83.728 | 0.000 | -2.125 | -2.027 |
hr[T.5] | -1.0603 | 0.016 | -65.957 | 0.000 | -1.092 | -1.029 |
hr[T.6] | 0.3245 | 0.011 | 30.585 | 0.000 | 0.304 | 0.345 |
hr[T.7] | 1.3296 | 0.009 | 146.822 | 0.000 | 1.312 | 1.347 |
hr[T.8] | 1.8313 | 0.009 | 211.630 | 0.000 | 1.814 | 1.848 |
hr[T.9] | 1.3362 | 0.009 | 148.191 | 0.000 | 1.318 | 1.354 |
hr[T.10] | 1.0912 | 0.009 | 117.831 | 0.000 | 1.073 | 1.109 |
hr[T.11] | 1.2485 | 0.009 | 137.304 | 0.000 | 1.231 | 1.266 |
hr[T.12] | 1.4340 | 0.009 | 160.486 | 0.000 | 1.417 | 1.452 |
hr[T.13] | 1.4280 | 0.009 | 159.529 | 0.000 | 1.410 | 1.445 |
hr[T.14] | 1.3793 | 0.009 | 153.266 | 0.000 | 1.362 | 1.397 |
hr[T.15] | 1.4081 | 0.009 | 156.862 | 0.000 | 1.391 | 1.426 |
hr[T.16] | 1.6287 | 0.009 | 184.979 | 0.000 | 1.611 | 1.646 |
hr[T.17] | 2.0490 | 0.009 | 239.221 | 0.000 | 2.032 | 2.066 |
hr[T.18] | 1.9667 | 0.009 | 229.065 | 0.000 | 1.950 | 1.983 |
hr[T.19] | 1.6684 | 0.009 | 190.830 | 0.000 | 1.651 | 1.686 |
hr[T.20] | 1.3706 | 0.009 | 152.737 | 0.000 | 1.353 | 1.388 |
hr[T.21] | 1.1186 | 0.009 | 121.383 | 0.000 | 1.101 | 1.137 |
hr[T.22] | 0.8719 | 0.010 | 91.429 | 0.000 | 0.853 | 0.891 |
hr[T.23] | 0.4814 | 0.010 | 47.164 | 0.000 | 0.461 | 0.501 |
weathersit[T.cloudy/misty] | -0.0752 | 0.002 | -34.528 | 0.000 | -0.080 | -0.071 |
weathersit[T.heavy rain/snow] | -0.9263 | 0.167 | -5.554 | 0.000 | -1.253 | -0.599 |
weathersit[T.light rain/snow] | -0.5758 | 0.004 | -141.905 | 0.000 | -0.584 | -0.568 |
temp | 0.7853 | 0.011 | 68.434 | 0.000 | 0.763 | 0.808 |
workingday | 0.0147 | 0.002 | 7.502 | 0.000 | 0.011 | 0.018 |
Let us recompute the coefficients of the `mnth` and `hr` features from the Poisson fit and produce the same plots as before, to visualise the relationship between the number of bikers and these features.
# est now holds the Poisson fit, so recompute the coefficients before plotting
# (the earlier coef_mnth and coef_hr came from the linear model)
coef_mnth = [est.params["mnth[T.%s]" % _m] for _m in months]
coef_hr = [est.params["hr[T.%d]" % _h] for _h in range(1, 24)]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(months, coef_mnth, "bo-")
ax1.grid()
ax1.set_xlabel("Month")
ax2.plot(np.arange(1, 24), coef_hr, "bo-")
ax2.grid()
ax2.set_xlabel("Hour")
for ax in fig.axes:
    ax.set_ylabel("Coefficient")
plt.show()
We again see that bike usage is highest in the spring and fall and during rush hour, and lowest during the winter and in the early morning hours. In general, bike usage increases as the temperature increases, and decreases as the weather worsens. Interestingly, the coefficient associated with `workingday` is statistically significant under the Poisson regression model, but not under the linear regression model.
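We can confirm this from the Poisson fit directly; note that `est` now holds the Poisson model, while the linear model's p-value of 0.477 can be read from its summary above:

```python
# p-value of workingday under the Poisson fit; compare with 0.477
# in the earlier linear regression summary
print(est.pvalues["workingday"])  # effectively 0, i.e. significant
```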
8.1.3.3. Key distinctions between Poisson and linear regression
Some important distinctions between the Poisson regression model and the linear regression model are as follows:
- Interpretation: To interpret the coefficients in the Poisson regression model, we must pay close attention to Equation (8.2), which states that an increase in \(x_d\) by one unit is associated with a change in \(\mathbb{E}(y) = \lambda\) by a factor of \(\textrm{exp}(\beta_d)\). For example, a change in weather from clear to cloudy skies is associated with a change in mean bike usage by a factor of \(\textrm{exp}(-0.08) = 0.923\), i.e. on average, only 92.3% as many people will use bikes when it is cloudy relative to when it is clear. If the weather worsens further and it begins to rain, then the mean bike usage will further change by a factor of \( \textrm{exp}(-0.5) = 0.607\), i.e. on average only 60.7% as many people will use bikes when it is rainy relative to when it is cloudy. (See the short check after this list.)
- Mean-variance relationship: As mentioned earlier, under the Poisson model, \( \lambda = \mathbb{E}(y) = \text{Var}(y) \). Thus, by modelling bike usage with a Poisson regression, we implicitly assume that mean bike usage in a given hour equals the variance of bike usage during that hour. By contrast, under a linear regression model, the variance of bike usage always takes on a constant value. From the above figure (the one corresponding to Figure 4.14 in the textbook), in the `Bikeshare` data, when biking conditions are favourable, both the mean and the variance in bike usage are much higher than when conditions are unfavourable. Thus, the Poisson regression model is able to handle the mean-variance relationship seen in the `Bikeshare` data in a way that the linear regression model is not.
- Nonnegative fitted values: There are no negative predictions using the Poisson regression model. This is because the Poisson model itself only allows for nonnegative values, as shown in Equation (8.1). By contrast, when we fit a linear regression model to the `Bikeshare` dataset, almost 10% of the predictions were negative.
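To make the first and last points concrete, here is a quick check on the fitted Poisson model, reusing the variable names from the cells above:

```python
import numpy as np

# Multiplicative effect of clear -> cloudy weather on mean bike usage
print(np.exp(est.params["weathersit[T.cloudy/misty]"]))  # ~0.93

# No negative fitted values under the Poisson model
y_pred_pois = est.predict(
    data_df.loc[:, ["mnth", "hr", "temp", "workingday", "weathersit"]]
)
print((y_pred_pois < 0).sum())  # 0
```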
8.1.4. GLM characteristics

Based on the GLMs we have seen so far (linear regression, logistic regression, and Poisson regression), we can summarise their common characteristics as follows:
- Each approach uses features (predictors) \( x_1, x_2, \dots, x_D \) to predict a response (output) \( y \). We assume that, conditional on \( x_1, x_2, \dots, x_D \), \( y \) belongs to a certain family of distributions. For linear regression, we typically assume that \(y\) follows a Gaussian (normal) distribution. For logistic regression, we assume that \(y\) follows a Bernoulli distribution. Finally, for Poisson regression, we assume that \(y\) follows a Poisson distribution.
- Each approach models the mean (expectation) of \(y\) as a function of the features (predictors). In linear regression, the mean of \(y\) takes the form
\[ \mathbb{E}(y|x_1, x_2, \dots, x_D) = \beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D. \]
In logistic regression, the mean of \(y\) takes the form
\[ \mathbb{E}(y|x_1, x_2, \dots, x_D) = \mathbb{P}(y=1|x_1, x_2, \dots, x_D) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D}}. \]
In Poisson regression, it takes the form
\[ \mathbb{E}(y|x_1, x_2, \dots, x_D) = \lambda(x_1, x_2, \dots, x_D) = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D}. \]
- The above three equations can be expressed using a link function, \( \eta \), which applies a transformation to \( \mathbb{E}(y|x_1, x_2, \dots, x_D) \) so that the transformed mean is a linear function of the features (predictors). That is,
(8.3) \[\eta(\mathbb{E}(y|x_1, x_2, \dots, x_D)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D.\]
The link functions for linear, logistic and Poisson regression are \(\eta(\mu) = \mu \), \( \eta(\mu) = \log(\mu/(1 - \mu)) \), and \(\eta(\mu) = \log(\mu)\), respectively.
The Gaussian, Bernoulli and Poisson distributions are all members of a wider class of distributions, known as the exponential family. Other well known members of this family are the exponential distribution, the Gamma distribution, and the negative binomial distribution. In general, we can perform a regression by modelling the response \(y\) as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the features (predictors) via the link function above in Equation (8.3). Any regression approach that follows this very general recipe is known as a generalised linear model (GLM). Thus, linear regression, logistic regression, and Poisson regression are three examples of GLMs. Other examples not covered here include Gamma regression and negative binomial regression.
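To make this recipe concrete, the Poisson fit above can equivalently be expressed through `statsmodels`' generic GLM interface, where the family (and hence the link function) is stated explicitly; this is a sketch equivalent to the `poisson()` call used earlier:

```python
import statsmodels.api as sm
from statsmodels.formula.api import glm

# Same model as before, now phrased as a GLM with an explicit family;
# the log link is the default for the Poisson family
est_glm = glm(
    "bikers ~ mnth + hr + temp + workingday + weathersit",
    data_df,
    family=sm.families.Poisson(),
).fit()
print(est_glm.params["temp"])  # matches the earlier Poisson fit (~0.785)
```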
8.1.5. Exercises
1. Logistic regression is a generalised linear model (GLM).
a. True
b. False
Compare your answer with the solution below
a. True.
2. There are no negative predictions using the Poisson regression model.
a. True
b. False
Compare your answer with the solution below
a. True.
3. For linear regression, we typically assume that \(y\) (output) follows a Bernoulli distribution.
a. True
b. False
Compare your answer with the solution below
b. False. Linear regression assumes that the output (\(y\)) follows a Gaussian (normal) distribution.