import datetime
import json
import os
import pickle
import re
import urllib
import community
import contextily as cx
import folium
import matplotlib
import matplotlib.pyplot as plt
import movingpandas as mpd
import networkx as nx
import numpy as np
import pandas as pd
import pyproj
import requests as requests
import seaborn as sns
from folium.plugins import TimestampedGeoJson
from geopandas import GeoDataFrame, points_from_xy
from shapely.geometry import Point
from tqdm.auto import tqdm
# Set the default figure size
plt.rcParams["figure.figsize"] = (10, 6)
# This sets the environment variable PROJ_LIB to the directory containing the data files
# needed by the pyproj library. This allows the pyproj library to find and access the
# data files it needs.
os.environ["PROJ_LIB"] = pyproj.datadir.get_data_dir()
Reviewers
- Ryan Chan
- David Llewellyn-Jones
Introduction
In recent years, London has become a much more cyclist-friendly city, thanks in large part to the Transport for London (TfL) Santander Bike Scheme. The scheme was launched in July 2010 (until 2015 the scheme was sponsored by Barclays Bank) and has since become a popular way for Londoners to get around the city.
The bicycles, colloquially called “Boris Bikes” after the Mayor of London at the time, are part of a sharing system that allows users to pick them up and drop them off at designated docking stations located around London. The bikes are available 24 hours a day, seven days a week, and are used by Londoners and visitors for a variety of journeys, from commuting to leisure rides. Riders can purchase access to the scheme through a range of payment options, including yearly and monthly memberships as well as a day pass. The scheme also offers a variety of discounts for students, National Health Service workers, and users of the Cycle to Work scheme.
The scheme has been praised for its convenience and affordability and has become a popular way for Londoners to get around the city. According to some studies it has had a positive effect on the health of its users by increasing physical activity within the capital. As reported by TfL in mid 2022 more than 111.2 million journeys had been made using the cycles, with the record for cycle hires in a single day being 73,000. In October 2022, TfL introduced new e-bikes to the scheme, the first docked e-bikes in London.
In addition to this, cycling infrastructure has played an essential role in London, as in many other cities, during the pandemic. With many people avoiding public transport and opting to cycle instead, Boris Bikes were used by many commuters on their journeys. Several other city bike schemes, such as Lime and Tier, have also become available in London, but they are all much more recent and don’t have data publicly available the way Boris Bikes do.
For these reasons, in this Turing Data Story we have decided to study the data set provided by TfL on Boris Bike journeys. It is an opportunity to analyse how the scheme and its use has changed over time and how it has affected the daily lives of Londoners. We will explore general statistics of bike usage such as how the number of bikes in use and their utilisation has changed; the different ways bikes are used and move around the city; and try to understand the patterns of mobility in London.
Table of Contents
- Data description
- Statistics on bike usage
- A day in the life of a bike
- Bike mobility patterns
- Conclusions
Before we get to the actual work, in the cell below we import all the Python modules we will need in the course of this story, and set a couple of global constants related to e.g. plotting. In addition to running `pip install -r requirements.txt` to install all these dependencies, you may find that you need to install some binary libraries that the geospatial packages depend on. On the authors’ Macs, `brew install gdal` does the trick.
The story makes use of the Cycle Streets API, which in turn uses OpenStreetMap map data.
Data description
TfL provides an open data set containing all bike journeys since the launch of the scheme in July 2010, available under a permissive licence. The data set has information about each journey such as start and end stations, start and end dates and times, and the duration of the journey. For this story we are using all available data from the start of the scheme to January 2022.
An example of the format of the data provided by TfL is the following:
Rental Id | Duration | Bike Id | End Date | EndStation Id | EndStation Name | Start Date | StartStation Id | Start Station Name |
---|---|---|---|---|---|---|---|---|
85290142 | 300 | 7125 | 25/03/2019 13:37 | 624 | Courland Grove, Wandsworth Road | 25/03/2019 13:32 | 772 | Binfield Road, Stockwell |
85323786 | 180 | 6711 | 26/03/2019 14:25 | 400 | George Street, Marylebone | 26/03/2019 14:22 | 6 | Broadcasting House, Marylebone |
Note that the columns End Date and Start Date are actually datetimes rather than dates. The Duration column is in seconds.
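To make the format concrete, the snippet below is a minimal sketch of how one such raw extract could be read with pandas, relying on the imports at the top of this story. The file name is hypothetical, but the column names and the day-first date format follow the example above.
# Sketch: reading one raw TfL extract (the file name is hypothetical)
raw = pd.read_csv(
    "data/JourneyDataExtract25Mar2019-31Mar2019.csv",  # hypothetical path
    parse_dates=["Start Date", "End Date"],
    dayfirst=True,  # dates are written DD/MM/YYYY
)
# Duration is given in seconds; convert it to a timedelta for convenience
raw["Duration"] = pd.to_timedelta(raw["Duration"], unit="s")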
Data processing and cleaning
The data covers a wide time range, and the format has seen several small changes in that period. Consequently, having a single table with all the journey data requires a bit of cleaning, harmonising, and general data-wrangling. Some examples of the kinds of things one needs to deal with are:
- Data files are provided at the weekly level and need to be downloaded programmatically from the TfL web page,
- Most of the data comes in CSV files, but some time periods are in Excel spreadsheets,
- Station name (string) to station ID (integer) mappings are many-to-many: the same station often has slightly differently spelled names in different files, sometimes the same station has gone by several IDs, and sometimes the same ID has been used for entirely different stations at different times,
- Station IDs are integers, except for one (there’s always one, isn’t there?) that is a string,
- Some rows have missing data,
- Columns in CSV files have changed names a few times.
Dealing with all this is a boring affair and would deviate from the purpose of this data story, so we have decided to do the cleaning elsewhere. However, if you enjoy boredom, you can find our data-cleaning script here.
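To give a flavour of what the cleaning involves, here is a minimal sketch of one harmonisation step: mapping the assorted column spellings found across extracts onto a single canonical set before concatenating. The aliases below are illustrative rather than exhaustive.
# Sketch: harmonising column names across extracts (illustrative aliases only)
COLUMN_ALIASES = {
    "Rental Id": "rental_id",
    "Duration": "duration",
    "Bike Id": "bike_id",
    "Start Date": "start_date",
    "End Date": "end_date",
    "StartStation Id": "start_station_id",
    "Start Station Name": "start_station_name",
    "EndStation Id": "end_station_id",
    "EndStation Name": "end_station_name",
}

def harmonise_columns(raw_df):
    # keep only the columns we recognise and rename them to the canonical names
    known = {c: COLUMN_ALIASES[c] for c in raw_df.columns if c in COLUMN_ALIASES}
    return raw_df[list(known)].rename(columns=known)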
In the end, we have around 100M bicycle journeys of cleaned data (we had to drop a negligibly small number of rows that were missing or malformed beyond repair), all in one pandas DataFrame, which, at around 7 gigabytes, is thankfully just small enough to handle in memory on a modern laptop.
In addition to bike journeys, we also make use of another data set that TfL provides, which has coordinates for the locations of all stations.
We have uploaded all the necessary data sets for this story onto Zenodo. The cell below downloads the cleaned data into a subfolder called `data`.
# This downloads the data from Zenodo if it is not already present in the data folder
# The ! syntax is an IPython Notebook specialty that allows running shell commands
# in the middle of a Python control structure.
if not os.path.exists("data"):
    !mkdir data
if not os.path.exists("data/BorisBikes_journeys_cleaned_data.pickle"):
    !wget https://zenodo.org/record/7509993/files/BorisBikes_journeys_cleaned_data.pickle -P data/
if not os.path.exists("data/BorisBikes_station_names.pickle"):
    !wget https://zenodo.org/record/7509993/files/BorisBikes_station_names.pickle -P data/
if not os.path.exists("data/BorisBikes_stations_coordinates.json"):
    !wget https://zenodo.org/record/7509993/files/BorisBikes_stations_coordinates.json -P data/
Load processed data
Let’s load the data into a pandas DataFrame and take a look at it.
RAW = "data/BorisBikes_journeys_cleaned_data.pickle"
LOCATION_REF = "data/BorisBikes_stations_coordinates.json"

station_locations_df = pd.read_json(LOCATION_REF).T
station_locations_df.head()
lat | lon | |
---|---|---|
85 | 51.500647 | -0.078600 |
86 | 51.489479 | -0.115156 |
87 | 51.516468 | -0.079684 |
88 | 51.518587 | -0.132053 |
89 | 51.526250 | -0.123509 |
# load raw data
df = pd.read_pickle(RAW)
len(df)
97656969
As mentioned above we have around 100 million rows.
Some trips have bogus dates from before the scheme started, or from the future. We drop those rows from our data set. We do allow for journeys that have no end date, since there’s quite a lot of them, and presumably they could mark bikes that were lost or broke down.
EARLIEST_DATE = datetime.datetime(2010, 1, 1)
# filter out of range dates
df = df[df["start_date"] > EARLIEST_DATE]
# allow NA for end dates
df = df[(df["end_date"] > EARLIEST_DATE) | df["end_date"].isna()]
# also drop entries where start date is after the end date
df = df[df["start_date"] < df["end_date"]]
# recalc duration
df["duration"] = df["end_date"] - df["start_date"]
We still have the vast majority of our data left after this:
len(df)
97120201
The last journey date observed in our data set is the following:
max(df["end_date"])
Timestamp('2022-01-04 23:59:00')
Statistics on bike usage
After some basic cleaning in the sections above we can start looking at some statistics. Let’s see how many unique bikes we have in our data set:
df["bike_id"].nunique()
21149
We assume that every one of those 21 thousand unique bike IDs corresponds to a different physical bike. If a bike has changed its ID or an ID has been reused, we have no way of telling that, and we assume it does not happen.
Now, let’s look at the bike with the most trips in our data set:
bike_groups = df.groupby("bike_id")

# bike with the most trips
# pick arbitrary column (without nulls) to get counts
group_counts = bike_groups.count()["filename"]
b_id = group_counts.idxmax()
n_trips = group_counts.loc[b_id]

print(
    f"""
    bike with most trips: {b_id}
    did {n_trips} trips
    """
)
bike with most trips: 8875
did 10034 trips
Trips vary in length, so let’s also see what the maximum number of hours a bike has been used is, summing up the durations of all its trips:
# bike with longest sum duration of trips
group_sums = bike_groups["duration"].sum()
b_id = group_sums.idxmax()
d_sum = group_sums.loc[b_id]

print(
    f"""
    bike with longest sum duration of trips: {b_id}
    total of {d_sum} of time in use
    """
)
bike with longest sum duration of trips: 2143
total of 215 days 05:46:00 of time in use
At the macro level we can look at the distribution of the durations of the journeys (here we have decided to exclude outliers representing very long journeys to focus on the bulk of the distribution).
fig, ax = plt.subplots()
# 10k seconds is a bit less than 3 hours. We use it as the threshold
# to exclude anomalously long trips.
mask_long_trips = df["duration"].dt.seconds < 10000
df_duration_minutes = df[mask_long_trips]["duration"].dt.seconds / 60
ax = df_duration_minutes.hist(bins=50)
plt.xlabel("Trip duration (minutes)")
plt.ylabel("Number of trips")
plt.show()
We can see in the figure above that the bikes are used mostly for short journeys, with the core of the distribution centred around journeys roughly 15 minutes long, and most journeys lasting less than 30 minutes. The latter is most likely due to the pricing cut-off, where trips get more expensive after the 30-minute mark.
Long-lived bikes
We can use this data to calculate the average lifetime of a bike. We define a bike’s lifetime as the time period in which it is present (has recorded trips) in our data set.
To do this, for a given bike we find the earliest start date and the latest end date of the bike available in the data set.
bike_start = bike_groups["start_date"].first()
bike_end = bike_groups["end_date"].last()
bike_lifetime = bike_end - bike_start
The distribution of the lifetimes in days is the following:
fig, ax = plt.subplots()
ax = bike_lifetime.dt.days.hist(bins=50)
plt.xlabel("Days")
plt.ylabel("Bikes")
plt.title("Lifetime of a bike")
plt.show()
The distribution looks quite flat, with a large peak at around 3,600 days. These are very likely bikes that have been part of the scheme from the beginning.
Now, let’s see what the average utilisation of a bike is. By this we mean the total ride duration divided by its total lifetime.
duration_sums = bike_groups["duration"].sum()
bike_utilisation = duration_sums / bike_lifetime
fig, ax = plt.subplots()
ax = bike_utilisation.hist(bins=500)
plt.xlim([0, 0.15])
plt.xlabel("Fraction of time in use")
plt.ylabel("Bikes")
plt.title("Utilisation")
plt.show()
The distribution above shows that during its lifetime an average bike gets used around 45 minutes per day.
We must note the caveat that the lifetime definition does not consider the time bikes are in maintenance or not in service for any given reason. As a consequence this utilisation metric may be an underestimate, because it counts any downtime as the bike being available but not in use.
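As a quick sanity check on the “around 45 minutes per day” reading, assuming the bulk of the utilisation distribution sits at a fraction of roughly 0.03:
# A utilisation fraction of ~0.03 corresponds to roughly 45 minutes per day
print(0.03 * 24 * 60)  # 43.2 minutes of riding per day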
Time series of bike usage
The number of bikes and stations has evolved since the beginning of the scheme, with more being added every year. Furthermore, London has changed since 2010 and bike usage might have changed with it. Hence we’ll take a look at the number of bikes and stations available and bike utilisation as functions of time.
Our utilisation measure here will be slightly different to our previous figure. Previously we looked at utilisation at the bike level and averaged it. Now, we’re looking at the sum of use over the entire fleet and dividing this by the maximum possible usage per month (all bikes in use 24/7).
Let’s start by creating a monthly index for the time series:
# don't want to include first and last months as they may be incomplete, use in filter later
incomplete_months = df["start_date"].iloc[[0, -1]].dt.to_period("M")
# create a complete monthly index that covers ALL months in period
complete_monthly_index = pd.date_range(
    start=df["start_date"].iloc[0], end=df["end_date"].iloc[-1], freq="M"
).to_period("M")
Next we create a function that counts how many bikes or stations are in use in a given month.
def calc_alive_per_month(starts, ends, incomplete_months, complete_monthly_index):
    starts_per_month = starts.dt.to_period("M").value_counts()
    ends_per_month = ends.dt.to_period("M").value_counts()

    counts_df = (
        complete_monthly_index.to_frame(name="foo")
        .join(starts_per_month)
        .join(ends_per_month)
        .sort_index()
        .fillna(0)
    )
    # ending items should only be counted at the start of next month, so shift
    counts_df["end_date"] = counts_df["end_date"].shift(fill_value=0)

    alive_per_month = counts_df["start_date"].cumsum() - counts_df["end_date"].cumsum()
    return alive_per_month[~alive_per_month.index.isin(incomplete_months)]
Using the function defined above we create a dataframe of bikes available at the monthly level.
alive_bikes_per_month = calc_alive_per_month(
    starts=bike_start,
    ends=bike_end,
    incomplete_months=incomplete_months,
    complete_monthly_index=complete_monthly_index,
)
We are also interested in seeing the total usage of the bikes per month in terms of the sum of the duration of each journey and the total utilisation.
duration_sums_per_month = (
    df[["duration"]].groupby(df["start_date"].dt.to_period("M"))["duration"].sum()
)
duration_sums_per_month = duration_sums_per_month.to_frame()
duration_sums_per_month["max_possible_duration"] = duration_sums_per_month.index.map(
    lambda x: x.end_time - x.start_time
)
utilisation_per_month = (
    duration_sums_per_month["duration"]
    / duration_sums_per_month["max_possible_duration"]
    / alive_bikes_per_month
)
# remove incomplete months
utilisation_per_month = utilisation_per_month[
    ~utilisation_per_month.index.isin(incomplete_months)
]
Let’s also look at how many stations are available in each month.
station_groups = df.groupby("start_station_id")
# relies on time ordering of df via rental_id
station_start = station_groups["start_date"].first()
station_end = station_groups["end_date"].last()
alive_stations_per_month = calc_alive_per_month(
    starts=station_start,
    ends=station_end,
    incomplete_months=incomplete_months,
    complete_monthly_index=complete_monthly_index,
)
Let’s merge our bike usage dataframe with our new station usage data. This results in a monthly indexed dataframe that we can use for visualisation later on:
# forward fill gaps
stats_df = (
    complete_monthly_index.to_frame(name="date")
    .join(alive_bikes_per_month.rename("alive_bikes"))
    .join(alive_stations_per_month.rename("alive_stations"))
    .join(utilisation_per_month.rename("utilisation"))
    .fillna(method="ffill")
)
stats_df.head()
date | alive_bikes | alive_stations | utilisation | |
---|---|---|---|---|
2012-01 | 2012-01 | NaN | NaN | NaN |
2012-02 | 2012-02 | 6592.0 | 521.0 | 0.030871 |
2012-03 | 2012-03 | 8659.0 | 560.0 | 0.046002 |
2012-04 | 2012-04 | 8663.0 | 567.0 | 0.036303 |
2012-05 | 2012-05 | 8813.0 | 569.0 | 0.050691 |
Let’s look at our time series from February 2012 onwards.
stats_df[1:].plot.area(subplots=True)
plt.xlabel("Years")
plt.show()
We can see clearly the expansion of the scheme in 2012, when new stations were added to include East London. Another prominent feature is the annual periodicity, where bike utilisation is, unsurprisingly, higher in the summer. It ranges from roughly 2% in most winters to 5-6% in the summers, so at any given moment, on average, something between 1-in-50 and 1-in-20 bikes are being ridden.
The effect of the pandemic is also evident in the data as a spike in utilisation in 2020; outside of that spike, utilisation seems to have slowly decreased after the excitement of the first couple of years of the scheme. Note that the number of “alive” (in use) bikes also has a temporary spike in the pandemic, which we guess is TfL responding to increased demand.
Looking at the behaviour of individual bikes
For the next section of the story we are interested in the pattern of movements of bikes. To study this we want to follow the different journeys a bike takes during a time period, and for that we define a new concept which we call a chain. A chain is a sequence of trips for a given bike, where each start location matches the previous end location. In other words, a new chain starts every time a bike moves to a new location without the journey appearing in the database, such as when moved by TfL or taken in for repairs.
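Before building chains in earnest, here is a toy illustration of the idea on a hand-made sequence of trips with made-up station IDs: a new chain begins whenever a trip starts somewhere other than where the previous trip ended.
# Toy illustration of chains (made-up station IDs)
toy = pd.DataFrame({"start_station_id": [1, 2, 3, 7], "end_station_id": [2, 3, 3, 9]})
# a new chain starts whenever the start station differs from the previous trip's end station
new_chain = toy["start_station_id"] != toy["end_station_id"].shift().fillna(-1)
toy["chain"] = new_chain.cumsum()
print(toy)
#    start_station_id  end_station_id  chain
# 0                 1               2      1
# 1                 2               3      1
# 2                 3               3      1
# 3                 7               9      2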
Creating chains for all available bikes in this data set can be time consuming and we are only interested in a few examples. We will get only a subset of the chains: let’s look at bikes with the longest lifetimes.
Let’s get the top ten longest living bikes in our data set.
top_ten_lived_bike_ids = bike_lifetime.sort_values()[-10:].index.values
top_ten_bike_subset = df[df["bike_id"].isin(top_ten_lived_bike_ids)].copy()
Let’s create a function that for a given bike creates a chain ID for each sequence of journeys. The chain ID changes once the starting station of a journey is not the same as the end station from the previous one.
def add_chains(bike_id, bike_group):
    # note fillna for end station to allow for comparison to NA
    breaks = bike_group[
        bike_group["start_station_id"]
        != bike_group.shift()["end_station_id"].fillna(-1)
    ]
    break_indices = breaks.index.values

    chains = list()
    for i, (start, end) in enumerate(
        zip([None, *break_indices], [*break_indices, None])
    ):
        chain = bike_group.loc[start:end]
        chain_id = f"{bike_id}_{i}"
        chains.append(pd.Series(chain_id, index=chain.index))
    return pd.concat(chains)
Using the function defined above, let’s get the chains for the top ten longest-lived bikes.
chains = list()
for k, g in tqdm(top_ten_bike_subset.groupby("bike_id")):
    g = bike_groups.get_group(k)
    chains.append(add_chains(bike_id=k, bike_group=g))

top_ten_bike_subset = top_ten_bike_subset.join(pd.concat(chains).rename("chain_id"))

# chain creation can create some duplicate rows, that are removed here.
top_ten_bike_subset = top_ten_bike_subset[~top_ten_bike_subset.index.duplicated(keep='first')]
Let’s look at the number of journeys a typical chain can have (for the top ten longest-lived bikes).
# Plotting the histogram
plt.hist(top_ten_bike_subset.groupby('chain_id').size(), bins=40)
plt.xlim([0, 50])
plt.title('Histogram of journeys on a chain')
plt.xlabel('Number of journeys on a chain')
plt.ylabel('Number of chains')
plt.show()
It can be quite surprising to see bikes being frequently transported between stations by means other than cyclists, possibly by TfL trucks. However, it’s important to note that this distribution only contains data from the top ten longest-lived bikes and may not accurately represent the overall fleet of bikes.
A day in the life of a bike
Now that we have “chains” associated with our long-lived bikes, we can use them to follow the bikes around for a given time period. We are interested to see how they move around the city.
There is still some missingness in our data set, in the form of station IDs not linked to our station data set. Let’s write a simple function to remove journeys where the station information is missing.
def remove_missing_stations(df, stations):
    def check_id(row, stations):
        start_id = str(int(row["start_station_id"]))
        end_id = str(int(row["end_station_id"]))
        if str(start_id) in stations.keys() and str(end_id) in stations.keys():
            return True
        return False

    df["check_stations_ids"] = df.apply(lambda row: check_id(row, stations), axis=1)
    df = df[df.check_stations_ids.eq(True)]
    return df
with open("data/BorisBikes_stations_coordinates.json") as f:
    stations = json.load(f)

data = remove_missing_stations(top_ten_bike_subset, stations)
data.head()
duration | bike_id | end_date | end_station_id | end_station_name | start_date | start_station_id | start_station_name | filename | chain_id | check_stations_ids | |
---|---|---|---|---|---|---|---|---|---|---|---|
rental_id | |||||||||||
9340768 | 0 days 00:20:00 | 893 | 2012-01-04 00:20:00 | 169 | Porchester Place, Paddington | 2012-01-04 00:00:00 | 224 | Queensway, Kensington Gardens | ../bikes/1. Journey Data Extract 04Jan-31Jan 1... | 893_0 | True |
9343209 | 0 days 00:17:00 | 3278 | 2012-01-04 00:23:00 | 114 | Park Road (Baker Street), Regent's Park | 2012-01-04 00:06:00 | 49 | Curzon Street, Mayfair | ../bikes/1. Journey Data Extract 04Jan-31Jan 1... | 3278_0 | True |
9343534 | 0 days 00:18:00 | 540 | 2012-01-04 00:45:00 | 211 | Cadogan Place, Knightsbridge | 2012-01-04 00:27:00 | 64 | William IV Street, Strand | ../bikes/1. Journey Data Extract 04Jan-31Jan 1... | 540_0 | True |
9344212 | 0 days 00:15:00 | 17 | 2012-01-04 00:20:00 | 340 | Bank of England Museum, Bank | 2012-01-04 00:05:00 | 88 | Bayley Street , Bloomsbury | ../bikes/1. Journey Data Extract 04Jan-31Jan 1... | 17_0 | True |
9346348 | 0 days 00:19:00 | 791 | 2012-01-04 00:25:00 | 33 | Altab Ali Park, Whitechapel | 2012-01-04 00:06:00 | 48 | Godliman Street, St. Paul's | ../bikes/1. Journey Data Extract 04Jan-31Jan 1... | 791_0 | True |
Now we can make this code a bit more formal by building objects that represent Bikes and Trips.
A trip is a journey between two stations. The data from TfL only provides a journey’s start and end stations, but we can use the Cycle Streets API to build the most probable route between the two points, given the duration of that journey.
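As a quick illustration, here is a minimal sketch of a single Cycle Streets journey query, using the same endpoint and parameters as the Trip class defined below (and the requests/urllib imports from the top of the story). The API key is a placeholder and the itinerary points are arbitrary example coordinates.
# Sketch: one Cycle Streets journey query (placeholder key, example coordinates)
params = {
    "key": "YOUR_API_KEY",  # placeholder - a real key must be requested from Cycle Streets
    "itinerarypoints": "-0.1235,51.5262|-0.1321,51.5186",  # "lon,lat|lon,lat"
    "plan": "balanced",  # one of balanced, fastest, quietest, shortest
}
url = "https://www.cyclestreets.net/api/journey.json?" + urllib.parse.urlencode(params)
route = requests.get(url).json()["marker"][0]["@attributes"]
print(route["time"])  # estimated journey time in seconds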
Our Trip object (containing the journey and route data) is defined as the following:
class Trip:
    def __init__(self, data, bike_id, trip_id, station_data):
        df = data[data.index == trip_id]

        self.init_station = {
            "name": df.start_station_name.values[0],
            "id": df.start_station_id.values[0],
            "latitude": station_data[str(int(df.start_station_id.values[0]))]["lat"],
            "longitude": station_data[str(int(df.start_station_id.values[0]))]["lon"],
        }
        self.end_station = {
            "name": df.end_station_name.values[0],
            "id": df.end_station_id.values[0],
            "latitude": station_data[str(int(df.end_station_id.values[0]))]["lat"],
            "longitude": station_data[str(int(df.end_station_id.values[0]))]["lon"],
        }
        self.bike = df.bike_id.values[0]
        self.duration = df.duration.values[0]
        self.date = {
            "start": df.start_date.values[0],
            "end": df.end_date.values[0],
        }
        self.circular = self.init_station == self.end_station
        self.route = {}
        self.bike_id = bike_id
        self.trip_id = trip_id

    def get_route(self, key, route_path="routes/"):
        if not os.path.exists(route_path):
            os.makedirs(route_path)
        route_file_path = (
            route_path + str(self.bike_id) + "_" + str(self.trip_id) + ".json"
        )
        if os.path.isfile(route_file_path):
            with open(route_file_path, "r") as fp:
                data = json.load(fp)
            self.route = data
        else:
            if self.circular:
                self.route = {}
            else:
                plans = ["balanced", "fastest", "quietest", "shortest"]

                closest_time = None
                trip_data = {}

                for plan in plans:
                    itinerarypoints = "{},{}|{},{}".format(
                        self.init_station["longitude"],
                        self.init_station["latitude"],
                        self.end_station["longitude"],
                        self.end_station["latitude"],
                    )
                    name = "https://www.cyclestreets.net/api/journey.json?" + urllib.parse.urlencode(
                        {
                            "key": key,
                            "itinerarypoints": itinerarypoints,
                            "plan": plan,
                        }
                    )
                    data = requests.get(name).json()["marker"][0]["@attributes"]
                    time = int(data["time"])
                    if closest_time is None:
                        closest_time = abs(time - self.duration)
                        trip_data = data
                    elif abs(self.duration - time) < closest_time:
                        closest_time = abs(time - self.duration)
                        trip_data = data

                self.route = trip_data
            with open(route_file_path, "w") as fp:
                json.dump(self.route, fp)
A Bike object represents a bike identified by its ID. The Bike object has methods that allow you to fetch its “story” which is made up of all the trips recorded in the data and routes obtained from Cycle Streets.
In order to fetch data from the Cycle Streets API you need to apply for access and obtain a key.
# you should store the key in a file called cycle_street_key.txt inside the data folder
key = open("data/cycle_street_key.txt", "r").readline()
class Bike:
    def __init__(self, id):
        self.id = id

    def get_chains(self, stations):
        chain_ids = self.bike_rides.chain_id.to_list()
        chains = {}
        for chain_id in chain_ids:
            chain_rides = self.bike_rides[self.bike_rides["chain_id"] == chain_id]
            chains[chain_id] = [
                Trip(chain_rides, self.id, trip_id, stations)
                for trip_id in chain_rides.index
            ]
        self.chains = chains

    def get_story(self, dataset, stations, key):
        bike_rides = dataset[dataset["bike_id"] == self.id]
        self.bike_rides = bike_rides
        self.get_chains(stations)
        for chain_id, chain in self.chains.items():
            for counter, trip in enumerate(chain):
                trip.get_route(key)
                if trip.route == {}:
                    continue
We can visualise the journeys of a bike on a map using the libraries Folium and movingpandas.
def get_colours(steps):
    colours = sns.color_palette("mako").as_hex()
    rev_colours = sns.color_palette("mako").as_hex()
    rev_colours.reverse()
    colours = rev_colours + colours
    while len(colours) < steps:
        colours += colours
    return colours
def get_trajectory(bike_id, route_folder="routes/"):
    trip_list = [
        filename
        for filename in sorted(os.listdir(route_folder))
        if str(bike_id) + "_" in filename
    ]

    times = []
    geometry = []
    colours = []

    many_colurs = get_colours(len(trip_list))

    for c in range(len(trip_list)):
        chain = trip_list[c]
        with open(route_folder + chain) as f:
            d = json.load(f)
        if len(d) > 0:
            geometry += [
                Point([float(y) for y in x.split(",")])
                for x in d["coordinates"].split(" ")
            ]
            if len(times) == 0:
                time_now = datetime.datetime.now()
            else:
                time_now = times[-1]
            times += [
                time_now + datetime.timedelta(seconds=1 * t + 1)
                for t in range(len(d["coordinates"].split(" ")))
            ]
            colours += [many_colurs[c] for x in range(len(d["coordinates"].split(" ")))]

    df = pd.DataFrame()

    df["t"] = times
    df["trajectory_id"] = [1 for x in range(len(geometry))]
    df["sequence"] = [x + 1 for x in range(len(geometry))]
    df["colour"] = colours

    gdf = GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)
    gdf = gdf.set_index("t")

    trajs = mpd.TrajectoryCollection(gdf, "trajectory_id")
    trajs = mpd.MinTimeDeltaGeneralizer(trajs).generalize(
        tolerance=datetime.timedelta(seconds=1)
    )
    traj = trajs.trajectories[0]
    return traj
def draw_map(traj):
    features = traj_to_timestamped_geojson(traj)
    # Create base map
    London = [51.506949, -0.122876]
    map = folium.Map(location=London, zoom_start=12, tiles="cartodbpositron")
    TimestampedGeoJson(
        {
            "type": "FeatureCollection",
            "features": features,
        },
        period="PT1S",
        add_last_point=False,
        transition_time=10,
    ).add_to(map)
    return map
def traj_to_timestamped_geojson(trajectory):
    features = []
    df = trajectory.df.copy()
    df["previous_geometry"] = df["geometry"].shift()
    df["time"] = df.index
    df["previous_time"] = df["time"].shift()
    for _, row in df.iloc[1:].iterrows():
        coordinates = [
            [
                row["previous_geometry"].xy[0][0],
                row["previous_geometry"].xy[1][0],
            ],
            [row["geometry"].xy[0][0], row["geometry"].xy[1][0]],
        ]
        times = [row["previous_time"].isoformat(), row["time"].isoformat()]
        features.append(
            {
                "type": "Feature",
                "geometry": {
                    "type": "LineString",
                    "coordinates": coordinates,
                },
                "properties": {
                    "times": times,
                    "style": {
                        "color": row["colour"],
                        "weight": 5,
                    },
                },
            }
        )
    return features
bike_id = 893
selected_data = data[
    (data["start_date"] > "2020-03-23") & (data["start_date"] < "2020-05-14")
]
bike = Bike(id=bike_id)
bike.get_story(selected_data, stations, key)

traj = get_trajectory(bike_id)
map_trajectory = draw_map(traj)
map_trajectory
In this map you can see a series of trips made by a single bike during the first lockdown, between March and May 2020. Different colours highlight different journeys. Straight-line jumps on the map mean the bike was moved to another station by TfL rather than by a cyclist, and therefore that chain has ended. This is a static image created for rendering in fastpages. If you want to see the animated version, take a look at this nbviewer link.
If you are running locally you can take a look at the story of other bikes, for instance ID `3278` - you might be surprised by what you find!
Bike mobility patterns
The patterns of mobility observed in this data set can be abstracted using networks built from trips between stations.
We construct a directed, weighted graph/network where the nodes are stations and the weight of an edge from station A to station B is the number of trips taken from A to B. As we’ll see below, this reveals some high-level patterns in different regions and around different stations in London that vary between seasons, weekdays, and times of the day.
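As a toy illustration of this structure (with made-up station labels and trip counts), such a graph could be built as follows:
# Toy illustration of the trip graph (made-up stations and counts)
g = nx.DiGraph()
g.add_edge("A", "B", trip_count=120)  # 120 recorded trips from A to B
g.add_edge("B", "A", trip_count=80)   # 80 trips back the other way
g.add_edge("B", "B", trip_count=15)   # a self-loop: bikes returned to the station they left from
print(g["A"]["B"]["trip_count"])  # 120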
For the network analysis, we use the `networkx` package. We start by writing a function that builds a `networkx` network from a dataframe of trip data. A parameter `trip_count_threshold` is used to exclude edges that have a very small weight, to make the analysis computationally more tractable.
def create_network_from_data(df, trip_count_threshold=1e-5):
    trip_counts = (
        (
            df[["start_station_id", "end_station_id", "bike_id"]]
            .groupby(["start_station_id", "end_station_id"])
            .count()
        )
        .reset_index()
        .rename(columns={"bike_id": "trip_count"})
    )
    trip_counts = trip_counts.sort_values("trip_count")
    total_num_trips = trip_counts["trip_count"].sum()

    trip_counts = trip_counts[
        trip_counts["trip_count"] >= trip_count_threshold * total_num_trips
    ]

    graph = nx.from_pandas_edgelist(
        trip_counts,
        source="start_station_id",
        target="end_station_id",
        edge_attr="trip_count",
        create_using=nx.DiGraph,
    )

    return graph
As we will see in a moment, our network, like many naturally occurring networks, has so called communities: these are clusters of nodes that are strongly linked to each other, and less linked to nodes outside the cluster. In our case this corresponds to groups of bike docking stations that have a lot of traffic amongst each other.
Network theorists have developed many different kinds of algorithms for detecting community structure in networks. We use the Louvain algorithm, and its implementation in the `python-louvain` package. The Louvain algorithm chooses the communities by trying to maximise a metric called modularity, which, roughly speaking, measures how much total weight there is in edges that are internal to communities, compared to weight in edges that cross from one community to another. It is a greedy and quite fast (O(N log N)) algorithm that works by hierarchically grouping densely connected nodes together and replacing them with single nodes that represent small communities, grouping those small communities to form bigger communities, etc. You can find more details on Wikipedia or in the documentation of the `python-louvain` package.
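To make the mechanics concrete, here is a minimal sketch of running `python-louvain` on a tiny hand-made graph with two obvious clusters; the real call on our trip network appears in the function below. Since the algorithm is stochastic, the exact partition labels may vary between runs.
# Toy Louvain example: a hand-made graph with two dense clusters
toy_graph = nx.Graph()
toy_graph.add_edges_from(
    [("A", "B"), ("B", "C"), ("A", "C"),  # cluster 1
     ("D", "E"), ("E", "F"), ("D", "F"),  # cluster 2
     ("C", "D")]                          # a single edge bridging the clusters
)
toy_partition = community.best_partition(toy_graph)
print(toy_partition)  # e.g. {'A': 0, 'B': 0, 'C': 0, 'D': 1, 'E': 1, 'F': 1}
print(community.modularity(toy_partition, toy_graph))  # modularity of that partition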
Below we write a function that does such community detection. Note that we need to ignore the directionality of our graph, i.e. the distinction between bikes going from station A to station B rather than from B to A. This is a limitation of the Louvain algorithm, but for community-detection purposes the directionality isn’t important.
def network_community_detection(graph, edge_weight):
    graph_undirected = nx.Graph()
    undirected_edges = set(sorted(graph.edges))
    for edge in undirected_edges:
        reverse_edge = (edge[1], edge[0])
        trip_count = graph.edges[edge][edge_weight]
        if reverse_edge in graph.edges:
            trip_count += graph.edges[reverse_edge][edge_weight]
        graph_undirected.add_edge(edge[0], edge[1], trip_count=trip_count)

    partition = community.best_partition(graph_undirected, weight=edge_weight)
    df_partition = pd.DataFrame(partition, index=[0]).T.reset_index()
    df_partition.columns = ["id", "partition"]
    return df_partition
Finally a function for visualising these graphs of bike trips and their communities on a map, together with some supporting functions to aid in that job.
STATION_NAMES_FILE = "data/BorisBikes_station_names.pickle"
STATION_COORDS_FILE = LOCATION_REF

LABEL_STATIONS = [
    "Belgrove Street",
    "Waterloo Station 3",
    "Hyde Park Corner",
    "Aquatic Centre",
    "Bethnal Green Road",
    "Natural History Museum",
    "Kennington Oval",
    "Mudchute DLR",
]
def get_station_name(id):
    with open(STATION_NAMES_FILE, "rb") as f:
        station_allnames = pickle.load(f)

    name = sorted(station_allnames[id])[0]
    name = re.split(";|,|:", name)[0].strip()
    return name
def get_node_info(graph):
    with open(STATION_COORDS_FILE, "r") as f:
        station_latlon = json.load(f)

    nodes = graph.nodes()

    pos = [station_latlon[str(int(node))] for node in nodes]
    pos = [(p["lon"], p["lat"]) for p in pos]

    station_sizes = [i[1] for i in list(graph.degree(weight="trip_count"))]

    labels = [get_station_name(int(node)) for node in nodes]

    nodes_df = pd.DataFrame(
        {"id": list(nodes), "pos": pos, "size": station_sizes, "name": labels}
    )

    return nodes_df
def scale_range(values, min_scaled, max_scaled):
    values = np.array(values)
    if min_scaled is not None:
        max_value = np.max(values)
        min_value = np.min(values)
        mult_coeff = (max_scaled - min_scaled) / (max_value - min_value)
        add_coeff = (max_value * min_scaled - min_value * max_scaled) / (
            max_value - min_value
        )
        scaled = mult_coeff * values + add_coeff
    else:
        max_value = np.max(values)
        scaled = max_scaled * values / max_value
    return scaled
def drop_stations_without_location(graph):
    with open(STATION_COORDS_FILE, "r") as f:
        station_latlon = json.load(f)
    nodes = tuple(graph.nodes)
    stations_with_location = tuple(map(int, station_latlon.keys()))
    for n in nodes:
        if n not in stations_with_location:
            print(f"Removing node {n} because of missing location data.")
            graph.remove_node(n)
    return None
def create_network_and_map(
    df,
    label_stations=LABEL_STATIONS,
    allow_self_loops=False,
    arrows=True,
    bg_map_zoom=15,
):
    community_graph = create_network_from_data(df)
    drop_stations_without_location(community_graph)
    nodes_info = get_node_info(community_graph)
    visualisation_graph = community_graph.copy()
    if not allow_self_loops:
        visualisation_graph.remove_edges_from(nx.selfloop_edges(community_graph))
    community_df = network_community_detection(community_graph, "trip_count")
    nodes_info = nodes_info.merge(community_df, on="id")
    nodes_info = nodes_info.sort_values(by="size", ascending=False)
    del community_df
    nodes_info["lon"] = [p[0] for p in nodes_info["pos"]]
    nodes_info["lat"] = [p[1] for p in nodes_info["pos"]]

    nodes_info = GeoDataFrame(
        nodes_info,
        geometry=points_from_xy(nodes_info.lon, nodes_info.lat),
        crs="EPSG:4326",
    )
    # Project to UK national grid
    nodes_info = nodes_info.to_crs("EPSG:27700")

    labels = {
        id: name
        for id, name in zip(nodes_info["id"], nodes_info["name"])
        if name in label_stations
    }

    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    nodes_info.plot(ax=ax, markersize=1, aspect=None)
    cx.add_basemap(
        ax, crs=nodes_info.crs, source=cx.providers.Stamen.TonerLite, zoom=bg_map_zoom
    )

    xynps = [
        np.array([p.x for p in nodes_info["geometry"]]),
        np.array([p.y for p in nodes_info["geometry"]]),
    ]
    pos = {k: (xynps[0][i], xynps[1][i]) for i, k in enumerate(nodes_info["id"])}

    MAX_NODE_SIZE = 300.0
    MIN_NODE_SIZE = 5.0

    sizes = scale_range(nodes_info["size"], MIN_NODE_SIZE, MAX_NODE_SIZE)
    weights = np.array(
        [visualisation_graph.edges[e]["trip_count"] for e in visualisation_graph.edges]
    )

    MAX_EDGE_WIDTH = 3.0
    MIN_EDGE_WIDTH = None

    weights = scale_range(weights, MIN_EDGE_WIDTH, MAX_EDGE_WIDTH)

    MAX_EDGE_ALPHA = 0.9
    MIN_EDGE_ALPHA = None

    edge_alpha = scale_range(weights, MIN_EDGE_ALPHA, MAX_EDGE_ALPHA)

    # Plots
    nx.draw_networkx_nodes(
        visualisation_graph,
        pos=pos,
        nodelist=nodes_info["id"],
        node_color=nodes_info["partition"],
        alpha=1.0,
        node_size=sizes,
        cmap="tab10",
        ax=ax,
    )

    nx.draw_networkx_edges(
        visualisation_graph,
        pos=pos,
        edge_color="#222222",
        width=weights,
        alpha=edge_alpha,
        arrows=arrows,
        ax=ax,
    )

    nx.draw_networkx_labels(
        visualisation_graph,
        pos=pos,
        labels=labels,
        font_size=12,
        ax=ax,
    )

    return fig, ax, nodes_info
To make the network analysis computationally manageable, we restrict our attention to data from a single year, in this case 2021.
HARD_START_DATE = datetime.datetime(year=2010, month=1, day=1)
start_date = datetime.datetime(year=2021, month=1, day=1)
end_date = datetime.datetime(year=2022, month=1, day=1)
df = df[(df["start_date"] > HARD_START_DATE) & (df["end_date"] > HARD_START_DATE)]
df_year = df[(df["start_date"] > start_date) & (df["start_date"] < end_date)]
Plot time! Below is the network of bike trips aggregated over weekday mornings only, in 2021. The colours of the nodes mark the different communities as discovered by the Louvain algorithm. The lines connecting nodes have arrows indicating the direction of travel and their thickness marks the amount of traffic.
print("Plotting mornings")
df_year_mornings = df[
    df["start_date"].dt.hour.isin([7, 8, 9, 10])
    & df["start_date"].dt.weekday.isin((0, 1, 2, 3, 4))
]
fig, ax, nodes_info = create_network_and_map(df_year_mornings)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekday mornings (7-10)")
plt.show()
Plotting mornings
Number of communities: 5
The two most prominent types of weekday morning traffic are outgoing trips from King’s Cross/St. Pancras (station called Belgrove Street marked on the map) and from Waterloo. Commuters presumably come in on trains and cover the last bit to their workplace by a Boris Bike. These two types of traffic form two of the communities found by the Louvain algorithm, with the others being less concentrated communities for East, West, and South London.
Below is a similar graph but for weekday afternoons rather than mornings. The dominant patterns are the reverse of the morning traffic, with people going from their offices to the big train stations. A new feature is a community of traffic around Hyde Park, presumably pleasure rides in the afternoon by people on holiday.
There are some other shifts in the communities marked by the colours, that we find hard to interpret. There’s also some instability and randomness in the Louvain algorithm, where if you run it several times sometimes it divides the smaller communities in slightly different ways. The commuter communities are robust, though.
print("Plotting afternoons")
df_year_afternoons = df[
    df["start_date"].dt.hour.isin([15, 16, 17, 18, 19])
    & df["start_date"].dt.weekday.isin((0, 1, 2, 3, 4))
]
fig, ax, nodes_info = create_network_and_map(df_year_afternoons)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekday afternoons (15-19)")
plt.show()
Plotting afternoons
Number of communities: 12
Below is the same plot but for weekends. Here the most common trips are around Hyde Park and marked by self-loops, where the bike is returned to the same station where it was picked up. We don’t know where they were ridden in between, but probably around the park for fun or a picnic. The Olympic Park has a similar loop. The communities again show different regions of London, with people travelling within neighbourhoods, although some of them are somewhat surprising and hard to interpret, with outlier stations way outside the bulk of their communities.
print("Plotting weekends")
df_year_weekends = df[df["start_date"].dt.weekday.isin((5, 6))]
fig, ax, nodes_info = create_network_and_map(
    df_year_weekends,
    allow_self_loops=True,
)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekends")
plt.show()
Plotting weekends
Number of communities: 26
Finally, let’s see how patterns have changed over time. Below we create these network plots, but this time not for a specific time of day or week, but for the years 2013, 2017, and 2020.
for year in (2013, 2017, 2020):
    print(f"Plotting {year}")
    start_date = datetime.datetime(year=year, month=1, day=1)
    end_date = datetime.datetime(year=year + 1, month=1, day=1)
    df_year = df[(df["start_date"] > start_date) & (df["start_date"] < end_date)]
    fig, ax, nodes_info = create_network_and_map(
        df_year,
        allow_self_loops=False,
        arrows=False,
    )
    num_communities = len(nodes_info["partition"].unique())
    print(f"Number of communities: {num_communities}")
    plt.title(f"Year {year}")
    plt.show()
Plotting 2013
Number of communities: 6
Plotting 2017
Number of communities: 10
Plotting 2020
Number of communities: 12
From 2013 to 2017 the big change is the growth of the network of stations. The Boris Bike network grows a lot, especially in the west and southwest, but also around, for instance, the Olympic Park.
From 2017 to 2020 the notable change is the effect of the pandemic: Most of the commuter traffic from e.g. King’s Cross and Waterloo has vanished. The dominant traffic patterns that did survive the pandemic were the rides around Hyde Park and the Olympic Park.
Note that we should be careful not to read too much into the coloured communities and how they’ve changed from year to year. First of all, the colours bear no particular meaning, and the same community may be coloured differently in different plots. Second, the Louvain algorithm for discovering them is stochastic, and many of the smaller communities might arrange themselves differently if you simply run the same cell again. The communities with the most traffic remain robust, though.
Conclusions
In this Turing data story we aimed to use the TfL bike journey data set to understand the usage of Boris Bikes and how it has changed, as well as the patterns in how people move around the capital. This story was born out of a ‘Hack week’ organised by the Research Engineering Group at the Turing, an annual event in which people from the team get to work on projects or data sets they find interesting.
Our strategy was to extract some general statistics about bike usage over time, followed by trying to characterise in different ways the movements of some long-lived bikes, and finally to use network analysis techniques to abstract these movements into high-level patterns of behaviour.
Among other things, the analysis very clearly showed how the pandemic drastically changed the way we move around the city and the strength of the commuter links. Furthermore, in the last couple of years a number of other cycling schemes have been introduced, and significant improvements to the cycling infrastructure might have encouraged people to acquire their own bikes, both of which may have contributed to changes in Boris Bike use. A task for a future analysis (from a Turing Data Story reader :-) ) is to include 2022 and 2023 data and see if this change in behaviour has persisted.