2.5 Data analysis example
Contents
2.5 Data analysis example#
Estimated time for this notebook: 20 minutes
We’re now going to bring together everything we’ve learned about Python so far to perform a simple but complete analysis. We will retrieve data, do some computations based on it, and visualise the results.
As we show the code for different parts of the work, we will be touching on various aspects you may want to keep in mind, either related to Python specifically, or to research programming more generally.
2.5.1 Geolocation#
import geopy # A python library for investigating geographic information. https://pypi.org/project/geopy/
If you try to follow along on this example in an Jupyter notebook, you might find that you just got an error message.
You’ll need to wait until we’ve covered installation of additional python libraries later in the course, then come back to this and try again. For now, just follow along and try get the feel for how programming for data-focused research works.
geocoder = geopy.geocoders.Nominatim(user_agent="rse-course")
geocoder.geocode("Cambridge", exactly_one=False)
[Location(Cambridge, Cambridgeshire, Cambridgeshire and Peterborough, England, United Kingdom, (52.2055314, 0.1186637, 0.0)),
Location(Cambridge, Middlesex County, Massachusetts, United States, (42.3656347, -71.1040018, 0.0)),
Location(Cambridge, Region of Waterloo, Ontario, Canada, (43.3600536, -80.3123023, 0.0)),
Location(Cambridge, Henry County, Illinois, United States, (41.3025257, -90.1962861, 0.0)),
Location(Cambridge, Isanti County, Minnesota, 55008, United States, (45.5727408, -93.2243921, 0.0)),
Location(Cambridge, Story County, Iowa, United States, (41.8990768, -93.5294029, 0.0)),
Location(Cambridge, Dorchester County, Maryland, 21613, United States, (38.5714624, -76.0763177, 0.0)),
Location(Cambridge, Guernsey County, Ohio, 43725, United States, (40.031183, -81.5884561, 0.0)),
Location(Cambridge, Jefferson County, Kentucky, United States, (38.2217369, -85.616627, 0.0)),
Location(Cambridge, Cowley County, Kansas, United States, (37.316988, -96.66633224663678, 0.0))]
Note that the results are a list of Location
objects, where each Location
knows its name
, latitude
and longitude
.
Let’s define and test a geolocate
function, storing the result in a variable
def geolocate(place):
return geocoder.geocode(place, exactly_one=False)[0][1]
london_location = geolocate("London")
print(london_location)
(51.4893335, -0.14405508452768728)
2.5.2 Using the Yandex API#
The Yandex API allows us to fetch a map of a place, given a longitude and latitude. The URLs look like: https://static-maps.yandex.ru/1.x/?size=400,400&ll=-0.1275,51.51&z=10&l=sat&lang=en_US We’ll probably end up working out these URLs quite a bit. So we’ll make ourselves another function to build up a URL given our parameters.
import requests
def request_map_at(lat, long, satellite=True, zoom=12, size=(400, 400)):
base = "https://static-maps.yandex.ru/1.x/?"
params = dict(
z=zoom,
size=str(size[0]) + "," + str(size[1]),
ll=str(long) + "," + str(lat),
l="sat" if satellite else "map",
lang="en_US",
)
return requests.get(base, params=params, timeout=60)
map_response = request_map_at(51.5072, -0.1275)
2.5.3 Checking our work#
Let’s see what URL we ended up with:
url = map_response.url
print(url)
https://static-maps.yandex.ru/1.x/?z=12&size=400%2C400&ll=-0.1275%2C51.5072&l=sat&lang=en_US
We can write automated tests so that if we change our code later, we can check the results are still valid.
assert "https://static-maps.yandex.ru/1.x/?" in url
assert "ll=-0.1275%2C51.5072" in url
assert "z=12" in url
assert "size=400%2C400" in url
Our previous function comes back with an object
representing the web request.
In object oriented programming, we use the .
operator to get access to a particular property of the object, in this case, the actual image at that URL is in the content
property.
It’s a big file, so we’ll just show the first few characters here:
map_response.content[0:20]
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00'
2.5.4 Displaying the results#
We’ll need to do this a lot, so let’s wrap up our previous function in another function, to save on typing.
def map_at(*args, **kwargs):
return request_map_at(*args, **kwargs).content
We can use a library that comes with Jupyter notebook to display the image. Being able to work with variables which contain images, or documents, or any other weird kind of data, just as easily as we can with numbers or letters, is one of the really powerful things about modern programming languages like Python.
from IPython.display import Image
map_png = map_at(*london_location)
print("The type of our map result is actually a: ", type(map_png))
The type of our map result is actually a: <class 'bytes'>
Image(map_png)
Image(map_at(*geolocate("New Delhi")))
2.5.5 Measuring urbanisation#
Now we get to our research project: we want to find out how urbanised the world is. For this we’ll use satellite imagery, along a line between two cities. We expect the satellite image to be greener in the countryside.
Let’s start by importing the libraries we need.
from io import BytesIO # A library to convert between files and strings
import imageio # A library to deal with images, https://pypi.org/project/imageio/
import numpy as np # A library to deal with matrices
and then define what we count as green:
def is_green(pixels):
threshold = 1.1
greener_than_red = pixels[:, :, 1] > threshold * pixels[:, :, 0]
greener_than_blue = pixels[:, :, 1] > threshold * pixels[:, :, 2]
green = np.logical_and(greener_than_red, greener_than_blue)
return green
This code has assumed we have our pixel data for the image as a \(400 \times 400 \times 3\) 3-d matrix, with each of the three layers being red, green, and blue pixels.
We find out which pixels are green by comparing, element-by-element, the middle (green, number 1) layer to the top (red, zero) and bottom (blue, 2)
Now we just need to read in our data, which is a PNG image, and convert it into our matrix format:
def count_green_in_png(data):
f = BytesIO(data)
pixels = imageio.v2.imread(f) # Get our PNG image as a numpy array
return np.sum(is_green(pixels))
print(count_green_in_png(map_at(*london_location)))
3674
We’ll also need a function to get an evenly spaced set of places between two endpoints:
def location_sequence(start, end, steps):
lats = np.linspace(start[0], end[0], steps) # "Linearly spaced" data
longs = np.linspace(start[1], end[1], steps)
return np.vstack([lats, longs]).transpose()
location_sequence(geolocate("London"), geolocate("Cambridge"), 5)
array([[ 5.14893335e+01, -1.44055085e-01],
[ 5.16683830e+01, -7.83753884e-02],
[ 5.18474324e+01, -1.26956923e-02],
[ 5.20264819e+01, 5.29840039e-02],
[ 5.22055314e+01, 1.18663700e-01]])
2.5.6 Visualising green content#
We should display the green content to check our work:
def show_green_in_png(data):
pixels = imageio.imread(BytesIO(data)) # Get our PNG image as rows of pixels
green = is_green(pixels)
out = green[:, :, np.newaxis] * np.array([0, 1, 0])[np.newaxis, np.newaxis, :]
buffer = BytesIO()
imageio.imwrite(buffer, out, format="png")
return buffer.getvalue()
Image(map_at(*london_location, satellite=True))
Image(show_green_in_png(map_at(*london_location, satellite=True)))
/tmp/ipykernel_11600/3094229650.py:2: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
pixels = imageio.imread(BytesIO(data)) # Get our PNG image as rows of pixels
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/PIL/Image.py:3130, in fromarray(obj, mode)
3129 try:
-> 3130 mode, rawmode = _fromarray_typemap[typekey]
3131 except KeyError as e:
KeyError: ((1, 1, 3), '<i8')
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Cell In[23], line 1
----> 1 Image(show_green_in_png(map_at(*london_location, satellite=True)))
Cell In[21], line 8, in show_green_in_png(data)
5 out = green[:, :, np.newaxis] * np.array([0, 1, 0])[np.newaxis, np.newaxis, :]
7 buffer = BytesIO()
----> 8 imageio.imwrite(buffer, out, format="png")
9 return buffer.getvalue()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/imageio/v2.py:397, in imwrite(uri, im, format, **kwargs)
395 imopen_args["legacy_mode"] = True
396 with imopen(uri, "wi", **imopen_args) as file:
--> 397 return file.write(im, **kwargs)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/imageio/plugins/pillow.py:444, in PillowPlugin.write(self, ndimage, mode, format, is_batch, **kwargs)
441 ndimage = ndimage[None, ...]
443 for frame in ndimage:
--> 444 pil_frame = Image.fromarray(frame, mode=mode)
445 if "bits" in kwargs:
446 pil_frame = pil_frame.quantize(colors=2 ** kwargs["bits"])
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/PIL/Image.py:3134, in fromarray(obj, mode)
3132 typekey_shape, typestr = typekey
3133 msg = f"Cannot handle this data type: {typekey_shape}, {typestr}"
-> 3134 raise TypeError(msg) from e
3135 else:
3136 rawmode = mode
TypeError: Cannot handle this data type: (1, 1, 3), <i8
2.5.7 Looping#
We can loop over each element in our list of coordinates, and get a map for that place:
for location in location_sequence(geolocate("London"), geolocate("Birmingham"), 4):
display(Image(map_at(*location)))
So now we can count the green from London to Birmingham!
[
count_green_in_png(map_at(*location))
for location in location_sequence(geolocate("London"), geolocate("Birmingham"), 10)
]
[3258, 37141, 59293, 22398, 30123, 56351, 60224, 53067, 137132, 143993]
2.5.8 Plotting graphs#
Let’s plot a graph.
import matplotlib.pyplot as plt
plt.plot(
[
count_green_in_png(map_at(*location))
for location in location_sequence(
geolocate("London"), geolocate("Birmingham"), 10
)
]
)
[<matplotlib.lines.Line2D at 0x10aedd2e0>]
From a research perspective, of course, this code needs a lot of work. But I hope the power of using programming is clear.
2.5.9 Composing Program Elements#
We built little pieces of useful code, to:
Find latitude and longitude of a place
Get a map at a given latitude and longitude
Decide whether a (red,green,blue) triple is mainly green
Decide whether each pixel is mainly green
Plot a new image showing the green places
Find evenly spaced points between two places
By putting these together, we can make a function which can plot this graph automatically for any two places:
def green_between(start, end, steps):
return [
count_green_in_png(map_at(*location))
for location in location_sequence(geolocate(start), geolocate(end), steps)
]
plt.plot(green_between("New York", "Chicago", 20))
[<matplotlib.lines.Line2D at 0x10af8c070>]
And that’s it! We’ve used Python to analyse data from an internet API and visualise it in interesting ways.