2.5 Data analysis example#

Estimated time for this notebook: 20 minutes

We’re now going to bring together everything we’ve learned about Python so far to perform a simple but complete analysis. We will retrieve data, do some computations based on it, and visualise the results.

As we show the code for different parts of the work, we will be touching on various aspects you may want to keep in mind, either related to Python specifically, or to research programming more generally.

2.5.1 Geolocation#

import geopy  # A python library for investigating geographic information. https://pypi.org/project/geopy/

If you try to follow along on this example in an Jupyter notebook, you might find that you just got an error message.

You’ll need to wait until we’ve covered installation of additional python libraries later in the course, then come back to this and try again. For now, just follow along and try get the feel for how programming for data-focused research works.

geocoder = geopy.geocoders.Nominatim(user_agent="rse-course")
geocoder.geocode("Cambridge", exactly_one=False)
[Location(Cambridge, Cambridgeshire, Cambridgeshire and Peterborough, England, United Kingdom, (52.2055314, 0.1186637, 0.0)),
 Location(Cambridge, Middlesex County, Massachusetts, United States, (42.3655767, -71.1040018, 0.0)),
 Location(Cambridge, Region of Waterloo, Ontario, Canada, (43.3600536, -80.3123023, 0.0)),
 Location(Cambridge, Henry County, Illinois, United States, (41.3025257, -90.1962861, 0.0)),
 Location(Cambridge, Isanti County, Minnesota, 55008, United States, (45.5727408, -93.2243921, 0.0)),
 Location(Cambridge, Story County, Iowa, United States, (41.8990768, -93.5294029, 0.0)),
 Location(Cambridge, Dorchester County, Maryland, 21613, United States, (38.5714624, -76.0763177, 0.0)),
 Location(Cambridge, Guernsey County, Ohio, 43725, United States, (40.031183, -81.5884561, 0.0)),
 Location(Cambridge, Jefferson County, Kentucky, United States, (38.2217369, -85.616627, 0.0)),
 Location(Cambridge, Cowley County, Kansas, United States, (37.316988, -96.66633224663678, 0.0))]

Note that the results are a list of Location objects, where each Location knows its name, latitude and longitude.

Let’s define and test a geolocate function, storing the result in a variable

def geolocate(place):
    return geocoder.geocode(place, exactly_one=False)[0][1]
london_location = geolocate("London")
print(london_location)
(51.5074456, -0.1277653)

2.5.2 Using the Yandex API#

The Yandex API allows us to fetch a map of a place, given a longitude and latitude. The URLs look like: https://static-maps.yandex.ru/1.x/?size=400,400&ll=-0.1275,51.51&z=10&l=sat&lang=en_US We’ll probably end up working out these URLs quite a bit. So we’ll make ourselves another function to build up a URL given our parameters.

import requests


def request_map_at(lat, long, satellite=True, zoom=12, size=(400, 400)):
    base = "https://static-maps.yandex.ru/1.x/?"
    params = dict(
        z=zoom,
        size=str(size[0]) + "," + str(size[1]),
        ll=str(long) + "," + str(lat),
        l="sat" if satellite else "map",
        lang="en_US",
    )
    return requests.get(base, params=params, timeout=60)
map_response = request_map_at(51.5072, -0.1275)

2.5.3 Checking our work#

Let’s see what URL we ended up with:

url = map_response.url
print(url)
https://static-maps.yandex.ru/1.x/?z=12&size=400%2C400&ll=-0.1275%2C51.5072&l=sat&lang=en_US

We can write automated tests so that if we change our code later, we can check the results are still valid.

assert "https://static-maps.yandex.ru/1.x/?" in url
assert "ll=-0.1275%2C51.5072" in url
assert "z=12" in url
assert "size=400%2C400" in url

Our previous function comes back with an object representing the web request. In object oriented programming, we use the . operator to get access to a particular property of the object, in this case, the actual image at that URL is in the content property. It’s a big file, so we’ll just show the first few characters here:

map_response.content[0:20]
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00'

2.5.4 Displaying the results#

We’ll need to do this a lot, so let’s wrap up our previous function in another function, to save on typing.

def map_at(*args, **kwargs):
    return request_map_at(*args, **kwargs).content

We can use a library that comes with Jupyter notebook to display the image. Being able to work with variables which contain images, or documents, or any other weird kind of data, just as easily as we can with numbers or letters, is one of the really powerful things about modern programming languages like Python.

from IPython.display import Image

map_png = map_at(*london_location)
print("The type of our map result is actually a: ", type(map_png))
The type of our map result is actually a:  <class 'bytes'>
Image(map_png)
../_images/02_05_data_analysis_example_28_0.jpg
Image(map_at(*geolocate("New Delhi")))
../_images/02_05_data_analysis_example_29_0.jpg

2.5.5 Measuring urbanisation#

Now we get to our research project: we want to find out how urbanised the world is. For this we’ll use satellite imagery, along a line between two cities. We expect the satellite image to be greener in the countryside.

Let’s start by importing the libraries we need.

from io import BytesIO  # A library to convert between files and strings

import imageio  # A library to deal with images, https://pypi.org/project/imageio/
import numpy as np  # A library to deal with matrices

and then define what we count as green:

def is_green(pixels):
    threshold = 1.1
    greener_than_red = pixels[:, :, 1] > threshold * pixels[:, :, 0]
    greener_than_blue = pixels[:, :, 1] > threshold * pixels[:, :, 2]
    green = np.logical_and(greener_than_red, greener_than_blue)
    return green

This code has assumed we have our pixel data for the image as a \(400 \times 400 \times 3\) 3-d matrix, with each of the three layers being red, green, and blue pixels.

We find out which pixels are green by comparing, element-by-element, the middle (green, number 1) layer to the top (red, zero) and bottom (blue, 2)

Now we just need to read in our data, which is a PNG image, and convert it into our matrix format:

def count_green_in_png(data):
    f = BytesIO(data)
    pixels = imageio.v2.imread(f)  # Get our PNG image as a numpy array
    return np.sum(is_green(pixels))
print(count_green_in_png(map_at(*london_location)))
3006

We’ll also need a function to get an evenly spaced set of places between two endpoints:

def location_sequence(start, end, steps):
    lats = np.linspace(start[0], end[0], steps)  # "Linearly spaced" data
    longs = np.linspace(start[1], end[1], steps)
    return np.vstack([lats, longs]).transpose()
location_sequence(geolocate("London"), geolocate("Cambridge"), 5)
array([[ 5.15074456e+01, -1.27765300e-01],
       [ 5.16819670e+01, -6.61580500e-02],
       [ 5.18564885e+01, -4.55080000e-03],
       [ 5.20310099e+01,  5.70564500e-02],
       [ 5.22055314e+01,  1.18663700e-01]])

2.5.6 Visualising green content#

We should display the green content to check our work:

def show_green_in_png(data):
    pixels = imageio.imread(BytesIO(data))  # Get our PNG image as rows of pixels
    green = is_green(pixels)

    out = green[:, :, np.newaxis] * np.array([0, 1, 0])[np.newaxis, np.newaxis, :]

    buffer = BytesIO()
    imageio.imwrite(buffer, out, format="png")
    return buffer.getvalue()
Image(map_at(*london_location, satellite=True))
../_images/02_05_data_analysis_example_46_0.jpg
Image(show_green_in_png(map_at(*london_location, satellite=True)))
/tmp/ipykernel_11660/3094229650.py:2: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  pixels = imageio.imread(BytesIO(data))  # Get our PNG image as rows of pixels
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/PIL/Image.py:3098, in fromarray(obj, mode)
   3097 try:
-> 3098     mode, rawmode = _fromarray_typemap[typekey]
   3099 except KeyError as e:

KeyError: ((1, 1, 3), '<i8')

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[23], line 1
----> 1 Image(show_green_in_png(map_at(*london_location, satellite=True)))

Cell In[21], line 8, in show_green_in_png(data)
      5 out = green[:, :, np.newaxis] * np.array([0, 1, 0])[np.newaxis, np.newaxis, :]
      7 buffer = BytesIO()
----> 8 imageio.imwrite(buffer, out, format="png")
      9 return buffer.getvalue()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/imageio/v2.py:397, in imwrite(uri, im, format, **kwargs)
    395 imopen_args["legacy_mode"] = True
    396 with imopen(uri, "wi", **imopen_args) as file:
--> 397     return file.write(im, **kwargs)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/imageio/plugins/pillow.py:444, in PillowPlugin.write(self, ndimage, mode, format, is_batch, **kwargs)
    441     ndimage = ndimage[None, ...]
    443 for frame in ndimage:
--> 444     pil_frame = Image.fromarray(frame, mode=mode)
    445     if "bits" in kwargs:
    446         pil_frame = pil_frame.quantize(colors=2 ** kwargs["bits"])

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/PIL/Image.py:3102, in fromarray(obj, mode)
   3100         typekey_shape, typestr = typekey
   3101         msg = f"Cannot handle this data type: {typekey_shape}, {typestr}"
-> 3102         raise TypeError(msg) from e
   3103 else:
   3104     rawmode = mode

TypeError: Cannot handle this data type: (1, 1, 3), <i8

2.5.7 Looping#

We can loop over each element in our list of coordinates, and get a map for that place:

for location in location_sequence(geolocate("London"), geolocate("Birmingham"), 4):
    display(Image(map_at(*location)))
../_images/02_05_data_analysis_example_50_0.jpg ../_images/02_05_data_analysis_example_50_1.jpg ../_images/02_05_data_analysis_example_50_2.jpg ../_images/02_05_data_analysis_example_50_3.jpg

So now we can count the green from London to Birmingham!

[
    count_green_in_png(map_at(*location))
    for location in location_sequence(geolocate("London"), geolocate("Birmingham"), 10)
]
[3258, 37141, 59293, 22398, 30123, 56351, 60224, 53067, 137132, 143993]

2.5.8 Plotting graphs#

Let’s plot a graph.

import matplotlib.pyplot as plt

plt.plot(
    [
        count_green_in_png(map_at(*location))
        for location in location_sequence(
            geolocate("London"), geolocate("Birmingham"), 10
        )
    ]
)
[<matplotlib.lines.Line2D at 0x10aedd2e0>]
../_images/02_05_data_analysis_example_55_1.png

From a research perspective, of course, this code needs a lot of work. But I hope the power of using programming is clear.

2.5.9 Composing Program Elements#

We built little pieces of useful code, to:

  • Find latitude and longitude of a place

  • Get a map at a given latitude and longitude

  • Decide whether a (red,green,blue) triple is mainly green

  • Decide whether each pixel is mainly green

  • Plot a new image showing the green places

  • Find evenly spaced points between two places

By putting these together, we can make a function which can plot this graph automatically for any two places:

def green_between(start, end, steps):
    return [
        count_green_in_png(map_at(*location))
        for location in location_sequence(geolocate(start), geolocate(end), steps)
    ]
plt.plot(green_between("New York", "Chicago", 20))
[<matplotlib.lines.Line2D at 0x10af8c070>]
../_images/02_05_data_analysis_example_61_1.png

And that’s it! We’ve used Python to analyse data from an internet API and visualise it in interesting ways.