3.1 Field and Record Data#

Estimated time to complete this notebook: 20 minutes

3.1.1 Separated Value Files#

Let’s go back to the sunspots example from the previous module. We had downloaded some semicolon separated data and decided it was better to use a library than to write our own parser.

import requests

spots = requests.get("http://www.sidc.be/silso/INFO/snmtotcsv.php", timeout=60)
spots.text.split("\n")[0]
'1749;01;1749.042;  96.7; -1.0;   -1;1'

We want to work programmatically with Separated Value files.

These are files which have:

  • Each record on a line

  • Each record has multiple fields

  • Fields are separated by some separator

Typical separators are the space, tab, comma, and semicolon separated values files, e.g.:

  • Space separated value (e.g. field1 "field two" field3 )

  • Comma separated value (e.g. field1, another field, "wow, another field")

Comma-separated-value is abbreviated CSV, and tab separated value TSV.

CSV is also used to refer to all the different sub-kinds of separated value files, i.e. some people use CSV to refer to tab, space and semicolon separated files.

CSV is not a particularly great data format, because it forces your data model to be a list of lists. Richer file formats describe “serialisations” for dictionaries and for deeper-than-two nested list structures as well.

Nevertheless, CSV files are very popular because you can always export spreadsheets as CSV files, (each cell is a field, each row is a record)

3.1.2 CSV variants#

Some CSV formats define a comment character, so that rows beginning with, e.g., a #, are not treated as data, but give a human comment.

Some CSV formats define a three-deep list structure, where a double-newline separates records into blocks.

Some CSV formats assume that the first line defines the names of the fields, e.g.:

name, age
James, 39
Will, 2

3.1.3 Python CSV readers#

The Python standard library has a csv module. However, it’s less powerful than the CSV capabilities in other libraries such as numpy. Here we will use pandas which is built on top of numpy.

import pandas as pd
df = pd.read_csv("http://www.sidc.be/silso/INFO/snmtotcsv.php", sep=";", header=None)
df.head()
0 1 2 3 4 5 6
0 1749 1 1749.042 96.7 -1.0 -1 1
1 1749 2 1749.123 104.3 -1.0 -1 1
2 1749 3 1749.204 116.7 -1.0 -1 1
3 1749 4 1749.288 92.8 -1.0 -1 1
4 1749 5 1749.371 141.7 -1.0 -1 1

Pandas read_csv is a powerful CSV reader tool. A path to the data is given, this can be something on a local machine, or in this case the path is a url.

We used the sep optional argument to specify the delimeter. The optional argument header specifies if the data contains headers, and if so; the row numbers to use as column names.

The data is loaded into a DataFrame. The head method shows us the first 5 entries in the dataframe. The tail method shows us the last 5 entries.

df.tail()
0 1 2 3 4 5 6
3295 2023 8 2023.623 114.8 15.0 1247 1
3296 2023 9 2023.705 134.2 18.3 1261 1
3297 2023 10 2023.790 99.4 16.0 958 0
3298 2023 11 2023.873 105.4 16.7 809 0
3299 2023 12 2023.958 114.2 17.9 619 0
df[3][0]
96.7

We can now plot the “Sunspot cycle”:

df.plot(x=2, y=3)
<Axes: xlabel='2'>
../_images/03_01_fields_and_records_24_1.png

The plot command accepted an series of ‘X’ values and an series of ‘Y’ values, identified by their column number in this case, as the dataframe does not have (useful) column headers yet.

3.1.4 Naming Columns#

As it happens, the columns definitions can be found on the source website (http://www.sidc.be/silso/infosnmtot)

CSV

Filename: SN_m_tot_V2.0.csv Format: Comma Separated values (adapted for import in spreadsheets) The separator is the semicolon ‘;’.

Contents:

  • Column 1-2: Gregorian calendar date

    • Year

    • Month

  • Column 3: Date in fraction of year.

  • Column 4: Monthly mean total sunspot number.

  • Column 5: Monthly mean standard deviation of the input sunspot numbers.

  • Column 6: Number of observations used to compute the monthly mean total sunspot number.

  • Column 7: Definitive/provisional marker. ‘1’ indicates that the value is definitive. ‘0’ indicates that the value is still provisional.

We can actually specify this to the formatter:

df_w_names = pd.read_csv(
    "http://www.sidc.be/silso/INFO/snmtotcsv.php",
    sep=";",
    header=None,
    names=["year", "month", "date", "mean", "deviation", "observations", "definitive"],
)
df_w_names.head()
year month date mean deviation observations definitive
0 1749 1 1749.042 96.7 -1.0 -1 1
1 1749 2 1749.123 104.3 -1.0 -1 1
2 1749 3 1749.204 116.7 -1.0 -1 1
3 1749 4 1749.288 92.8 -1.0 -1 1
4 1749 5 1749.371 141.7 -1.0 -1 1
df_w_names.plot(x="date", y="mean")
<Axes: xlabel='date'>
../_images/03_01_fields_and_records_31_1.png

Note: The plot method used for the DataFrame is just a wrapper around the matplotlib function plt.plot():

3.1.5 Typed Fields#

It’s also often useful to check, and if necessary specify, the datatype of each field.

df_w_names.dtypes  # Check the data types of all columns in the DataFrame
year              int64
month             int64
date            float64
mean            float64
deviation       float64
observations      int64
definitive        int64
dtype: object

In this case the data types seem sensible, however if we wanted to convert the year into a floating point number instead, we could via:

df_w_names["year"] = df_w_names["year"].astype("float64")
df_w_names.dtypes
year            float64
month             int64
date            float64
mean            float64
deviation       float64
observations      int64
definitive        int64
dtype: object
df_w_names.head()
year month date mean deviation observations definitive
0 1749.0 1 1749.042 96.7 -1.0 -1 1
1 1749.0 2 1749.123 104.3 -1.0 -1 1
2 1749.0 3 1749.204 116.7 -1.0 -1 1
3 1749.0 4 1749.288 92.8 -1.0 -1 1
4 1749.0 5 1749.371 141.7 -1.0 -1 1

3.1.6 Filtering data#

Sometimes it is necessary to filter data, for example to only see the sunspots for the year 2018 you would use:

df_twenty_eighteen = df_w_names[(df_w_names["year"] == 2018)]
df_twenty_eighteen.head(20)
year month date mean deviation observations definitive
3228 2018.0 1 2018.042 6.8 1.5 701 1
3229 2018.0 2 2018.122 10.7 1.1 917 1
3230 2018.0 3 2018.204 2.5 0.4 1081 1
3231 2018.0 4 2018.286 8.9 1.3 996 1
3232 2018.0 5 2018.371 13.1 1.6 1234 1
3233 2018.0 6 2018.453 15.6 1.6 1070 1
3234 2018.0 7 2018.538 1.6 0.6 1438 1
3235 2018.0 8 2018.623 8.7 1.0 1297 1
3236 2018.0 9 2018.705 3.3 0.6 1223 1
3237 2018.0 10 2018.790 4.9 1.2 1097 1
3238 2018.0 11 2018.873 4.9 0.6 771 1
3239 2018.0 12 2018.958 3.1 0.5 786 1

Even though we used

df_twenty_eighteen.head(20)

to show us the first 20 results from the dataframe, only 12 are shown as there are only 12 months in a year

If we wanted all data from 1997 to 1999 we could via:

df_nineties = df_w_names[(df_w_names["year"] >= 1997) & (df_w_names["year"] < 2000)]
df_nineties.head()
year month date mean deviation observations definitive
2976 1997.0 1 1997.042 7.4 3.2 497 1
2977 1997.0 2 1997.123 11.0 2.9 545 1
2978 1997.0 3 1997.204 12.1 2.4 627 1
2979 1997.0 4 1997.288 23.0 3.3 663 1
2980 1997.0 5 1997.371 25.4 2.8 716 1
df_nineties.tail()
year month date mean deviation observations definitive
3007 1999.0 8 1999.623 142.3 12.9 649 1
3008 1999.0 9 1999.707 106.3 6.5 624 1
3009 1999.0 10 1999.790 168.7 10.4 531 1
3010 1999.0 11 1999.874 188.3 12.3 406 1
3011 1999.0 12 1999.958 116.8 9.3 404 1