What can go wrong?

Consider a simple python computational model of chemical reaction networks:

class Element:
    def __init__(self, symbol, number):
        self.symbol = symbol
        self.number = number

    def __str__(self):
        return str(self.symbol)

class Molecule:
    def __init__(self, mass):
        self.elements = {}  # Map from element to number of that element in the molecule
        self.mass = mass

    def add_element(self, element, number):
        self.elements[element] = number

    def as_subscript(number):
        if number == 1:
            return ""
        if number < 10:
            return "_" + str(number)
        return "_{" + str(number) + "}"

    def __str__(self):
        return "".join(
                str(element) + Molecule.as_subscript(self.elements[element])
                for element in self.elements

class Reaction:
    def __init__(self):
        self.reactants = {}  # Map from reactants to stoichiometries
        self.products = {}  # Map from products to stoichiometries

    def add_reactant(self, reactant, stoichiometry):
        self.reactants[reactant] = stoichiometry

    def add_product(self, product, stoichiometry):
        self.products[product] = stoichiometry

    def print_if_not_one(number):
        if number == 1:
            return ""
            return str(number)

    def side_as_string(side):
        return " + ".join(
                Reaction.print_if_not_one(side[molecule]) + str(molecule)
                for molecule in side

    def __str__(self):
        return (
            + " \\rightarrow "
            + Reaction.side_as_string(self.products)

class System:
    def __init__(self):
        self.reactions = []

    def add_reaction(self, reaction):

    def __str__(self):
        return "\n".join(self.reactions)
c = Element("C", 12)
o = Element("O", 8)
h = Element("H", 1)

co2 = Molecule(44.01)
co2.add_element(c, 1)
co2.add_element(o, 2)

h2o = Molecule(18.01)
h2o.add_element(h, 2)
h2o.add_element(o, 1)

o2 = Molecule(32.00)
o2.add_element(o, 2)

glucose = Molecule(180.16)
glucose.add_element(c, 6)
glucose.add_element(h, 12)
glucose.add_element(o, 6)

combustion = Reaction()
combustion.add_reactant(glucose, 1)
combustion.add_reactant(o2, 6)
combustion.add_product(co2, 6)
combustion.add_product(h2o, 6)

C_6H_{12}O_6 + 6O_2 \rightarrow 6CO_2 + 6H_2O
from IPython.display import display, Math

\[\displaystyle C_6H_{12}O_6 + 6O_2 \rightarrow 6CO_2 + 6H_2O\]

We could reasonably consider using the LaTeX representation of this as a fileformat for reactions. (Though we need to represent the molecular mass in some way we’ve not thought of.)

We’ve already shown how to serialise the data to this representation. How hard would it be to deserialise it?

Actually, pretty darn hard.

In the wild, such datafiles will have all kinds of complications, and making a hand-coded string parser for such text will be highly problematic. In this lecture, we’re going to look at the kind of problems that can arise, and some standard ways to solve them, which will lead us to the idea of normalisation in databases.

Next lecture, we’ll look at how we might create a file format which does indeed look like such a fluent mathematical representation, which we’ll call a Domain Specific Language.

Non-normal data representations: First normal form.

Consider the mistakes that someone might make when typing in a reaction in the above format: they could easily, if there are multiple reactions in a system, type glucose in correctly as C_6H_{12}O_6 the first time, but the second type accidentally type C_6H_{12}o_6.

The system wouldn’t know these are the same molecule, so, for example, if building a mass action model of reaction kinetics, the differential equations would come out wrong.

The natural-seeming solution to this is, in your data format, to name each molecule and atom, and consider a representation in terms of CSV files:

%%writefile molecules.csv
# name, elements, numbers

water, H O, 2 1
oxygen, O, 2
carbon_dioxide, C O, 1 2
glucose, C H O, 6 12 6
Overwriting molecules.csv
%%writefile reactions.csv

# name, reactants, products, reactant_stoichiometries, product_stoichiometries

combustion_of_glucose, glucose oxygen, carbon_dioxide water, 1 6, 6 6
Overwriting reactions.csv

Writing a parser for these files would be very similar to the earthquake problem that you’ve already encountered.

However, the existence of multiple values in one column indicates that this file format is NOT first normal form (1NF).

Note: A table is in first normal form if: every column is unique; no rows are duplicated; each column/row intersection contains only one value.

It is not uncommon to encounter file-formats that violate 1NF in the wild. The main problem with them is that you will eventually have to deal with the separation character that you picked (; in this case) turning up in someone’s content and you’ll need to work out how to escape it.

The art of designing serialisations which work as row-and-column value tables for more complex data structures is the core of database design.

Normalising the reaction model - a bad first attempt.

How could we go about normalising this model. One choice is to list each molecule-element relation as a separate table row:

%%writefile molecules.csv
# name, element, number

water, H, 2
water, O, 1
oxygen, O, 2
carbon_dioxide, C, 1
carbon_dioxide, O, 2
Overwriting molecules.csv

This is fine as far as it goes, but, it falls down as soon as we want to associate another property with a molecule and atom.

We could repeat the data each time:

%%writefile molecules.csv
# name, element, number, molecular_mass, atomic_number

water, H, 2, 18.01, 1
water, O, 1, 18.01, 8
oxygen, O, 2, 32.00, 8
Overwriting molecules.csv

The existence of the same piece of information in multiple locations (eg. the 18.01 molecular mass of water) indicates that this file format is NOT second normal form (2NF).

However, this would allow our data file to be potentially be self-inconsistent, violating the design principle that each piece of information should be stated only once. A data structure of this type is said to be NOT second normal form.

Note: A table is in second normal form if: it is in first normal form; none of its attributes depend on any other attribute except the primary key.

Normalising the model - relations and keys

So, how do we do this correctly?

We need to specify data about each molecule, reaction and atom once, and then specify the relations between them.

%%writefile molecules.csv
# name, molecular_mass

water, 18.01
oxygen, 32.00
Overwriting molecules.csv
%%writefile atoms.csv

# symbol, atomic number
H, 1
O, 8
C, 6
Overwriting atoms.csv
%%writefile atoms_in_molecules.csv

# rel_number, molecule, symbol, number
0, water, H, 2
1, water, O, 1
2, oxygen, O, 2
Overwriting atoms_in_molecules.csv

This last table is called a join table - and is needed whenever we want to specify a “many-to-many” relationship. (Each atom can be in more than one molecule, and each molecule has more than one atom.)

Note each table needs to have a set of columns which taken together form a unique identifier for that row; called a “key”. If more than one is possible, we choose one and call it a primary key. (And in practice, we normally choose a single column for this: hence the ‘rel_number’ column, though the tuple {molecule, symbol} here is another candidate key.)

Now, proper database tools use much more sophisticated representations than just csv files - including indices to enable hash-table like efficient lookups, and support for managing multiple users at the same time.

However, the principles of database normalisation and the relational model will be helpful right across our thinking about data representation, whether these are dataframes in Pandas, tensors in tensorflow, or anything else…

Making a database - SQLite

Let’s look at how we would use a simple database in Python to represent atoms and molecules. If you’ve never seen SQL before, you might want to attend an introductory course, such as one of the ‘Software Carpentry’ sessions. Here we’re going to assume some existing knowledge but we will use a Python-style way to interact with databases instead of relying on raw SQL.

import os

    print("Removing database to start again from scratch")
except FileNotFoundError:
    print("No DB since this notebook was last run")
Removing database to start again from scratch
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///molecules.db", echo=True)

SQLite is a simple very-lightweight database tool - without support for concurrent users - but it’s great for little hacks like this. For full-on database work you’ll probably want to use a more fully-featured database like https://www.postgresql.org.

The metadata for the database describing the tables present, and their columns, is defined in Python using SQLAlchemy, the leading python database tool, thus:

from sqlalchemy import Table, Column, Integer, Float, String, MetaData, ForeignKey

metadata = MetaData()
molecules = Table(
    Column("name", String, primary_key=True),
    Column("mass", Float),

atoms = Table(
    Column("symbol", String, primary_key=True),
    Column("number", Integer),
atoms_in_molecules = Table(
    Column("atom", ForeignKey("atoms.symbol")),
    Column("molecule", ForeignKey("molecules.name")),
    Column("number", Integer),
2022-08-08 15:27:09,732 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2022-08-08 15:27:09,734 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("molecules")
2022-08-08 15:27:09,735 INFO sqlalchemy.engine.Engine [raw sql] ()
2022-08-08 15:27:09,738 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("molecules")
2022-08-08 15:27:09,739 INFO sqlalchemy.engine.Engine [raw sql] ()
2022-08-08 15:27:09,741 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("atoms")
2022-08-08 15:27:09,741 INFO sqlalchemy.engine.Engine [raw sql] ()
2022-08-08 15:27:09,743 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("atoms")
2022-08-08 15:27:09,744 INFO sqlalchemy.engine.Engine [raw sql] ()
2022-08-08 15:27:09,746 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("atoms_molecules")
2022-08-08 15:27:09,747 INFO sqlalchemy.engine.Engine [raw sql] ()
2022-08-08 15:27:09,747 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("atoms_molecules")
2022-08-08 15:27:09,749 INFO sqlalchemy.engine.Engine [raw sql] ()
2022-08-08 15:27:09,751 INFO sqlalchemy.engine.Engine 
CREATE TABLE molecules (
	mass FLOAT, 
2022-08-08 15:27:09,751 INFO sqlalchemy.engine.Engine [no key 0.00054s] ()
2022-08-08 15:27:09,755 INFO sqlalchemy.engine.Engine 
	number INTEGER, 
	PRIMARY KEY (symbol)
2022-08-08 15:27:09,756 INFO sqlalchemy.engine.Engine [no key 0.00067s] ()
2022-08-08 15:27:09,759 INFO sqlalchemy.engine.Engine 
CREATE TABLE atoms_molecules (
	atom VARCHAR, 
	molecule VARCHAR, 
	number INTEGER, 
	FOREIGN KEY(atom) REFERENCES atoms (symbol), 
	FOREIGN KEY(molecule) REFERENCES molecules (name)
2022-08-08 15:27:09,761 INFO sqlalchemy.engine.Engine [no key 0.00150s] ()
2022-08-08 15:27:09,765 INFO sqlalchemy.engine.Engine COMMIT

Note the SQL syntax for creating tables is generated by the python tool, and sent to the database server.

CREATE TABLE molecules (
	mass FLOAT, 

We’ll turn off our automatic printing of all the raw sql to avoid this notebook being unreadable.

engine.echo = False

We can also write data to our database using this python tooling:

ins = molecules.insert().values(name="water", mass="18.01")
conn = engine.connect()
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f14dcfa05e0>

And query it:

from sqlalchemy.sql import select

s = select([molecules])
result = conn.execute(s)

If we have enough understanding of SQL syntax, we can use appropriate join statements to find, for example, the mass of all molecules which contain oxygen:

conn.execute(molecules.insert().values(name="oxygen", mass="32.00"))
conn.execute(atoms.insert().values(symbol="O", number=8))
conn.execute(atoms.insert().values(symbol="H", number=1))
conn.execute(atoms_in_molecules.insert().values(molecule="water", atom="O", number=1))
conn.execute(atoms_in_molecules.insert().values(molecule="oxygen", atom="O", number=1))
conn.execute(atoms_in_molecules.insert().values(molecule="water", atom="H", number=2))
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f14dcf03ee0>
result = conn.execute(
    SELECT mass
    FROM   molecules
          JOIN atoms_molecules
            ON molecules.NAME = atoms_molecules.molecule
          JOIN atoms
            ON atoms.symbol = atoms_molecules.atom
    WHERE  atoms.symbol = 'H'

But we can do much better…

Data and Objects - the Object-Relational-Mapping

We notice that when we find a correct relational model for our data, many of the rows are suggestive of exactly the data we would expect to supply to an object constructor - data about an object. References to keys of other tables in rows suggest composition relations while many-to-many join tables often represent aggregation relationships, and data about the relationship.

As a result of this, powerful tools exist to automatically create object structures from database schema, including saving and loading.

import os

    print("Removing database to start again from scratch")
except FileNotFoundError:
    print("No DB since this notebook was last run")
Removing database to start again from scratch
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///molecules.db")
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Element(Base):
    __tablename__ = "atoms"
    symbol = Column(String, primary_key=True)
    number = Column(Integer)
    molecules = relationship("AtomsPerMolecule", backref="atom")
class Molecule(Base):
    __tablename__ = "molecules"
    name = Column(String, primary_key=True)
    mass = Column(Float)
    atoms = relationship("AtomsPerMolecule", backref="molecule")
class AtomsPerMolecule(Base):
    __tablename__ = "atoms_per_molecule"
    id = Column(Integer, primary_key=True)
    atom_id = Column(None, ForeignKey("atoms.symbol"))
    molecule_id = Column(None, ForeignKey("molecules.name"))
    number = Column(Integer)

If we now create our tables, the system will automatically create a DB:

engine.echo = False

And we can create objects with a simple interface that looks just like ordinary classes:

oxygen = Element(symbol="O", number=8)
hydrogen = Element(symbol="H", number=1)
elements = [oxygen, hydrogen]
water = Molecule(name="water", mass=18.01)
oxygen_m = Molecule(name="oxygen", mass=16.00)
hydrogen_m = Molecule(name="hydrogen", mass=2.02)
molecules = [water, oxygen_m, hydrogen_m]
# Note that we are using the `backref` name to construct the `atom_id` and `molecule_id`.
# These lookup instances of Element and Molecule that are already in our database
amounts = [
    AtomsPerMolecule(atom=oxygen, molecule=water, number=1),
    AtomsPerMolecule(atom=hydrogen, molecule=water, number=2),
    AtomsPerMolecule(atom=oxygen, molecule=oxygen_m, number=2),
    AtomsPerMolecule(atom=hydrogen, molecule=hydrogen_m, number=2),
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()
session.bulk_save_objects(elements + molecules + amounts)

This is a very powerful technique - we get our class-type interface in python, with database persistence and searchability for free!

Moving on from databases

Databases are often a good choice for storing data, but can only be interacted with programmatically. Often, we want to make a file format to represent our dataset which can be easily replicated or shared. The next part of this module focuses on the design of such file-formats, both binary and human-readable.

One choice, now we know about it, is to serialise all the database tables as CSV:

import pandas
'SELECT molecules.name, molecules.mass \nFROM molecules'
dataframe = pandas.read_sql(session.query(Molecule).statement, session.bind)
name mass
0 water 18.01
1 oxygen 16.00
2 hydrogen 2.02

Deserialising is also easy:

%%writefile atoms.csv

Overwriting atoms.csv
from pathlib import Path

atoms = pandas.read_csv(open("atoms.csv"))
symbol number
0 C 6
1 N 7
atoms.to_sql("atoms", session.bind, if_exists="append", index=False)

However, we know from last term that another common choice is to represent such complicated data structures in YAML. The implications of what we’ve just learned for serialising to and from such structured data is the topic of the next lecture.