XClose
Menu

What can go wrong?

Consider a simple python computational model of chemical reaction networks:

In [1]:
class Element:
    def __init__(self, symbol, number):
        self.symbol = symbol
        self.number = number
        
    def __str__(self):
        return str(self.symbol)
        
class Molecule:
    def __init__(self, mass):
        self.elements= {} # Map from element to number of that element in the molecule
        self.mass = mass
        
    def add_element(self, element, number):
        self.elements[element] = number
    
    @staticmethod
    def as_subscript(number):
        if number==1:
            return ""
        
        if number<10:
            return "_"+str(number)
        else:
            return "_{"+str(number)+"}"
    
    def __str__(self):
        return ''.join(
            [str(element)+Molecule.as_subscript(self.elements[element])
             for element in self.elements])
  
class Reaction:
    def __init__(self):
        self.reactants = { } # Map from reactants to stoichiometries
        self.products = { } # Map from products to stoichiometries
        
    def add_reactant(self, reactant, stoichiometry):
        self.reactants[reactant] = stoichiometry
        
    def add_product(self, product, stoichiometry):
        self.products[product] = stoichiometry
    
    @staticmethod
    def print_if_not_one(number):
        if number==1:
            return ''
        else: return str(number)
    
    @staticmethod
    def side_as_string(side):
        return " + ".join(
            [Reaction.print_if_not_one(side[molecule]) + str(molecule)
             for molecule in side])
        
    def __str__(self):
        return (Reaction.side_as_string(self.reactants)+
        " \\rightarrow "+Reaction.side_as_string(self.products))
    
class System:
    def __init__(self):
        self.reactions=[]
    def add_reaction(self, reaction):
        self.reactions.append(reaction)
        
    def __str__(self):
        return "\n".join(self.reactions)
In [2]:
c=Element("C", 12)
o=Element("O", 8)
h=Element("H", 1)

co2 = Molecule(44.01)
co2.add_element(c,1)
co2.add_element(o,2)

h2o = Molecule(18.01)
h2o.add_element(h,2)
h2o.add_element(o,1)

o2 = Molecule(32.00)
o2.add_element(o,2)

glucose = Molecule(180.16)
glucose.add_element(c,6)
glucose.add_element(h,12)
glucose.add_element(o,6)

combustion = Reaction()
combustion.add_reactant(glucose,  1)
combustion.add_reactant(o2, 6)
combustion.add_product(co2, 6)
combustion.add_product(h2o, 6)

print(combustion)
C_6H_{12}O_6 + 6O_2 \rightarrow 6CO_2 + 6H_2O
In [3]:
from IPython.display import display, Math
display(Math(str(combustion)))
$\displaystyle C_6H_{12}O_6 + 6O_2 \rightarrow 6CO_2 + 6H_2O$

We could reasonably consider using the latex representation of this as a fileformat for reactions. (Though we need to represent the molecular mass in some way we've not thought of.)

We've already shown how to serialise the data to this representation. How hard would it be to deserialise it?

Actually, pretty darn hard.

In the wild, such datafiles will have all kinds of complications, and making a hand-coded string parser for such text will be highly problematic. In this lecture, we're going to look at the kind of problems that can arise, and some standard ways to solve them, which will lead us to the idea of normalisation in databases.

Next lecture, we'll look at how we might create a file format which does indeed look like such a fluent mathematical representation, which we'll call a Domain Specific Language.

Non-normal data representations: First normal form.

Consider the mistakes that someone might make when typing in a reaction in the above format: they could easily, if there are multiple reactions in a system, type glucose in correctly as C_6H_{12}O_6 the first time, but the second type accidentally type C_6H_{12}o_6.

The system wouldn't know these are the same molecule, so, for example, if building a mass action model of reaction kinetics, the differential equations would come out wrong.

The natural-seeming solution to this is, in your data format, to name each molecule and atom, and consider a representation in terms of CSV files:

In [4]:
%%writefile molecules.csv
# name, elements, numbers

water, H;O , 1;2
oxygen, O , 2
carbon_dioxide, C;O , 1;2
glucose, C;H;O, 6;12;6
Writing molecules.csv
In [5]:
%%writefile reactions.csv

# name, reactants, products, reactant_stoichiometries, product_stoichiometries

combustion_of_glucose, glucose;oxygen, carbon_dioxide;water, 1;6, 6;6
Writing reactions.csv

Writing a parser is, of course, quite easy, but the existence of multiple values in one column is characteristic of a classic error in file-format design: such a file format is said to be NOT first normal form.

There are many reasons why this sucks, but a classic one is that eventually, your separation character within the field will turn up in someone's content, and there'll be all kinds of escaping problems.

Unfortunately, such things are found in the wild all the time.

The art of designing serialisations which work as row-and-column value tables for more complex data structures is the core of database design.

Normalising the reaction model - a bad first attempt.

How could we go about normalising this model. One choice is to list each molecule-element relation as a separate table row:

In [6]:
%%writefile molecules.csv
# name, element, number

water, H , 1
water, O, 2
oxygen, O , 2
carbon_dioxide, C , 1
carbon_dioxide, O, 2
Overwriting molecules.csv

This is fine as far as it goes, but, it falls down as soon as we want to associate another property with a molecule and atom.

We could repeat the data each time:

In [7]:
%%writefile molecules.csv
# name, element, number, molecular_mass, atomic_number

water, H , 2, 18.01, 1
water, O, 1, 18.01, 8
oxygen, O, 2, 32.00, 8
Overwriting molecules.csv

which would allow our data file to be potentially be self-inconsistent, violating the design principle that each piece of information should be stated only once. Indeed, this is so obviously stupid that it's hard to imagine doing it, but the mistake is so common in the wild that it has a name: such a structure is said to be NOT second normal form.

Normalising the model - relations and keys

So, how do we do this correctly?

We need to specify data about each molecule, reaction and atom once, and then specify the relations between them.

In [8]:
%%writefile molecules.csv
# name, molecular_mass

water, 18.01
oxygen, 32.00
Overwriting molecules.csv
In [9]:
%%writefile atoms.csv

# symbol, atomic number
H, 1
O, 8
C, 6
Writing atoms.csv
In [10]:
%%writefile atoms_in_molecules.csv

# rel_number, molecule, symbol, number
0, water, H , 1
1, water, O, 2
2, oxygen, O , 2
Writing atoms_in_molecules.csv

This last table is called a join table - and is needed whenever we want to specify a "many-to-many" relationship. (Each atom can be in more than one molecule, and each molecule has more than one atom.)

Note each table needs to have a set of columns which taken together form a unique identifier for that row; called a "key". If more than one is possible, we choose one and call it a primary key. (And in practice, we normally choose a single column for this: hence the 'rel_number' column, though the tuple {molecule, symbol} here is another candidate key.)

Now, proper database tools use much more sophisticated representations than just csv files - including indices to enable hash-table like efficient lookups, and support for managing multiple users at the same time.

However, the principles of database normalisation and the relational model will be helpful right across our thinking about data representation, whether these are dataframes in Pandas, tensors in tensorflow, or anything else...

Making a database - SQLite

Let's look at how we would use a simple database in Python to represent atoms and molecules. If you've never seen SQL before, you should attend one of the 'software carpentry' sessions. However, we expect you have, so this builds on raw sql to look at a more python-style way to interact with databases.

In [11]:
import os
try:
    os.remove('molecules.db')
    print("Remove database to teach again from scratch")
except FileNotFoundError:
    print("No DB since this notebook was last run")
No DB since this notebook was last run
In [12]:
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:///molecules.db', echo=True)

Sqlite is a simple very-lightweight database tool - without support for concurrent users - but it's great for little hacks like this. For full-on database work you'll probably want to use https://www.postgresql.org .

The metadata for the database describing the tables present, and their columns, is defined in Python using SQLAlchemy, the leading python database tool, thus:

In [13]:
from sqlalchemy import Table, Column, Integer, Float, String, MetaData, ForeignKey
metadata = MetaData()
molecules = Table('molecules', metadata,
                  Column('name', String, primary_key=True),
                  Column('mass', Float))

atoms = Table('atoms', metadata,
      Column('symbol', String, primary_key=True),
      Column('number', Integer)
             )
In [14]:
atoms_in_molecules = Table('atoms_molecules', metadata,
       Column('atom', None, ForeignKey('atoms.symbol')),
       Column('molecule', None, ForeignKey('molecules.name')),
       Column('number', Integer)
)
In [15]:
metadata.create_all(engine)
print(metadata)
2019-01-21 13:16:45,330 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2019-01-21 13:16:45,332 INFO sqlalchemy.engine.base.Engine ()
2019-01-21 13:16:45,334 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2019-01-21 13:16:45,335 INFO sqlalchemy.engine.base.Engine ()
2019-01-21 13:16:45,337 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("molecules")
2019-01-21 13:16:45,338 INFO sqlalchemy.engine.base.Engine ()
2019-01-21 13:16:45,341 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("atoms")
2019-01-21 13:16:45,342 INFO sqlalchemy.engine.base.Engine ()
2019-01-21 13:16:45,343 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("atoms_molecules")
2019-01-21 13:16:45,344 INFO sqlalchemy.engine.base.Engine ()
2019-01-21 13:16:45,346 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE molecules (
	name VARCHAR NOT NULL, 
	mass FLOAT, 
	PRIMARY KEY (name)
)


2019-01-21 13:16:45,347 INFO sqlalchemy.engine.base.Engine ()
2019-01-21 13:16:45,358 INFO sqlalchemy.engine.base.Engine COMMIT
2019-01-21 13:16:45,359 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE atoms (
	symbol VARCHAR NOT NULL, 
	number INTEGER, 
	PRIMARY KEY (symbol)
)


2019-01-21 13:16:45,361 INFO sqlalchemy.engine.base.Engine ()
2019-01-21 13:16:45,372 INFO sqlalchemy.engine.base.Engine COMMIT
2019-01-21 13:16:45,374 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE atoms_molecules (
	atom VARCHAR, 
	molecule VARCHAR, 
	number INTEGER, 
	FOREIGN KEY(atom) REFERENCES atoms (symbol), 
	FOREIGN KEY(molecule) REFERENCES molecules (name)
)


2019-01-21 13:16:45,375 INFO sqlalchemy.engine.base.Engine ()
2019-01-21 13:16:45,386 INFO sqlalchemy.engine.base.Engine COMMIT
MetaData(bind=None)

Note the SQL syntax for creating tables is generated by the python tool, and sent to the database server.

CREATE TABLE molecules (
    name VARCHAR NOT NULL, 
    mass FLOAT, 
    PRIMARY KEY (name)
)

We'll turn off our automatic printing of all the raw sql to avoid this notebook being unreadable.

In [16]:
engine.echo=False

We can also write data to our database using this python tooling:

In [17]:
ins = molecules.insert().values(name='water', mass='18.01')
In [18]:
conn = engine.connect()
conn.execute(ins)
Out[18]:
<sqlalchemy.engine.result.ResultProxy at 0x2b42d6a58cf8>

And query it:

In [19]:
from sqlalchemy.sql import select
s = select([molecules])
result = conn.execute(s)
In [20]:
print(result.fetchone()['mass'])
18.01

If we have enough understanding of sql syntax, we can use appropriate join statements to find, for example, the mass of all molecules which contain oxygen:

In [21]:
conn.execute(molecules.insert().values(name='oxygen', mass='16.00'))
conn.execute(atoms.insert().values(symbol='O', number=8))
conn.execute(atoms.insert().values(symbol='H', number=1))
conn.execute(atoms_in_molecules.insert().values(molecule='water',atom='O',number=1))
conn.execute(atoms_in_molecules.insert().values(molecule='oxygen',atom='O',number=2))
conn.execute(atoms_in_molecules.insert().values(molecule='water',atom='H', number=1))
Out[21]:
<sqlalchemy.engine.result.ResultProxy at 0x2b42d6aa2860>
In [22]:
result=conn.execute(
    'SELECT mass FROM molecules JOIN atoms_molecules ' + 
    'ON molecules.name = atoms_molecules.molecule JOIN atoms ' +
    'ON atoms.symbol = atoms_molecules.atom ' +
    'WHERE atoms.symbol = "H"')
print(result.fetchall())
[(18.01,)]

But we can do much better...

Data and Objects - the Object-Relational-Mapping

We notice that when we find a correct relational model for our data, many of the rows are suggestive of exactly the data we would expect to supply to an object constructor - data about an object. References to keys of other tables in rows suggest composition relations while many-to-many join tables often represent aggregation relationships, and data about the relationship.

As a result of this, powerful tools exist to automatically create object structures from database schema, including saving and loading.

In [23]:
import os
try:
    os.remove('molecules.db')
    print("Remove database to teach again from scratch")
except FileNotFoundError:
    print("No DB since this notebook was last run")
Remove database to teach again from scratch
In [24]:
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:///molecules.db')
In [25]:
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()

class Element(Base):
    __tablename__ = "atoms"
    symbol = Column(String, primary_key=True)
    number = Column(Integer)
    molecules = relationship("AtomsPerMolecule", backref="atom")
In [26]:
class Molecule(Base):
    __tablename__ = "molecules"
    name = Column(String, primary_key=True)
    mass = Column(Float)
    atoms = relationship("AtomsPerMolecule", backref="molecule")
In [27]:
class AtomsPerMolecule(Base):
    __tablename__ = 'atoms_per_molecule'
    id = Column(Integer, primary_key=True)
    atom_id = Column(None, ForeignKey('atoms.symbol'))
    molecule_id = Column(None, ForeignKey('molecules.name'))
    number = Column(Integer)

If we now create our tables, the system will automatically create a DB:

In [28]:
Base.metadata.create_all(engine)
In [29]:
engine.echo=False

And we can create objects with a simple interface that looks just like ordinary classes:

In [30]:
oxygen = Element(symbol='O',number=8)
hydrogen = Element(symbol='H', number=1)
elements = [oxygen, hydrogen]
In [31]:
water = Molecule(name='water',mass=18.01)
oxygen_m = Molecule(name='oxygen', mass=16.00)
hydrogen_m = Molecule(name='hydrogen', mass=2.02)
molecules = [water, oxygen_m, hydrogen_m]
In [32]:
amounts = [
    AtomsPerMolecule(atom=oxygen, molecule=water, number =1),
    AtomsPerMolecule(atom=hydrogen, molecule=water, number =2),
    AtomsPerMolecule(atom=oxygen, molecule=oxygen_m, number =2),
    AtomsPerMolecule(atom=hydrogen, molecule=hydrogen_m, number =2)
]
In [33]:
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
session=Session()
In [34]:
session.bulk_save_objects(elements+molecules+amounts)
In [35]:
oxygen.molecules[0].molecule.name
Out[35]:
'water'
In [36]:
session.query(Molecule).all()[0].name
Out[36]:
'water'
In [37]:
session.commit()

This is a very powerful technique - we get our class-type interface in python, with database persistence and searchability for free!

Moving on from databases

However, this is a lecture primarily on file-format design. Databases are one choice, but we often want to make a human-readable file format to represent our dataset.

One choice, now we know about it, is to serialise all the database tables as CSV:

In [38]:
import pandas
In [39]:
str(session.query(Molecule).statement)
Out[39]:
'SELECT molecules.name, molecules.mass \nFROM molecules'
In [40]:
dataframe = pandas.read_sql(session.query(Molecule).statement, session.bind)
In [41]:
dataframe
Out[41]:
name mass
0 water 18.01
1 oxygen 16.00
2 hydrogen 2.02
In [42]:
import sys
dataframe.to_csv(sys.stdout)
,name,mass
0,water,18.01
1,oxygen,16.0
2,hydrogen,2.02

Deserialising is also easy:

In [43]:
%%writefile atoms.csv

symbol,number
C, 6
N, 7
Overwriting atoms.csv
In [44]:
from pathlib import Path
atoms = pandas.read_csv(open('atoms.csv'))
atoms
Out[44]:
symbol number
0 C 6
1 N 7
In [45]:
atoms.to_sql('atoms', session.bind, if_exists='append', index=False)
In [46]:
session.query(Element).all()[3].number
Out[46]:
7

However, we know from last term that another common choice is to represent such complicated data structures in YAML. The implications of what we've just learned for serialising to and from such structured data is the topic of the next lecture.