10.1 Using databases#

Estimated time for this notebook: 20 minutes

The principles of database normalisation and the relational model will be helpful when thinking about any kind of data representation. This is true for dataframes in Pandas, tensors in tensorflow, or anything else.

However, proper databases use more sophisticated representations than just csv files including:

  • indices to enable hash-table like efficient lookups

  • support for managing multiple users at the same time.

Making a database - SQLite#

Let’s look at how we would use a simple database in Python to represent atoms and molecules. If you’ve never seen SQL before, you might want to attend an introductory course, such as one of the ‘Software Carpentry’ sessions. Here we’re going to assume some existing knowledge but we will use a Python-style way to interact with databases instead of relying on raw SQL.

import os

try:
    os.remove("molecules.db")
    print("Removing database to start again from scratch")
except FileNotFoundError:
    print("No DB since this notebook was last run")
Removing database to start again from scratch
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///molecules.db", echo=True)

SQLite is a simple very-lightweight database tool - without support for concurrent users - but it’s great for little hacks like this. For full-on database work you’ll probably want to use a more fully-featured database like https://www.postgresql.org.

The metadata for the database describing the tables present, and their columns, is defined in Python using SQLAlchemy, the leading python database tool, thus:

from sqlalchemy import Column, Float, Integer, MetaData, String, Table

metadata = MetaData()
molecules = Table(
    "molecules",
    metadata,
    Column("name", String, primary_key=True),
    Column("mass", Float),
)

atoms = Table(
    "atoms",
    metadata,
    Column("symbol", String, primary_key=True),
    Column("number", Integer),
)
from sqlalchemy import ForeignKey, Integer

atoms_in_molecules = Table(
    "atoms_molecules",
    metadata,
    Column("atom", ForeignKey("atoms.symbol")),
    Column("molecule", ForeignKey("molecules.name")),
    Column("number", Integer),
)
metadata.create_all(engine)
print(metadata)
2024-06-03 08:25:10,190 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-06-03 08:25:10,191 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("molecules")
2024-06-03 08:25:10,192 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-06-03 08:25:10,193 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("molecules")
2024-06-03 08:25:10,194 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-06-03 08:25:10,194 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("atoms")
2024-06-03 08:25:10,195 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-06-03 08:25:10,196 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("atoms")
2024-06-03 08:25:10,196 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-06-03 08:25:10,197 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("atoms_molecules")
2024-06-03 08:25:10,198 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-06-03 08:25:10,199 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("atoms_molecules")
2024-06-03 08:25:10,199 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-06-03 08:25:10,201 INFO sqlalchemy.engine.Engine 
CREATE TABLE molecules (
	name VARCHAR NOT NULL, 
	mass FLOAT, 
	PRIMARY KEY (name)
)
2024-06-03 08:25:10,201 INFO sqlalchemy.engine.Engine [no key 0.00047s] ()
2024-06-03 08:25:10,204 INFO sqlalchemy.engine.Engine 
CREATE TABLE atoms (
	symbol VARCHAR NOT NULL, 
	number INTEGER, 
	PRIMARY KEY (symbol)
)
2024-06-03 08:25:10,205 INFO sqlalchemy.engine.Engine [no key 0.00049s] ()
2024-06-03 08:25:10,207 INFO sqlalchemy.engine.Engine 
CREATE TABLE atoms_molecules (
	atom VARCHAR, 
	molecule VARCHAR, 
	number INTEGER, 
	FOREIGN KEY(atom) REFERENCES atoms (symbol), 
	FOREIGN KEY(molecule) REFERENCES molecules (name)
)
2024-06-03 08:25:10,207 INFO sqlalchemy.engine.Engine [no key 0.00046s] ()
2024-06-03 08:25:10,210 INFO sqlalchemy.engine.Engine COMMIT
MetaData()

Note the SQL syntax for creating tables is generated by the python tool, and sent to the database server.

CREATE TABLE molecules (
	name VARCHAR NOT NULL, 
	mass FLOAT, 
	PRIMARY KEY (name)
)

We’ll turn off our automatic printing of all the raw sql to avoid this notebook being unreadable.

engine.echo = False

We can also write data to our database using this python tooling:

ins = molecules.insert().values(name="water", mass="18.01")
conn = engine.connect()
conn.execute(ins)
/tmp/ipykernel_16808/3952573975.py:2: RemovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings.  Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
  conn.execute(ins)
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fb73416d3d0>

And query it:

from sqlalchemy.sql import select

s = select([molecules])
result = conn.execute(s)
print(result.fetchone()["mass"])
18.01

If we have enough understanding of SQL syntax, we can use appropriate join statements to find, for example, the mass of all molecules which contain oxygen:

conn.execute(molecules.insert().values(name="oxygen", mass="32.00"))
conn.execute(atoms.insert().values(symbol="O", number=8))
conn.execute(atoms.insert().values(symbol="H", number=1))
conn.execute(atoms_in_molecules.insert().values(molecule="water", atom="O", number=1))
conn.execute(atoms_in_molecules.insert().values(molecule="oxygen", atom="O", number=1))
conn.execute(atoms_in_molecules.insert().values(molecule="water", atom="H", number=2))
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fb72c6e1d60>
result = conn.execute(
    """
    SELECT mass
    FROM   molecules
          JOIN atoms_molecules
            ON molecules.NAME = atoms_molecules.molecule
          JOIN atoms
            ON atoms.symbol = atoms_molecules.atom
    WHERE  atoms.symbol = 'H'
    """
)
print(result.fetchall())
[(18.01,)]

But we can do much better…

Data and Objects - the Object-Relational-Mapping#

We notice that when we find a correct relational model for our data, many of the rows are suggestive of exactly the data we would expect to supply to an object constructor - data about an object. References to keys of other tables in rows suggest composition relations while many-to-many join tables often represent aggregation relationships, and data about the relationship.

As a result of this, powerful tools exist to automatically create object structures from database schema, including saving and loading.

import os

try:
    os.remove("molecules.db")
    print("Removing database to start again from scratch")
except FileNotFoundError:
    print("No DB since this notebook was last run")
Removing database to start again from scratch
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///molecules.db")
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


class Element(Base):
    __tablename__ = "atoms"
    symbol = Column(String, primary_key=True)
    number = Column(Integer)
    molecules = relationship("AtomsPerMolecule", backref="atom")
class Molecule(Base):
    __tablename__ = "molecules"
    name = Column(String, primary_key=True)
    mass = Column(Float)
    atoms = relationship("AtomsPerMolecule", backref="molecule")
class AtomsPerMolecule(Base):
    __tablename__ = "atoms_per_molecule"
    id = Column(Integer, primary_key=True)
    atom_id = Column(None, ForeignKey("atoms.symbol"))
    molecule_id = Column(None, ForeignKey("molecules.name"))
    number = Column(Integer)

If we now create our tables, the system will automatically create a DB:

Base.metadata.create_all(engine)
engine.echo = False

And we can create objects with a simple interface that looks just like ordinary classes:

oxygen = Element(symbol="O", number=8)
hydrogen = Element(symbol="H", number=1)
elements = [oxygen, hydrogen]
water = Molecule(name="water", mass=18.01)
oxygen_m = Molecule(name="oxygen", mass=16.00)
hydrogen_m = Molecule(name="hydrogen", mass=2.02)
molecules = [water, oxygen_m, hydrogen_m]
# Note that we are using the `backref` name to construct the `atom_id` and `molecule_id`.
# These lookup instances of Element and Molecule that are already in our database
amounts = [
    AtomsPerMolecule(atom=oxygen, molecule=water, number=1),
    AtomsPerMolecule(atom=hydrogen, molecule=water, number=2),
    AtomsPerMolecule(atom=oxygen, molecule=oxygen_m, number=2),
    AtomsPerMolecule(atom=hydrogen, molecule=hydrogen_m, number=2),
]
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()
session.bulk_save_objects(elements + molecules + amounts)
oxygen.molecules[0].molecule.name
'water'
session.query(Molecule).all()[0].name
'water'
session.commit()

This is a very powerful technique - we get our class-type interface in python, with database persistence and searchability for free!

Moving on from databases#

Databases are often a good choice for storing data, but can only be interacted with programmatically. Often, we want to make a file format to represent our dataset which can be easily replicated or shared. The next part of this module focuses on the design of such file-formats, both binary and human-readable.

One choice, now we know about it, is to serialise all the database tables as CSV:

import pandas
str(session.query(Molecule).statement)
'SELECT molecules.name, molecules.mass \nFROM molecules'
dataframe = pandas.read_sql(session.query(Molecule).statement, session.bind)
dataframe
name mass
0 water 18.01
1 oxygen 16.00
2 hydrogen 2.02
print(dataframe.to_csv())
,name,mass
0,water,18.01
1,oxygen,16.0
2,hydrogen,2.02

Deserialising is also easy:

%%writefile atoms.csv

symbol,number
C,6
N,7
Overwriting atoms.csv
with open("atoms.csv", "r") as f_csv:
    atoms = pandas.read_csv(f_csv)
atoms
symbol number
0 C 6
1 N 7
atoms.to_sql("atoms", session.bind, if_exists="append", index=False)
2
session.query(Element).all()[3].number
7

We know from earlier that another common choice is to represent such complicated data structures is YAML.

The implications of what we’ve just learned for serialising to and from such structured data is the topic of the next lecture.