10.4 Markup Languages
Estimated time for this notebook: 10 minutes
XML and its relatives (including HTML) are based on the idea of marking up content with labels on its purpose:
<name>James</name> is a <job>Programmer</job>
We want to represent the chemical reactions: \(C_6H_{12}O_6 + 6O_2 \rightarrow 6CO_2 + 6H_2O\\ 2H_2 + O_2 \rightarrow 2H_2O\)
In XML, this might look like:
%%writefile system.xml
<?xml version="1.0" encoding="UTF-8"?>
<system>
    <reaction>
        <reactants>
            <molecule stoichiometry="1">
                <atom symbol="C" number="6"/>
                <atom symbol="H" number="12"/>
                <atom symbol="O" number="6"/>
            </molecule>
            <molecule stoichiometry="6">
                <atom symbol="O" number="2"/>
            </molecule>
        </reactants>
        <products>
            <molecule stoichiometry="6">
                <atom symbol="C" number="1"/>
                <atom symbol="O" number="2"/>
            </molecule>
            <molecule stoichiometry="6">
                <atom symbol="H" number="2"/>
                <atom symbol="O" number="1"/>
            </molecule>
        </products>
    </reaction>
    <reaction>
        <reactants>
            <molecule stoichiometry="2">
                <atom symbol="H" number="2"/>
            </molecule>
            <molecule stoichiometry="1">
                <atom symbol="O" number="2"/>
            </molecule>
        </reactants>
        <products>
            <molecule stoichiometry="2">
                <atom symbol="H" number="2"/>
                <atom symbol="O" number="1"/>
            </molecule>
        </products>
    </reaction>
</system>
Overwriting system.xml
Markup languages are verbose (jokingly called the “angle bracket tax”) but very clear.
Parsing XML
XML is normally parsed by building a tree structure of all the tags in the file, called a DOM (Document Object Model).
from lxml import etree
with open("system.xml", "r") as xmlfile:
    tree = etree.parse(xmlfile)
print(etree.tostring(tree, pretty_print=True, encoding=str))
<system>
    <reaction>
        <reactants>
            <molecule stoichiometry="1">
                <atom symbol="C" number="6"/>
                <atom symbol="H" number="12"/>
                <atom symbol="O" number="6"/>
            </molecule>
            <molecule stoichiometry="6">
                <atom symbol="O" number="2"/>
            </molecule>
        </reactants>
        <products>
            <molecule stoichiometry="6">
                <atom symbol="C" number="1"/>
                <atom symbol="O" number="2"/>
            </molecule>
            <molecule stoichiometry="6">
                <atom symbol="H" number="2"/>
                <atom symbol="O" number="1"/>
            </molecule>
        </products>
    </reaction>
    <reaction>
        <reactants>
            <molecule stoichiometry="2">
                <atom symbol="H" number="2"/>
            </molecule>
            <molecule stoichiometry="1">
                <atom symbol="O" number="2"/>
            </molecule>
        </reactants>
        <products>
            <molecule stoichiometry="2">
                <atom symbol="H" number="2"/>
                <atom symbol="O" number="1"/>
            </molecule>
        </products>
    </reaction>
</system>
We can navigate the tree, with each element being an iterable yielding its children:
tree.getroot()[0][0][1].attrib["stoichiometry"]
'6'
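As a minimal sketch of the same idea, the loop below walks one reaction by iterating over elements instead of chaining indices. The XML is a copy of the second reaction, inlined so the example runs on its own; the variable names (`side_tags`, `summary`) are illustrative and not part of the notebook.

```python
from lxml import etree

# A copy of the second reaction from system.xml, inlined so the example
# runs without the file on disk.
xml_text = """
<system>
    <reaction>
        <reactants>
            <molecule stoichiometry="2">
                <atom symbol="H" number="2"/>
            </molecule>
            <molecule stoichiometry="1">
                <atom symbol="O" number="2"/>
            </molecule>
        </reactants>
        <products>
            <molecule stoichiometry="2">
                <atom symbol="H" number="2"/>
                <atom symbol="O" number="1"/>
            </molecule>
        </products>
    </reaction>
</system>
"""

root = etree.fromstring(xml_text)

# Iterating over an element yields its child elements in document order.
reaction = root[0]
side_tags = [side.tag for side in reaction]

# Attribute values are always strings, so convert them where numbers are needed.
reactants = reaction[0]
summary = [
    (
        int(molecule.get("stoichiometry")),
        [(atom.get("symbol"), int(atom.get("number"))) for atom in molecule],
    )
    for molecule in reactants
]

print(side_tags)
print(summary)
```

Index chains such as `[0][0][1]` are compact but brittle: they break silently if the file gains an extra element, which is one reason to prefer the search tools described next.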
Searching XML
XPath is a sophisticated tool for searching XML DOMs. There’s a good explanation of how it works at https://www.w3schools.com/xml/xml_xpath.asp, and the basics are reproduced below.
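The expressions refer to the bookstore example document used on that page.
XPath Expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is a child of the bookstore element |
/bookstore/book[last()] | Selects the last book element that is a child of the bookstore element |
/bookstore/book[last()-1] | Selects the last but one book element that is a child of the bookstore element |
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element |
//title[@lang] | Selects all title elements that have an attribute named lang |
//title[@lang='en'] | Selects all title elements that have a lang attribute with the value "en" |
/bookstore/book[price>35.00] | Selects all book elements of the bookstore element that have a price element with a value greater than 35.00 |
/bookstore/book[price>35.00]/title | Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00 |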
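The predicates in the table above can be tried against a fragment of the reaction document. This is a small sketch rather than a complete tour: the XML is the reactants of the first reaction, inlined so the example is self-contained, and the variable names are illustrative.

```python
from lxml import etree

# Reactants of the first reaction from system.xml, inlined so the example
# runs on its own.
root = etree.fromstring("""
<system>
    <reaction>
        <reactants>
            <molecule stoichiometry="1">
                <atom symbol="C" number="6"/>
                <atom symbol="H" number="12"/>
                <atom symbol="O" number="6"/>
            </molecule>
            <molecule stoichiometry="6">
                <atom symbol="O" number="2"/>
            </molecule>
        </reactants>
    </reaction>
</system>
""")

reactants = "/system/reaction/reactants"

# [1] selects the first molecule (XPath positions start at 1, not 0)
first = root.xpath(reactants + "/molecule[1]/atom/@symbol")

# [last()] selects the last molecule
last = root.xpath(reactants + "/molecule[last()]/atom/@symbol")

# [position()<3] selects the first two atoms of the first molecule
first_two = root.xpath(reactants + "/molecule[1]/atom[position()<3]/@symbol")

# [@number>5] selects atoms whose number attribute, read as a number, exceeds 5
many = root.xpath("//atom[@number>5]/@symbol")

print(first, last, first_two, many)
```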
# For all molecules
# ... with a child atom whose number attribute is '1'
# ... return the symbol attribute of that child
tree.xpath("//molecule/atom[@number='1']/@symbol")
['C', 'O', 'O']
It is useful to understand grammars like these using the “FOR-LET-WHERE-ORDER-RETURN” (FLWOR, pronounced “flower”) model.
The above says: “For each atom in each molecule, where number is one, return symbol”, roughly equivalent to [atom.symbol for molecule in document for atom in molecule if atom.number == 1] in Python.
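To make the comparison concrete, here is a sketch that runs the XPath query and a hand-written comprehension over the same tree and checks that they agree. The XML is the products of the first reaction, inlined for self-containment; `via_xpath` and `via_python` are illustrative names.

```python
from lxml import etree

# Products of the first reaction from system.xml, inlined so the example
# runs on its own.
root = etree.fromstring("""
<system>
    <reaction>
        <products>
            <molecule stoichiometry="6">
                <atom symbol="C" number="1"/>
                <atom symbol="O" number="2"/>
            </molecule>
            <molecule stoichiometry="6">
                <atom symbol="H" number="2"/>
                <atom symbol="O" number="1"/>
            </molecule>
        </products>
    </reaction>
</system>
""")

# The declarative version: one XPath expression.
via_xpath = root.xpath("//molecule/atom[@number='1']/@symbol")

# The same query spelled out: FOR each molecule, FOR each atom child,
# WHERE number is '1', RETURN the symbol.
via_python = [
    atom.get("symbol")
    for molecule in root.iter("molecule")
    for atom in molecule.findall("atom")
    if atom.get("number") == "1"
]

print(via_xpath, via_python)
```

Note that the comparison is against the string `"1"`: attribute values are text until something converts them.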
Transforming XML: XSLT
Two technologies, XSLT and XQuery, can produce text output from an XML tree.
We’ll look at XSLT because support is more widespread, including in the Python library we’re using. XQuery is probably easier to use and understand, but is less widely supported.
However, XSLT is a beautiful functional declarative language, once you read past the angle brackets.
Here’s an XSLT to transform our reaction system into a LaTeX representation:
%%writefile xmltotex.xsl
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />

    <!-- Decompose reaction into "reactants \rightarrow products" -->
    <xsl:template match="//reaction">
        <xsl:apply-templates select="reactants"/>
        <xsl:text> \rightarrow </xsl:text>
        <xsl:apply-templates select="products"/>
        <xsl:text>\\
</xsl:text>
    </xsl:template>

    <!-- For a molecule anywhere except the first position write " + " and the number of molecules -->
    <xsl:template match="//molecule[position()!=1]">
        <xsl:text> + </xsl:text>
        <xsl:apply-templates select="@stoichiometry"/>
        <xsl:apply-templates/>
    </xsl:template>

    <!-- For a molecule in first position write the number of molecules -->
    <xsl:template match="//molecule[position()=1]">
        <xsl:apply-templates select="@stoichiometry"/>
        <xsl:apply-templates/>
    </xsl:template>

    <!-- If the stoichiometry is one then ignore it -->
    <xsl:template match="@stoichiometry[.='1']"/>
    <!-- Otherwise, use the default template for attributes, which is just to copy value -->

    <!-- Decompose atom into "symbol number" -->
    <xsl:template match="//atom">
        <xsl:value-of select="@symbol"/>
        <xsl:apply-templates select="@number"/>
    </xsl:template>

    <!-- If the number of atoms is one then ignore it -->
    <xsl:template match="@number[.=1]"/>

    <!-- ... otherwise, for a number below 10, replace it with "_ value" -->
    <xsl:template match="@number[.!=1][10>.]">
        <xsl:text>_</xsl:text>
        <xsl:value-of select="."/>
    </xsl:template>

    <!-- If a number is 10 or greater then wrap it in "{}" -->
    <xsl:template match="@number[.!=1][.>9]">
        <xsl:text>_{</xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>}</xsl:text>
    </xsl:template>

    <!-- Do not copy input whitespace to output -->
    <xsl:template match="text()" />
</xsl:stylesheet>
Overwriting xmltotex.xsl
with open("xmltotex.xsl") as xslfile:
    transform_xsl = xslfile.read()
transform = etree.XSLT(etree.XML(transform_xsl))
print(str(transform(tree)))
C_6H_{12}O_6 + 6O_2 \rightarrow 6CO_2 + 6H_2O\\
2H_2 + O_2 \rightarrow 2H_2O\\
This takes us back to the LaTeX representation of our reactions.
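The stylesheet above relies on two ideas that are easier to see in isolation: each template is a rule that fires for the nodes it matches, and an attribute with no matching rule falls through to the built-in rule, which copies its value. The following is a cut-down sketch under those assumptions, not the notebook's stylesheet; it uses `method="text"` and a single inlined molecule, and the " x " separator is invented purely for illustration.

```python
from lxml import etree

# A single molecule (water, with stoichiometry 2), inlined for self-containment.
xml_doc = etree.XML("""
<molecule stoichiometry="2">
    <atom symbol="H" number="2"/>
    <atom symbol="O" number="1"/>
</molecule>
""")

stylesheet = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <!-- A molecule writes its count, a separator, then whatever its atoms produce -->
    <xsl:template match="molecule">
        <xsl:value-of select="@stoichiometry"/>
        <xsl:text> x </xsl:text>
        <xsl:apply-templates select="atom"/>
    </xsl:template>

    <!-- An atom writes its symbol, then delegates to the rules for @number -->
    <xsl:template match="atom">
        <xsl:value-of select="@symbol"/>
        <xsl:apply-templates select="@number"/>
    </xsl:template>

    <!-- A number of one produces nothing; any other number falls through
         to the built-in attribute rule, which copies the value -->
    <xsl:template match="@number[.=1]"/>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)
result = str(transform(xml_doc)).strip()
print(result)
```

The more specific rule (`@number[.=1]`) wins over the built-in one, which is the same mechanism the full stylesheet uses to drop subscripts of 1.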
Validating XML: Schema
An XML Schema defines what a valid file of a given XML format looks like: which tags and attributes may appear where, and what types their values must have.
You should always define one when you design an XML file format.
%%writefile reactions.xsd
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="atom">
        <xs:complexType>
            <xs:attribute name="symbol" type="xs:string"/>
            <xs:attribute name="number" type="xs:integer"/>
        </xs:complexType>
    </xs:element>
    <xs:element name="molecule">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="atom" maxOccurs="unbounded"/>
            </xs:sequence>
            <xs:attribute name="stoichiometry" type="xs:integer"/>
        </xs:complexType>
    </xs:element>
    <xs:element name="reaction">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="reactants">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element ref="molecule" maxOccurs="unbounded"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
                <xs:element name="products">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element ref="molecule" maxOccurs="unbounded"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="system">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="reaction" maxOccurs="unbounded"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
Overwriting reactions.xsd
with open("reactions.xsd") as xsdfile:
    schema_xsd = xsdfile.read()
schema = etree.XMLSchema(etree.XML(schema_xsd))
parser = etree.XMLParser(schema=schema)
with open("system.xml") as xmlfile:
    tree = etree.parse(xmlfile, parser)
# For all atoms return their symbol attribute
tree.xpath("//atom/@symbol")
['C', 'H', 'O', 'O', 'C', 'O', 'H', 'O', 'H', 'O', 'H', 'O']
Compare parsing something that is not valid under the schema:
%%writefile invalid_system.xml
<system>
    <reaction>
        <reactants>
            <molecule stoichiometry="two">
                <atom symbol="H" number="2"/>
            </molecule>
            <molecule stoichiometry="1">
                <atom symbol="O" number="2"/>
            </molecule>
        </reactants>
        <products>
            <molecule stoichiometry="2">
                <atom symbol="H" number="2"/>
                <atom symbol="O" number="1"/>
            </molecule>
        </products>
    </reaction>
</system>
Overwriting invalid_system.xml
try:
    with open("invalid_system.xml") as xmlfile:
        tree = etree.parse(xmlfile, parser)
    # Not reached: the validating parser raises before we get here
    tree.xpath("//atom/@symbol")
except etree.XMLSyntaxError as e:
    print(e)
Element 'molecule', attribute 'stoichiometry': 'two' is not a valid value of the atomic type 'xs:integer'. (<string>, line 0)
This shows us that the validation has failed and why.
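Validation can also be run on a tree that has already been parsed, which avoids exceptions when you only want a yes/no answer. The sketch below assumes a deliberately tiny schema (a lone `atom` element, copied from the schema above) and uses lxml's `XMLSchema.validate` and `error_log`; the names `good`, `bad` and `messages` are illustrative.

```python
from lxml import etree

# A cut-down schema containing only the atom definition from reactions.xsd.
schema = etree.XMLSchema(etree.XML("""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="atom">
        <xs:complexType>
            <xs:attribute name="symbol" type="xs:string"/>
            <xs:attribute name="number" type="xs:integer"/>
        </xs:complexType>
    </xs:element>
</xs:schema>
"""))

# Parse without a validating parser, then validate as a separate step.
good = etree.XML('<atom symbol="H" number="2"/>')
bad = etree.XML('<atom symbol="H" number="two"/>')

# validate() returns True or False instead of raising.
good_ok = schema.validate(good)
bad_ok = schema.validate(bad)

# After a failed validation, error_log holds one entry per problem found.
messages = [entry.message for entry in schema.error_log]

print(good_ok, bad_ok)
for message in messages:
    print(message)
```

Separating parsing from validation is a design choice: the validating parser shown earlier fails fast, while this form lets a program collect and report every problem in a file.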