Skip to content


Networks of Transcript Semantics (netts) is a network algorithm that builds on state-of-the-art Natural Language Processing libraries to create speech networks that capture semantic content. Netts takes transcripts of spoken text as input (e.g. I see a man) and outputs a semantic speech network.

Semantic Speech Network: Network that represents the semantic content of speech transcripts. In these networks, nodes are entities (e.g. I, man). Edges are relations between nodes (e.g. see).

Netts can capture semantic links between nodes in speech content, even when semantically these nodes are separated by several sentences. The algorithm is robust against artefacts typical for spoken text. As described in CLI usage, netts can be used to process a single transcript or a folder of many transcripts. With about 40 seconds processing time per speech transcript, netts takes little time to process large batches of speech transcripts and therefore is ideal for the automated construction of speech networks from large datasets.

In the following sections the netts processing pipeline is described in detail. See figure above for an overview of the netts pipeline.


Netts first expands the most common English contractions (e.g. expanding I'm to I am). It then removes interjections (Mh, Uhm). Netts also removes any transcription notes (e.g. timestamps, [inaudible]) that were inserted by the transcriber. The user can pass a file of transcription notes that should be removed from the transcripts before processing. See Configuration for a step-by-step guide on passing custom transcription notes to netts for removal. Netts does not remove stop words or punctuation to stay as close to the original speech as possible.

Netts then uses CoreNLP to perform sentence splitting, tokenization, part of speech tagging, lemmatization, dependency parsing and co-referencing on the transcript. Netts uses the default language model implemented in CoreNLP.

We describe these Natural Language Processing steps briefly in the following. The transcript is first split into sentences (sentence splitting). It is then further split into meaningful entities, usually words (tokenization). Each word is assigned a part of speech label. The part of speech label indicates whether the word is a verb, noun, or another part of speech (part of speech tagging). Each word is also assigned their dictionary form or lemma (lemmatization). Next, the grammatical relationship between words is identified (dependency parsing). Finally, any occurrences where two or more expressions in the transcript refer to the same entity are identified (co-referencing). For example where a noun man and a pronoun he refer to the same person.

Finding nodes and edges

Netts submits each sentence to OpenIE5 for relation extraction. Openie5 extracts semantic relationships between entities from the sentence. For example, performing relation extraction on the sentence I see a man identifies the relation see between the entities I and a man. From these extracted relations, netts creates an initial list of the edges that will be present in the semantic speech network. In the edge list, the entities are the nodes and the relations are the edge labels.

Next, netts uses the part of speech tags and dependency structure to extract edges defined by adjectives or prepositions: For instance, a man on the picture contains a preposition edge where the entity a man and the picture are linked by an edge labelled on. An example of an adjective edge would be dark background. Here, dark and background are linked by an implicit is. These adjective edges and preposition edges are added to the edge list. During the next processing steps this edge list is further refined.

Refining nodes and edges

After creating the edge list, netts uses the co-referencing information to merge nodes that refer to the same entity. This is to take into account cases different words refer to the same entity. For example in the case where the pronoun he is used to refer to a man or in the case where the synonym the guy is used to refer to a man. Every entity mentioned in the text should be represented by a unique node in the semantic speech network. Therefore, nodes referring to the same entity are merged by replacing the node label in the edge list with the most representative node label (first mention of the entity that is a noun). In the example above, he and the guy would be replaced by a man. Node labels are then cleaned of superfluous words such as determiners. For example, a man would turn into man.

Constructing network

In the final step, netts constructs a semantic speech network from the edge list using networkx. The network is then plotted and saves the output. The output consists of the networkx object, the network image and the log messages from netts. The resulting network (a MultiDiGraph) is directed and unweighted, and can have parallel edges and self-loops. Parallel edges are two or more edges that link the same two nodes in the same direction. A self-loop is an edge that links a node with itself. See here for an example semantic speech network along with the corresponding speech transcript and stimulus picture.