2.2.4.2 Text data

We’ll often want to manipulate text data (strings) in Python. Many handy libraries can help us do this, some of which support quite complicated operations. Here, we’ll show some basic processing.

Inconsistencies

Strings have their own particular array of consistency issues, such as inconsistent capitalisation and extraneous whitespace.

Fortunately, Python gives us some handy built-in functionality for dealing with some of these issues.

We’ll note a few of these methods below.

str.upper() and str.lower()

The str.upper() and str.lower() methods take a given string and return a copy converted entirely to uppercase or lowercase. E.g.

print(f"upper: {'Foo'.upper()}")
print(f"lower: {'Foo'.lower()}")
upper: FOO
lower: foo

These methods can be useful for ensuring consistency when casing is not important in your data.

str.strip()

The str.strip() method (and its companions, str.lstrip() and str.rstrip()) returns a copy of a string with leading and/or trailing characters removed (whitespace by default). E.g.

stripped = " foo bar ".strip()
print(f"stripped: '{stripped}'")

lstripped = " foo bar ".lstrip()  # strip left
print(f"left stripped: '{lstripped}'")

rstripped = " foo bar ".rstrip()  # strip right
print(f"right stripped: '{rstripped}'")
stripped: 'foo bar'
left stripped: 'foo bar '
right stripped: ' foo bar'
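
Chained together, these methods give a simple way to normalise messy string values. The following is a minimal sketch, assuming a hypothetical list raw_labels of inconsistently typed labels (the variable names are ours, for illustration only):

```python
# Hypothetical labels with inconsistent casing and stray whitespace
raw_labels = ["  Apple", "apple ", "APPLE", " Pear  "]

# Strip surrounding whitespace, then lower-case each label
clean_labels = [label.strip().lower() for label in raw_labels]
print(clean_labels)
```

After cleaning, the three variants of "apple" all compare equal, which is usually what we want when casing and padding carry no meaning.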

Spelling is tricky

The methods we’ve talked about so far don’t address misspellings and typos (a common data input concern).

In a relatively simple scenario, with categorical data encoded as strings, you might be able to spot these by checking for all unique values in your data. E.g.

my_favourite_fruit_data = ["apple", "apple", "pear", "orange", "aple", "orange", "grapefruit"]
print(set(my_favourite_fruit_data))
{'aple', 'orange', 'apple', 'pear', 'grapefruit'}
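
Counting how often each value appears can make a typo stand out more than the set alone, since misspellings tend to be rare. The sketch below uses collections.Counter for this and, as one possible follow-up, difflib.get_close_matches from the standard library to suggest a correction. It assumes we already know the list of valid categories (valid_fruits is a hypothetical name introduced here):

```python
from collections import Counter
from difflib import get_close_matches

my_favourite_fruit_data = ["apple", "apple", "pear", "orange", "aple", "orange", "grapefruit"]

# Count occurrences of each unique value
counts = Counter(my_favourite_fruit_data)
print(counts)

# Assumption: the valid categories are known in advance
valid_fruits = ["apple", "pear", "orange", "grapefruit"]

# For each unknown value, look up the most similar valid category
suggested = {}
for value in counts:
    if value not in valid_fruits:
        suggested[value] = get_close_matches(value, valid_fruits, n=1)
print(suggested)
```

Suggestions like these are best reviewed by a person before being applied, as a close match is not always the intended value.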

Splitting

We’ll also commonly want to split a string based on a particular delimiter or separator. For example, we may wish to split a string of text into individual words, using any whitespace separator.

We can use the str.split(sep=None, maxsplit=-1) method.

s = "this is some text".split() # whitespace is the default 
print(s)

# however, be careful of punctuation
s2 = "this, another example, is some more text".split()
print(s2)
['this', 'is', 'some', 'text']
['this,', 'another', 'example,', 'is', 'some', 'more', 'text']
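
One simple way to deal with the stray punctuation is to strip it from each token after splitting. This is only a sketch: it uses string.punctuation from the standard library and removes punctuation at the start and end of each token, not inside it.

```python
import string

text = "this, another example, is some more text"

# Split on whitespace, then strip leading/trailing punctuation from each token
words = [token.strip(string.punctuation) for token in text.split()]
print(words)
```

Here the commas attached to "this" and "example" are removed, while a token such as "don't" would keep its internal apostrophe.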

We may also wish to split by separators other than whitespace.

s = "apple#banana#pear#peach".split("#")
print(s)
['apple', 'banana', 'pear', 'peach']
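
The maxsplit parameter in the signature above limits how many splits are made, which is handy when only the first field needs separating from the rest. A small illustration, reusing the same example string:

```python
# Split at most once: the remainder is left as a single string
first, rest = "apple#banana#pear#peach".split("#", maxsplit=1)
print(first)
print(rest)
```

With maxsplit=1 we get "apple" and the unsplit remainder "banana#pear#peach".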

With this in mind, we could also use str.split to handle CSV data. However, we’d need to be careful about commas inside quoted fields. It’s generally more convenient to use a library that already deals with this kind of thing, such as pandas!
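
To see why quoted commas are a problem, compare a naive split with a parser that understands quoting. The sketch below uses Python's standard csv module rather than pandas, simply to keep it dependency-free; the example line is made up for illustration.

```python
import csv
import io

# Hypothetical CSV line where one field contains a comma inside quotes
line = 'apple,"pear, conference",orange'

# Naive splitting breaks the quoted field into two pieces
naive = line.split(",")
print(naive)

# csv.reader respects the quotes and keeps the field intact
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```

The naive split yields four pieces, while csv.reader returns the three intended fields.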

Joining

A bit like str.split(sep=None, maxsplit=-1) in reverse, str.join(iterable) allows us to join an iterable of strings (such as a list) together with a given separator.

my_list = ["a", "list", "of", "words"]
# join with a space separator
s = " ".join(my_list)
print(s)
a list of words
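
One thing to watch for: str.join expects every item to be a string, and raises a TypeError otherwise. A minimal sketch of the usual workaround, converting each item first (the numbers list is a made-up example):

```python
numbers = [1, 2, 3]

# ", ".join(numbers) would raise a TypeError, so convert each item to str first
s = ", ".join(str(n) for n in numbers)
print(s)
```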

Regular expressions

Regular expressions (regexps, regex) are character sequences that specify a search pattern, usually for a find and/or replace task on text data.

Python’s regular expression module, re, provides functionality similar to that offered in Perl.

Regex can give us powerful string matching, beyond that of a simple exact string match. E.g.

import re

txt = "The rain in Spain falls mainly on the plains. So they say, anyway."
# find all words starting with upper case S or lower case t
print(re.findall(r"\b[St]\w+", txt))
['Spain', 'the', 'So', 'they']
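
For the replace side of find and replace, the re module offers re.sub. The sketch below shows two small uses: collapsing runs of whitespace, and replacing a whole word only. The example strings are ours.

```python
import re

# Collapse any run of whitespace (spaces, tabs) into a single space
messy = "The   rain\tin  Spain"
tidy = re.sub(r"\s+", " ", messy)
print(tidy)

# \b marks a word boundary, so only the standalone word "in" is replaced,
# not the "in" inside "rain", "Spain", "mainly" or "plains"
txt = "The rain in Spain falls mainly on the plains."
replaced = re.sub(r"\bin\b", "IN", txt)
print(replaced)
```

A plain str.replace("in", "IN") would have changed every occurrence, including those inside other words.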

NLP Preprocessing

In Natural Language Processing (NLP) tasks we often see some slightly more complicated preprocessing such as:

  • Stemming and Lemmatisation - reducing words to common base forms

  • Stop-word Removal - removing common words that carry little information

  • “Vectorization” - converting text to a meaningful numeric vector representation (e.g. term frequency encoding)

There are some commonly used libraries for the above tasks. We recommend NLTK and scikit-learn.
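
To give a feel for two of these steps, here is a deliberately simplified, pure-Python sketch of stop-word removal followed by term frequency encoding. It is not how NLTK or scikit-learn implement them: the stop-word list is a tiny hypothetical one, tokenisation is a plain whitespace split, and stemming/lemmatisation is skipped entirely.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer
STOP_WORDS = {"the", "on", "and"}

text = "the cat sat on the mat and the cat slept"

# Tokenise naively on whitespace and drop stop words
tokens = [t for t in text.lower().split() if t not in STOP_WORDS]

# Term frequency encoding: count each remaining term
term_frequencies = Counter(tokens)

# Fix a vocabulary order so the counts form a numeric vector
vocabulary = sorted(term_frequencies)
vector = [term_frequencies[term] for term in vocabulary]

print(vocabulary)
print(vector)
```

In practice the libraries mentioned above provide ready-made, more robust versions of each step.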

Pandas String Operations (Series.str)

Pandas provides vectorized string functions for Series through the .str accessor. Unless explicitly handled, missing values (NA) will stay as NA. See the pandas documentation for details. E.g.

import pandas as pd

s = pd.Series(["aaa", "aab", "aba"])
# replace "a" with "A"
s.str.replace("a", "A")
0    AAA
1    AAb
2    AbA
dtype: object
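
The .str methods can be chained, much like the built-in string methods, and missing values pass through untouched. A short sketch, assuming a made-up Series with one missing entry:

```python
import pandas as pd

# Hypothetical Series with messy strings and one missing value
s = pd.Series(["  Apple", "apple ", None])

# Strip whitespace, then lower-case; the missing value stays missing
cleaned = s.str.strip().str.lower()
print(cleaned)

# Confirm which entries are still NA
print(cleaned.isna())
```

The first two entries both become "apple", while the third remains NA and does not raise an error.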