2.2.4.3 Categorical Data

What is Categorical Data?

Categorical variables relate to data that is summarised in groups or has a limited set of possible values. Examples could be t-shirt size (small, medium, large, …), degree grade (first class, second class, …), or car manufacturer (BMW, Ford, Toyota, …).

In this section we will use the 2017 Patient Characteristics Survey (PCS) from New York state in the USA, which captures high-level information about the demographics and health of all people using a public mental health service in one week in 2017:

import pandas as pd
import numpy as np

df = pd.read_csv("data/pcs_2017.csv", na_values=["UNKNOWN", "NOT APPLICABLE"])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 32 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   Program Category                  5000 non-null   object
 1   Region Served                     5000 non-null   object
 2   Age Group                         4999 non-null   object
 3   Sex                               4992 non-null   object
 4   Living Situation                  4764 non-null   object
 5   Household Composition             3886 non-null   object
 6   Preferred Language                4940 non-null   object
 7   Veteran Status                    4879 non-null   object
 8   Employment Status                 5000 non-null   object
 9   Number Of Hours Worked Each Week  789 non-null    object
 10  Education Status                  4546 non-null   object
 11  Special Education Services        871 non-null    object
 12  Mental Illness                    4955 non-null   object
 13  Intellectual Disability           4652 non-null   object
 14  Autism Spectrum                   4688 non-null   object
 15  Alcohol Related Disorder          4788 non-null   object
 16  Drug Substance Disorder           4792 non-null   object
 17  Mobility Impairment Disorder      4721 non-null   object
 18  Alzheimer or Dementia             4430 non-null   object
 19  Neurological Condition            4430 non-null   object
 20  Traumatic Brain Injury            4430 non-null   object
 21  Cancer                            4430 non-null   object
 22  Smokes                            4774 non-null   object
 23  Received Smoking Medication       4659 non-null   object
 24  Received Smoking Counseling       4650 non-null   object
 25  Serious Mental Illness            4962 non-null   object
 26  Principal Diagnosis Class         4796 non-null   object
 27  SSI Cash Assistance               4448 non-null   object
 28  SSDI Cash Assistance              4451 non-null   object
 29  Public Assistance Cash Program    4388 non-null   object
 30  Other Cash Benefits               4432 non-null   object
 31  Three Digit Residence Zip Code    5000 non-null   int64 
dtypes: int64(1), object(31)
memory usage: 1.2+ MB

This dataset encodes missing information with the string “UNKNOWN” and unapplicable questions with the string “NOT APPLICABLE”. We’ve asked Pandas to treat these as NaN (null) values by passing them to the na_values argument.

The original source data has over 175,000 patients and more than 60 columns. We’re using a smaller subset of the data here for teaching purposes.

All of the columns in this dataset, apart from the three-digit ZIP code, are categorical:

df.head()
Program Category Region Served Age Group Sex Living Situation Household Composition Preferred Language Veteran Status Employment Status Number Of Hours Worked Each Week ... Smokes Received Smoking Medication Received Smoking Counseling Serious Mental Illness Principal Diagnosis Class SSI Cash Assistance SSDI Cash Assistance Public Assistance Cash Program Other Cash Benefits Three Digit Residence Zip Code
0 INPATIENT HUDSON RIVER REGION ADULT FEMALE INSTITUTIONAL SETTING NaN ENGLISH NO NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... NaN ... NO NO NO YES MENTAL ILLNESS NO NO NO YES 105
1 SUPPORT WESTERN REGION CHILD MALE PRIVATE RESIDENCE COHABITATES WITH OTHERS ENGLISH NO NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... NaN ... NO NO NO YES MENTAL ILLNESS YES NO NO NO 138
2 OUTPATIENT WESTERN REGION CHILD FEMALE PRIVATE RESIDENCE COHABITATES WITH OTHERS ENGLISH NO NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... NaN ... NO NO NO YES MENTAL ILLNESS NaN NaN NaN NaN 140
3 OUTPATIENT NEW YORK CITY REGION CHILD FEMALE PRIVATE RESIDENCE COHABITATES WITH OTHERS ENGLISH NO NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... NaN ... NO NO NO NO NaN NO NO NO NO 113
4 OUTPATIENT LONG ISLAND REGION CHILD FEMALE PRIVATE RESIDENCE COHABITATES WITH OTHERS ENGLISH NO NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... NaN ... NO NO NO YES MENTAL ILLNESS NO NO NO NO 115

5 rows × 32 columns

Numerical Encodings for Categorical Data

Typically, it’s not possible to input strings representing categories directly into a machine learning algorithm or other analysis, so they must be transformed into a numerical form first. This also applies if we have coded categories (such as ID numbers) in the data - blindly feeding these into an algorithm can have unexpected results. Below we describe common ways of encoding categorical data into numeric forms.

Here we are talking encoding data with a countable number of categories. For data without a set of pre-defined options, like free text fields, it’s more appropriate to use other feature engineering techniques (such as those from Natural Language Processing - see Section 2.2.4.2).

Binary Data

A lot of the columns in this dataset have only two possible values, such as “Yes/No” questions like whether a patient smokes:

df["Smokes"].value_counts()
NO     3387
YES    1387
Name: Smokes, dtype: int64

In this case it’s natural to encode “NO” as 0 and “YES” as 1:

yesno_encoding = {
    "YES": 1,
    "NO": 0,
}
df["Smokes"] = df["Smokes"].replace(yesno_encoding)

df["Smokes"].value_counts()
0.0    3387
1.0    1387
Name: Smokes, dtype: int64

It’s also possible to replace YES/NO values in multiple columns at a time, e.g., with df[["Smokes",  "Cancer"]].replace(yesno_encoding), or in the whole data frame with df.replace(yesno_encoding).

This can be applied to any data with only two possible values, a coin flip dataset with a column side = ["Heads", "Tails", "Heads"] could be encoded as a single column heads = [1, 0, 1], for example.

Data with Multiple Categories

Ordinal Data

When the categories have a natural order (we could sort them), for example the number of hours worked per week in this dataset:

df["Number Of Hours Worked Each Week"].value_counts()
35 HOURS OR MORE            309
15-34 HOURS                 251
01-14 HOURS                 119
UNKNOWN EMPLOYMENT HOURS    110
Name: Number Of Hours Worked Each Week, dtype: int64

This is essentially binned data (see Section 2.2.3, feature engineering) - the source data is numeric, but we have it summarised in groups due to the design of the survey. It’s also common to see this for privacy/anonymisation reasons in public versions of sensitive datasets.

One approach here is to encode the data using the mid-point of each group:

hours_encoding = {
    "01-14 HOURS": 7.5,
    "15-34 HOURS": 24.5,
    "35 HOURS OR MORE": 35,
    "UNKNOWN EMPLOYMENT HOURS": np.nan,
}
df["Number Of Hours Worked Each Week"] = df["Number Of Hours Worked Each Week"].replace(
    hours_encoding
)

We could use techniques for manipulating strings, like those seen in the Section 2.2.4.2, to extract the digits and calculate the mid-points rather than doing it manually.

  • How could you decide what value to use for the “35 HOURS OR MORE” group? Is 35 likely to be a good choice?

Alternatively, we can have ordinal data that was not numeric to start with, such as the “Education Status” column in this dataset:

df["Education Status"].value_counts()
MIDDLE SCHOOL TO HIGH SCHOOL    2585
SOME COLLEGE                     737
COLLEGE OR GRADUATE DEGREE       699
PRE-K TO FIFTH GRADE             433
OTHER                             68
NO FORMAL EDUCATION               24
Name: Education Status, dtype: int64

Here a sensible choice could be to encode the values by a ranking of how long the person was in education:

edu_encoding = {
    "NO FORMAL EDUCATION": 0,
    "PRE-K TO FIFTH GRADE": 1,
    "MIDDLE SCHOOL TO HIGH SCHOOL": 2,
    "SOME COLLEGE": 3,
    "COLLEGE OR GRADUATE DEGREE": 4,
    "OTHER": np.nan,
}
df["Education Status"] = df["Education Status"].replace(edu_encoding)

One-Hot Encoding

What if the categories don’t have a clear order? One example is “Region Served” in the patient characteristics survey:

df["Region Served"].value_counts()
NEW YORK CITY REGION    2486
WESTERN REGION           847
HUDSON RIVER REGION      775
CENTRAL NY REGION        534
LONG ISLAND REGION       358
Name: Region Served, dtype: int64

If we assigned each region a number from 1-5, an algorithm could try to infer relationships that don’t exist. For example, we may encode WESTERN_REGION = 2 and CENTRAL NY REGION = 4, but this means we’re representing Central NY as “double” Western NY - double what?

Instead, a common choice is to “one-hot” encode the column, converting it into a binary column for each category. This is also referred to as creating “dummy variables”, and can be done with the pd.get_dummies function:

pd.get_dummies(df["Region Served"])
CENTRAL NY REGION HUDSON RIVER REGION LONG ISLAND REGION NEW YORK CITY REGION WESTERN REGION
0 0 1 0 0 0
1 0 0 0 0 1
2 0 0 0 0 1
3 0 0 0 1 0
4 0 0 1 0 0
... ... ... ... ... ...
4995 0 0 0 0 1
4996 0 0 0 0 1
4997 1 0 0 0 0
4998 1 0 0 0 0
4999 0 0 1 0 0

5000 rows × 5 columns

The first patient was from the Hudson river region, so has one in the “HUDSON RIVER REGION” column and zero in the remaining columns, and so on. Treating the data this way allows a model to separately fit parameters explaining the relationship between each region and whatever we’re trying to predict.

Technically we only need four of the five columns above to represent the possible region values. If we didn’t include the “WESTERN REGION” column above, for example, we could still tell whether someone had care in that region by all the remaining columns being zero. Setting drop_first=True in pd.get_dummies will return one fewer column.

In some cases, creating many new columns with one-hot encoding could be problematic. This blog post gives an overview of when this might be a problem and possible solutions.

Grouping Categories

Rather than creating many new columns for all the possible categories, it may be appropriate to summarise them in sub-groups. For example, these are the categories in the “Preferred Language” column:

df["Preferred Language"].value_counts()
ENGLISH                     4459
SPANISH                      359
INDO-EUROPEAN                 55
ASIAN AND PACIFIC ISLAND      27
ALL OTHER LANGUAGES           26
AFRO-ASIATIC                  14
Name: Preferred Language, dtype: int64

If our research question related to finding differences between the care given to English and non-English speakers, we could encode the data as a single binary column (but you should consider this choice carefully, as we would not be able to identify differences between the other languages with this encoding):

prefer_english = (df["Preferred Language"] == "ENGLISH").astype(int)
prefer_english.value_counts()
1    4459
0     541
Name: Preferred Language, dtype: int64

The use of astype(int) above converts the created boolean series of True/False values to a series of ones (True) and zeros (False), which may be more suitable for input to a model.

Any value that is not “ENGLISH”, including missing values, will be encoded as 0 above, i.e., 0 means “unknown or not English”. It may be better to restore the missing values to avoid mislabelling unknown English speakers:

prefer_english[df["Preferred Language"].isnull()] = np.nan

Which Encoding to Choose?

The encodings described above are the most common, but there are others and there’s also no fixed rule that you must you a certain type of encoding for a certain type of data. There may be applications where one-hot encoding an ordinal column would be appropriate, for example.

Other Encoding Implementations

In the examples above we mostly convert the categorical values by defining Python dictionaries with the encodings and using the Pandas replace function. This works but there are other functions and libraries which can help do the job (and may be more maintainable and scalable depending on your use-case):

Unseen Values

If you’re developing a model with categorical data, it’s important to consider what to do, or how the model will behave, with data from new groups that are not present in the original dataset. Say you plan to use a model developed on the 2017 PCS data on new data from 2019, but a new region in New York has been added. How will patients from that region be treated by the model?

This is essentially a different form of missing data (in this case data the model doesn’t know how to interpret), which we discuss later.

Data Dictionaries

We’ve talked about converting a dataset of strings into numerical representations that can be incorporated into a model or other further analysis. It’s also common to be given a dataset that has coded values (numbers or other identifiers) to start with. For example, what if we started with data where Education Status was encoded as values 1 to 5. How would we know what 3 means?

Datasets should come with “data dictionaries” that explain what these values mean. You could change the encoded values to their meanings by defining them in a Python dictionary and using replace in the same way as above. Or you may need to link to another table or dataset - see linking datasets, Section 2.2.6.