2.2.4.3 Categorical Data¶

What is Categorical Data?¶

Categorical variables relate to data that is summarised in groups or has a limited set of possible values. Examples could be t-shirt size (small, medium, large, …), degree grade (first class, second class, …), or car manufacturer (BMW, Ford, Toyota, …).

In this section we will use the 2017 Patient Characteristics Survey (PCS) from New York state in the USA, which captures high-level information about the demographics and health of all people using a public mental health service in one week in 2017:

import pandas as pd
import numpy as np

df = pd.read_csv("data/pcs_2017.csv", na_values=["UNKNOWN", "NOT APPLICABLE"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 32 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 Program Category                  5000 non-null   object
 Region Served                     5000 non-null   object
 Age Group                         4999 non-null   object
 Sex                               4992 non-null   object
 Living Situation                  4764 non-null   object
 Household Composition             3886 non-null   object
 Preferred Language                4940 non-null   object
 Veteran Status                    4879 non-null   object
 Employment Status                 5000 non-null   object
 Number Of Hours Worked Each Week  789 non-null    object
Education Status                  4546 non-null   object
Special Education Services        871 non-null    object
Mental Illness                    4955 non-null   object
Intellectual Disability           4652 non-null   object
Autism Spectrum                   4688 non-null   object
Alcohol Related Disorder          4788 non-null   object
Drug Substance Disorder           4792 non-null   object
Mobility Impairment Disorder      4721 non-null   object
Alzheimer or Dementia             4430 non-null   object
Neurological Condition            4430 non-null   object
Traumatic Brain Injury            4430 non-null   object
Cancer                            4430 non-null   object
Smokes                            4774 non-null   object
Received Smoking Medication       4659 non-null   object
Received Smoking Counseling       4650 non-null   object
Serious Mental Illness            4962 non-null   object
Principal Diagnosis Class         4796 non-null   object
SSI Cash Assistance               4448 non-null   object
SSDI Cash Assistance              4451 non-null   object
Public Assistance Cash Program    4388 non-null   object
Other Cash Benefits               4432 non-null   object
Three Digit Residence Zip Code    5000 non-null   int64 
dtypes: int64(1), object(31)
memory usage: 1.2+ MB

This dataset encodes missing information with the string “UNKNOWN” and unapplicable questions with the string “NOT APPLICABLE”. We’ve asked Pandas to treat these as NaN (null) values by passing them to the na_values argument.

The original source data has over 175,000 patients and more than 60 columns. We’re using a smaller subset of the data here for teaching purposes.

All of the columns in this dataset, apart from the three-digit ZIP code, are categorical:

df.head()

	Program Category	Region Served	Age Group	Sex	Living Situation	Household Composition	Preferred Language	Veteran Status	Employment Status	Number Of Hours Worked Each Week	...	Smokes	Received Smoking Medication	Received Smoking Counseling	Serious Mental Illness	Principal Diagnosis Class	SSI Cash Assistance	SSDI Cash Assistance	Public Assistance Cash Program	Other Cash Benefits	Three Digit Residence Zip Code
0	INPATIENT	HUDSON RIVER REGION	ADULT	FEMALE	INSTITUTIONAL SETTING	NaN	ENGLISH	NO	NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ...	NaN	...	NO	NO	NO	YES	MENTAL ILLNESS	NO	NO	NO	YES	105
1	SUPPORT	WESTERN REGION	CHILD	MALE	PRIVATE RESIDENCE	COHABITATES WITH OTHERS	ENGLISH	NO	NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ...	NaN	...	NO	NO	NO	YES	MENTAL ILLNESS	YES	NO	NO	NO	138
2	OUTPATIENT	WESTERN REGION	CHILD	FEMALE	PRIVATE RESIDENCE	COHABITATES WITH OTHERS	ENGLISH	NO	NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ...	NaN	...	NO	NO	NO	YES	MENTAL ILLNESS	NaN	NaN	NaN	NaN	140
3	OUTPATIENT	NEW YORK CITY REGION	CHILD	FEMALE	PRIVATE RESIDENCE	COHABITATES WITH OTHERS	ENGLISH	NO	NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ...	NaN	...	NO	NO	NO	NO	NaN	NO	NO	NO	NO	113
4	OUTPATIENT	LONG ISLAND REGION	CHILD	FEMALE	PRIVATE RESIDENCE	COHABITATES WITH OTHERS	ENGLISH	NO	NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ...	NaN	...	NO	NO	NO	YES	MENTAL ILLNESS	NO	NO	NO	NO	115

5 rows × 32 columns

Numerical Encodings for Categorical Data¶

Typically, it’s not possible to input strings representing categories directly into a machine learning algorithm or other analysis, so they must be transformed into a numerical form first. This also applies if we have coded categories (such as ID numbers) in the data - blindly feeding these into an algorithm can have unexpected results. Below we describe common ways of encoding categorical data into numeric forms.

Here we are talking encoding data with a countable number of categories. For data without a set of pre-defined options, like free text fields, it’s more appropriate to use other feature engineering techniques (such as those from Natural Language Processing - see Section 2.2.4.2).

Binary Data¶

A lot of the columns in this dataset have only two possible values, such as “Yes/No” questions like whether a patient smokes:

df["Smokes"].value_counts()

NO     3387
YES    1387
Name: Smokes, dtype: int64

In this case it’s natural to encode “NO” as 0 and “YES” as 1:

yesno_encoding = {
    "YES": 1,
    "NO": 0,
}
df["Smokes"] = df["Smokes"].replace(yesno_encoding)

df["Smokes"].value_counts()

0.0    3387
1.0    1387
Name: Smokes, dtype: int64

It’s also possible to replace YES/NO values in multiple columns at a time, e.g., with df[["Smokes", "Cancer"]].replace(yesno_encoding), or in the whole data frame with df.replace(yesno_encoding).

This can be applied to any data with only two possible values, a coin flip dataset with a column side = ["Heads", "Tails", "Heads"] could be encoded as a single column heads = [1, 0, 1], for example.

Data with Multiple Categories¶

Ordinal Data¶

When the categories have a natural order (we could sort them), for example the number of hours worked per week in this dataset:

df["Number Of Hours Worked Each Week"].value_counts()

35 HOURS OR MORE            309
15-34 HOURS                 251
01-14 HOURS                 119
UNKNOWN EMPLOYMENT HOURS    110
Name: Number Of Hours Worked Each Week, dtype: int64

This is essentially binned data (see Section 2.2.3, feature engineering) - the source data is numeric, but we have it summarised in groups due to the design of the survey. It’s also common to see this for privacy/anonymisation reasons in public versions of sensitive datasets.

One approach here is to encode the data using the mid-point of each group:

hours_encoding = {
    "01-14 HOURS": 7.5,
    "15-34 HOURS": 24.5,
    "35 HOURS OR MORE": 35,
    "UNKNOWN EMPLOYMENT HOURS": np.nan,
}
df["Number Of Hours Worked Each Week"] = df["Number Of Hours Worked Each Week"].replace(
    hours_encoding
)

We could use techniques for manipulating strings, like those seen in the Section 2.2.4.2, to extract the digits and calculate the mid-points rather than doing it manually.

How could you decide what value to use for the “35 HOURS OR MORE” group? Is 35 likely to be a good choice?

Alternatively, we can have ordinal data that was not numeric to start with, such as the “Education Status” column in this dataset:

df["Education Status"].value_counts()

MIDDLE SCHOOL TO HIGH SCHOOL    2585
SOME COLLEGE                     737
COLLEGE OR GRADUATE DEGREE       699
PRE-K TO FIFTH GRADE             433
OTHER                             68
NO FORMAL EDUCATION               24
Name: Education Status, dtype: int64

Here a sensible choice could be to encode the values by a ranking of how long the person was in education:

edu_encoding = {
    "NO FORMAL EDUCATION": 0,
    "PRE-K TO FIFTH GRADE": 1,
    "MIDDLE SCHOOL TO HIGH SCHOOL": 2,
    "SOME COLLEGE": 3,
    "COLLEGE OR GRADUATE DEGREE": 4,
    "OTHER": np.nan,
}
df["Education Status"] = df["Education Status"].replace(edu_encoding)

One-Hot Encoding¶

What if the categories don’t have a clear order? One example is “Region Served” in the patient characteristics survey:

df["Region Served"].value_counts()

NEW YORK CITY REGION    2486
WESTERN REGION           847
HUDSON RIVER REGION      775
CENTRAL NY REGION        534
LONG ISLAND REGION       358
Name: Region Served, dtype: int64

If we assigned each region a number from 1-5, an algorithm could try to infer relationships that don’t exist. For example, we may encode WESTERN_REGION = 2 and CENTRAL NY REGION = 4, but this means we’re representing Central NY as “double” Western NY - double what?

Instead, a common choice is to “one-hot” encode the column, converting it into a binary column for each category. This is also referred to as creating “dummy variables”, and can be done with the pd.get_dummies function:

pd.get_dummies(df["Region Served"])

	CENTRAL NY REGION	HUDSON RIVER REGION	LONG ISLAND REGION	NEW YORK CITY REGION	WESTERN REGION
0	0	1	0	0	0
1	0	0	0	0	1
2	0	0	0	0	1
3	0	0	0	1	0
4	0	0	1	0	0
...	...	...	...	...	...
4995	0	0	0	0	1
4996	0	0	0	0	1
4997	1	0	0	0	0
4998	1	0	0	0	0
4999	0	0	1	0	0

5000 rows × 5 columns

The first patient was from the Hudson river region, so has one in the “HUDSON RIVER REGION” column and zero in the remaining columns, and so on. Treating the data this way allows a model to separately fit parameters explaining the relationship between each region and whatever we’re trying to predict.

Technically we only need four of the five columns above to represent the possible region values. If we didn’t include the “WESTERN REGION” column above, for example, we could still tell whether someone had care in that region by all the remaining columns being zero. Setting drop_first=True in pd.get_dummies will return one fewer column.

In some cases, creating many new columns with one-hot encoding could be problematic. This blog post gives an overview of when this might be a problem and possible solutions.

Grouping Categories¶

Rather than creating many new columns for all the possible categories, it may be appropriate to summarise them in sub-groups. For example, these are the categories in the “Preferred Language” column:

df["Preferred Language"].value_counts()

ENGLISH                     4459
SPANISH                      359
INDO-EUROPEAN                 55
ASIAN AND PACIFIC ISLAND      27
ALL OTHER LANGUAGES           26
AFRO-ASIATIC                  14
Name: Preferred Language, dtype: int64

If our research question related to finding differences between the care given to English and non-English speakers, we could encode the data as a single binary column (but you should consider this choice carefully, as we would not be able to identify differences between the other languages with this encoding):

prefer_english = (df["Preferred Language"] == "ENGLISH").astype(int)
prefer_english.value_counts()

1    4459
0     541
Name: Preferred Language, dtype: int64

The use of astype(int) above converts the created boolean series of True/False values to a series of ones (True) and zeros (False), which may be more suitable for input to a model.

Any value that is not “ENGLISH”, including missing values, will be encoded as 0 above, i.e., 0 means “unknown or not English”. It may be better to restore the missing values to avoid mislabelling unknown English speakers:

prefer_english[df["Preferred Language"].isnull()] = np.nan

Which Encoding to Choose?¶

The encodings described above are the most common, but there are others and there’s also no fixed rule that you must you a certain type of encoding for a certain type of data. There may be applications where one-hot encoding an ordinal column would be appropriate, for example.

Other Encoding Implementations¶

In the examples above we mostly convert the categorical values by defining Python dictionaries with the encodings and using the Pandas replace function. This works but there are other functions and libraries which can help do the job (and may be more maintainable and scalable depending on your use-case):

Pandas has a categorical data type, see here.
The LabelEncoder, OrdinalEncoder and OneHotEncoder classes in the scikit-learn library.
Some machine learning algorithms, such as CatBoost, have in-built support for categorical data.

Unseen Values¶

If you’re developing a model with categorical data, it’s important to consider what to do, or how the model will behave, with data from new groups that are not present in the original dataset. Say you plan to use a model developed on the 2017 PCS data on new data from 2019, but a new region in New York has been added. How will patients from that region be treated by the model?

This is essentially a different form of missing data (in this case data the model doesn’t know how to interpret), which we discuss later.

Data Dictionaries¶

We’ve talked about converting a dataset of strings into numerical representations that can be incorporated into a model or other further analysis. It’s also common to be given a dataset that has coded values (numbers or other identifiers) to start with. For example, what if we started with data where Education Status was encoded as values 1 to 5. How would we know what 3 means?

Datasets should come with “data dictionaries” that explain what these values mean. You could change the encoded values to their meanings by defining them in a Python dictionary and using replace in the same way as above. Or you may need to link to another table or dataset - see linking datasets, Section 2.2.6.

Research Data Science

2.2.4.3 Categorical Data

Contents