2.2.4.3 Categorical Data
What is Categorical Data?
Categorical variables describe data that is summarised in groups or takes a limited set of possible values. Examples include t-shirt size (small, medium, large, …), degree grade (first class, second class, …), and car manufacturer (BMW, Ford, Toyota, …).
In this section we will use the 2017 Patient Characteristics Survey (PCS) from New York state in the USA, which captures high-level information about the demographics and health of all people using a public mental health service in one week in 2017:
import pandas as pd
import numpy as np
df = pd.read_csv("data/pcs_2017.csv", na_values=["UNKNOWN", "NOT APPLICABLE"])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Program Category 5000 non-null object
1 Region Served 5000 non-null object
2 Age Group 4999 non-null object
3 Sex 4992 non-null object
4 Living Situation 4764 non-null object
5 Household Composition 3886 non-null object
6 Preferred Language 4940 non-null object
7 Veteran Status 4879 non-null object
8 Employment Status 5000 non-null object
9 Number Of Hours Worked Each Week 789 non-null object
10 Education Status 4546 non-null object
11 Special Education Services 871 non-null object
12 Mental Illness 4955 non-null object
13 Intellectual Disability 4652 non-null object
14 Autism Spectrum 4688 non-null object
15 Alcohol Related Disorder 4788 non-null object
16 Drug Substance Disorder 4792 non-null object
17 Mobility Impairment Disorder 4721 non-null object
18 Alzheimer or Dementia 4430 non-null object
19 Neurological Condition 4430 non-null object
20 Traumatic Brain Injury 4430 non-null object
21 Cancer 4430 non-null object
22 Smokes 4774 non-null object
23 Received Smoking Medication 4659 non-null object
24 Received Smoking Counseling 4650 non-null object
25 Serious Mental Illness 4962 non-null object
26 Principal Diagnosis Class 4796 non-null object
27 SSI Cash Assistance 4448 non-null object
28 SSDI Cash Assistance 4451 non-null object
29 Public Assistance Cash Program 4388 non-null object
30 Other Cash Benefits 4432 non-null object
31 Three Digit Residence Zip Code 5000 non-null int64
dtypes: int64(1), object(31)
memory usage: 1.2+ MB
This dataset encodes missing information with the string “UNKNOWN” and inapplicable questions with the string “NOT APPLICABLE”. We’ve asked Pandas to treat these as NaN (null) values by passing them to the na_values argument.
The original source data has over 175,000 patients and more than 60 columns. We’re using a smaller subset of the data here for teaching purposes.
All of the columns in this dataset, apart from the three-digit ZIP code, are categorical:
df.head()
| | Program Category | Region Served | Age Group | Sex | Living Situation | Household Composition | Preferred Language | Veteran Status | Employment Status | Number Of Hours Worked Each Week | ... | Smokes | Received Smoking Medication | Received Smoking Counseling | Serious Mental Illness | Principal Diagnosis Class | SSI Cash Assistance | SSDI Cash Assistance | Public Assistance Cash Program | Other Cash Benefits | Three Digit Residence Zip Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INPATIENT | HUDSON RIVER REGION | ADULT | FEMALE | INSTITUTIONAL SETTING | NaN | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | YES | MENTAL ILLNESS | NO | NO | NO | YES | 105 |
| 1 | SUPPORT | WESTERN REGION | CHILD | MALE | PRIVATE RESIDENCE | COHABITATES WITH OTHERS | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | YES | MENTAL ILLNESS | YES | NO | NO | NO | 138 |
| 2 | OUTPATIENT | WESTERN REGION | CHILD | FEMALE | PRIVATE RESIDENCE | COHABITATES WITH OTHERS | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | YES | MENTAL ILLNESS | NaN | NaN | NaN | NaN | 140 |
| 3 | OUTPATIENT | NEW YORK CITY REGION | CHILD | FEMALE | PRIVATE RESIDENCE | COHABITATES WITH OTHERS | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | NO | NaN | NO | NO | NO | NO | 113 |
| 4 | OUTPATIENT | LONG ISLAND REGION | CHILD | FEMALE | PRIVATE RESIDENCE | COHABITATES WITH OTHERS | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | YES | MENTAL ILLNESS | NO | NO | NO | NO | 115 |
5 rows × 32 columns
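One quick way to confirm this is to check the column dtypes: everything Pandas has read as a string is stored with the object dtype. A minimal check:
# Columns Pandas read as strings ("object" dtype) - here these are
# exactly the categorical columns
categorical_cols = df.select_dtypes(include="object").columns
print(len(categorical_cols))  # 31 - everything except the ZIP code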
Numerical Encodings for Categorical Data
Typically, it’s not possible to input strings representing categories directly into a machine learning algorithm or other analysis, so they must be transformed into a numerical form first. This also applies if we have coded categories (such as ID numbers) in the data - blindly feeding these into an algorithm can have unexpected results. Below we describe common ways of encoding categorical data into numeric forms.
Here we are talking about encoding data with a countable number of categories. For data without a set of pre-defined options, like free text fields, it’s more appropriate to use other feature engineering techniques (such as those from Natural Language Processing - see Section 2.2.4.2).
Binary Data
A lot of the columns in this dataset have only two possible values, such as “Yes/No” questions like whether a patient smokes:
df["Smokes"].value_counts()
NO 3387
YES 1387
Name: Smokes, dtype: int64
In this case it’s natural to encode “NO” as 0 and “YES” as 1:
yesno_encoding = {
"YES": 1,
"NO": 0,
}
df["Smokes"] = df["Smokes"].replace(yesno_encoding)
df["Smokes"].value_counts()
0.0 3387
1.0 1387
Name: Smokes, dtype: int64
It’s also possible to replace YES/NO values in multiple columns at a time, e.g., with df[["Smokes", "Cancer"]].replace(yesno_encoding), or in the whole data frame with df.replace(yesno_encoding).
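For example, a minimal sketch applying the same encoding to two more of the Yes/No columns and assigning the result back (assuming these columns only contain YES/NO and missing values):
# Encode two Yes/No columns in a single call and assign them back
yesno_cols = ["Cancer", "Veteran Status"]
df[yesno_cols] = df[yesno_cols].replace(yesno_encoding)
df[yesno_cols].head()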
This can be applied to any data with only two possible values. For example, a coin flip dataset with a column side = ["Heads", "Tails", "Heads"] could be encoded as a single column heads = [1, 0, 1].
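As a minimal self-contained sketch of that coin-flip example:
# Encode a two-category column as 0/1
side = pd.Series(["Heads", "Tails", "Heads"])
heads = side.replace({"Heads": 1, "Tails": 0})
heads.tolist()  # [1, 0, 1]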
Data with Multiple Categories
Ordinal Data
Ordinal data is categorical data whose categories have a natural order (we could sort them) - for example, the number of hours worked per week in this dataset:
df["Number Of Hours Worked Each Week"].value_counts()
35 HOURS OR MORE 309
15-34 HOURS 251
01-14 HOURS 119
UNKNOWN EMPLOYMENT HOURS 110
Name: Number Of Hours Worked Each Week, dtype: int64
This is essentially binned data (see Section 2.2.3 on feature engineering) - the source data is numeric, but we have it summarised in groups due to the design of the survey. It’s also common to see this done for privacy/anonymisation reasons in public versions of sensitive datasets.
One approach here is to encode the data using the mid-point of each group:
hours_encoding = {
"01-14 HOURS": 7.5,
"15-34 HOURS": 24.5,
"35 HOURS OR MORE": 35,
"UNKNOWN EMPLOYMENT HOURS": np.nan,
}
df["Number Of Hours Worked Each Week"] = df["Number Of Hours Worked Each Week"].replace(
hours_encoding
)
We could use techniques for manipulating strings, like those seen in Section 2.2.4.2, to extract the digits and calculate the mid-points rather than doing it manually.
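For example, a rough sketch of that idea using str.extract, applied to the original string labels (i.e., before the replace above). The regular expression assumes the “NN-NN HOURS” format, so the open-ended and unknown groups would still need handling separately:
# Pull the lower and upper bounds out of labels like "15-34 HOURS";
# groups that don't match the pattern fall out as NaN
hours = pd.Series(["01-14 HOURS", "15-34 HOURS", "35 HOURS OR MORE"])
bounds = hours.str.extract(r"(\d+)-(\d+)").astype(float)
midpoints = (bounds[0] + bounds[1]) / 2  # 7.5, 24.5, NaN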
How could you decide what value to use for the “35 HOURS OR MORE” group? Is 35 likely to be a good choice?
Alternatively, we can have ordinal data that was not numeric to start with, such as the “Education Status” column in this dataset:
df["Education Status"].value_counts()
MIDDLE SCHOOL TO HIGH SCHOOL 2585
SOME COLLEGE 737
COLLEGE OR GRADUATE DEGREE 699
PRE-K TO FIFTH GRADE 433
OTHER 68
NO FORMAL EDUCATION 24
Name: Education Status, dtype: int64
Here a sensible choice could be to encode the values by a ranking of how long the person was in education:
edu_encoding = {
"NO FORMAL EDUCATION": 0,
"PRE-K TO FIFTH GRADE": 1,
"MIDDLE SCHOOL TO HIGH SCHOOL": 2,
"SOME COLLEGE": 3,
"COLLEGE OR GRADUATE DEGREE": 4,
"OTHER": np.nan,
}
df["Education Status"] = df["Education Status"].replace(edu_encoding)
One-Hot Encoding
What if the categories don’t have a clear order? One example is “Region Served” in the patient characteristics survey:
df["Region Served"].value_counts()
NEW YORK CITY REGION 2486
WESTERN REGION 847
HUDSON RIVER REGION 775
CENTRAL NY REGION 534
LONG ISLAND REGION 358
Name: Region Served, dtype: int64
If we assigned each region a number from 1-5, an algorithm could try to infer relationships that don’t exist. For example, we may encode WESTERN REGION = 2 and CENTRAL NY REGION = 4, but this means we’re representing Central NY as “double” Western NY - double what?
Instead, a common choice is to “one-hot” encode the column, converting it into a binary column for each category. This is also referred to as creating “dummy variables”, and can be done with the pd.get_dummies function:
pd.get_dummies(df["Region Served"])
| | CENTRAL NY REGION | HUDSON RIVER REGION | LONG ISLAND REGION | NEW YORK CITY REGION | WESTERN REGION |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 4995 | 0 | 0 | 0 | 0 | 1 |
| 4996 | 0 | 0 | 0 | 0 | 1 |
| 4997 | 1 | 0 | 0 | 0 | 0 |
| 4998 | 1 | 0 | 0 | 0 | 0 |
| 4999 | 0 | 0 | 1 | 0 | 0 |
5000 rows × 5 columns
The first patient was from the Hudson River region, so has a one in the “HUDSON RIVER REGION” column and zeros in the remaining columns, and so on. Treating the data this way allows a model to separately fit parameters explaining the relationship between each region and whatever we’re trying to predict.
Technically we only need four of the five columns above to represent the possible region values. If we didn’t include the “WESTERN REGION” column, for example, we could still tell whether someone had care in that region by all the remaining columns being zero. Setting drop_first=True in pd.get_dummies will return one fewer column.
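For example:
# Drops the first category alphabetically ("CENTRAL NY REGION"),
# which is then represented by a row of zeros in the other columns
pd.get_dummies(df["Region Served"], drop_first=True)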
In some cases, creating many new columns with one-hot encoding could be problematic, for example when a column has a very large number of unique categories. This blog post gives an overview of when this might be a problem and possible solutions.
Grouping Categories
Rather than creating many new columns for all the possible categories, it may be appropriate to summarise them in sub-groups. For example, these are the categories in the “Preferred Language” column:
df["Preferred Language"].value_counts()
ENGLISH 4459
SPANISH 359
INDO-EUROPEAN 55
ASIAN AND PACIFIC ISLAND 27
ALL OTHER LANGUAGES 26
AFRO-ASIATIC 14
Name: Preferred Language, dtype: int64
If our research question related to finding differences between the care given to English and non-English speakers, we could encode the data as a single binary column (though you should consider this choice carefully - with this encoding we would no longer be able to identify differences between the other languages):
prefer_english = (df["Preferred Language"] == "ENGLISH").astype(int)
prefer_english.value_counts()
1 4459
0 541
Name: Preferred Language, dtype: int64
The use of astype(int) above converts the created boolean series of True/False values to a series of ones (True) and zeros (False), which may be more suitable for input to a model.
Any value that is not “ENGLISH”, including missing values, is encoded as 0 above, i.e., 0 means “unknown or not English”. It may be better to restore the missing values to avoid mislabelling patients whose preferred language is unknown:
prefer_english[df["Preferred Language"].isnull()] = np.nan
Which Encoding to Choose?
The encodings described above are the most common, but there are others, and there’s no fixed rule that you must use a certain type of encoding for a certain type of data. There may be applications where one-hot encoding an ordinal column would be appropriate, for example.
Other Encoding Implementations
In the examples above we mostly convert the categorical values by defining Python dictionaries with the encodings and using the Pandas replace function. This works, but there are other functions and libraries which can help do the job (and may be more maintainable and scalable depending on your use-case):
- Pandas has a categorical data type - see the Pandas documentation.
- The LabelEncoder, OrdinalEncoder and OneHotEncoder classes in the scikit-learn library (a brief sketch follows this list).
- Some machine learning algorithms, such as CatBoost, have in-built support for categorical data.
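As a rough sketch of the first two options (the exact API may vary between versions, so check the Pandas and scikit-learn documentation):
# Pandas categorical dtype: stores each value as an integer code
# plus a lookup table of the category labels
region_cat = df["Region Served"].astype("category")
region_cat.cat.codes.head()  # the integer codes
region_cat.cat.categories  # the labels the codes refer to

# scikit-learn's OneHotEncoder: fitted on one dataset, it can be
# reused to transform new data with the same columns
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
region_onehot = encoder.fit_transform(df[["Region Served"]])  # sparse matrix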
Unseen Values
If you’re developing a model with categorical data, it’s important to consider what to do, or how the model will behave, with data from new groups that are not present in the original dataset. Say you plan to use a model developed on the 2017 PCS data on new data from 2019, but a new region in New York has been added. How will patients from that region be treated by the model?
This is essentially a different form of missing data (in this case data the model doesn’t know how to interpret), which we discuss later.
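For example, scikit-learn’s OneHotEncoder has a handle_unknown argument controlling this behaviour - with handle_unknown="ignore", a category not seen during fitting is encoded as all zeros rather than raising an error:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(df[["Region Served"]])

# "NEW REGION" is a hypothetical region not in the 2017 data
new_data = pd.DataFrame({"Region Served": ["NEW REGION"]})
encoder.transform(new_data).toarray()  # array([[0., 0., 0., 0., 0.]])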
Data Dictionaries
We’ve talked about converting a dataset of strings into numerical representations that can be incorporated into a model or other further analysis. It’s also common to be given a dataset that has coded values (numbers or other identifiers) to start with. For example, what if we started with data where Education Status was encoded as values 1 to 5? How would we know what 3 means?
Datasets should come with “data dictionaries” that explain what these values mean. You could change the encoded values to their meanings by defining them in a Python dictionary and using replace in the same way as above. Or you may need to link to another table or dataset - see linking datasets, Section 2.2.6.
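For example, a minimal sketch with a made-up coding (in practice the mapping would come from the dataset’s data dictionary):
# Hypothetical data dictionary for a coded Education Status column
edu_decoding = {
    1: "NO FORMAL EDUCATION",
    2: "PRE-K TO FIFTH GRADE",
    3: "MIDDLE SCHOOL TO HIGH SCHOOL",
    4: "SOME COLLEGE",
    5: "COLLEGE OR GRADUATE DEGREE",
}
coded = pd.Series([3, 5, 1])
coded.replace(edu_decoding)  # back to human-readable labels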