Filtering relies on a filter
key in the feature JSON, for which the corresponding value is an object:
{
"output_feature_name": "..."
...,
"filter": {
...
},
...
}
This vignette will describe the types of filters available and how they can be expressed in this JSON object.
Basic filters
The most basic filter would be to accept all rows where the value in a given column meets a certain criterion. For example, we may only want to look at rows for which the value in the age
column is greater than 18. There are three pieces of information here which must be encoded:
-
(string)
column
: the name of the column to filter on -
(string)
type
: the type of filter to apply -
value
: the value to compare against. The type of this must match the type of the column specified.
The type
key can take one of the following values:
-
"in"
In this case, the
value
key can either be a single value or an array of multiple values. This filers for rows where the value in the column matches one of the provided values. -
"gt"
/"lt"
/"gt_eq"
/"lt_eq"
The value in the column must be greater than, less than, greater than or equal to, or less than or equal to the
value
respectively. For these filters, only onevalue
can be passed. -
"date_gt"
/"date_lt"
/"date_gt_eq"
/"date_lt_eq"
The same as above, but to be used for dates. This distinction is mandatory because JSON does not have a native date type, so the
value
must be passed as a string in the formatYYYY-MM-DD
.
Thus, for example, if we assume that the column age
contains integers, then
{
"filter": {
"column": "age",
"type": "in",
"value": [18, 19, 20]
}
}
selects for people who are 18, 19, or 20 years old; and likewise if the date_of_birth
column contains dates, then
{
"filter": {
"column": "date_of_birth",
"type": "date_gt_eq",
"value": "2000-01-01"
}
}
selects for people born on or after January 1, 2000.
As a reminder, the table to be filtered is passed as the source_table
key in the feature JSON. This is described in the introductory vignette.
Compound filters
Filters may also be combined using the Boolean logic operators AND, OR, and NOT. Instead of providing the three keys column
, type
, and value
, only type
is specified, together with subfilter
, which is itself an object which maps filter names to filter objects.
An example is probably the easiest way to explain this. If we wanted to select for people who are not 18, 19, or 20 years old, we could write:
{
"filter": {
"type": "not",
"subfilter": {
"age_filter": {
"column": "age",
"type": "in",
"value": [18, 19, 20]
}
}
}
}
The filter name, age_filter
, can be any string you want: it is used only for logging purposes. Notice how the object that age_filter
refers to, i.e.,
{
"column": "age",
"type": "in",
"value": [18, 19, 20]
}
is itself just a basic filter as has already been described above.
or
and and
filters work in exactly the same way, with the sole difference being that to accomplish any meaningful results you would generally want the subfilter
object to contain two or more filters that work in tandem. For example, the below filters for 18-year-olds who have a height of over 180 (presumably centimetres; although this depends on the input table!):
{
"filter": {
"type": "and",
"subfilter": {
"age_filter": {
"column": "age",
"type": "in",
"value": [18]
},
"height_filter": {
"column": "height",
"type": "gt",
"value": 180
}
}
}
}
Again, age_filter
and height_filter
are just names for logging purposes, and themselves refer to basic filters.
Finally, note that the subfilters need not necessarily be basic filters, though: they can be compound filters themselves. So it is possible to nest filters as deeply as you like to accomplish the desired result.
Examples
Consider the following input table with five rows:
input_table <- data.frame(
id = c(1, 1, 1, 2, 3),
date = as.Date(c(
"2024-01-01", "2024-01-02", "2024-01-03",
"2023-01-01", "2023-01-01"
)),
diagnosis = c("A", "B", "C", "B", "C")
)
input_table
#> id date diagnosis
#> 1 1 2024-01-01 A
#> 2 1 2024-01-02 B
#> 3 1 2024-01-03 C
#> 4 2 2023-01-01 B
#> 5 3 2023-01-01 C
and suppose we would like to count the number of times each person has had a diagnosis of “B”. This is a transformation_type
of "count"
; but if we do not perform any filtering, "count"
will simply return the number of times each person appears in this dataframe.
The following JSON is provided as json_examples/filter1.json
:
{
"output_feature_name": "unfiltered_count",
"transformation_type": "count",
"source_table": "input_table",
"grouping_column": "id",
"absent_default_value": 0
}
Passing this into run_pipeline()
will give us the following result:
results <- run_pipeline(
data_sources = list(input_table = input_table),
feature_filenames = "json_examples/filter1.json"
)
results$features
#> id unfiltered_count
#> 1 1 3
#> 2 2 1
#> 3 3 1
Instead of this, if we specify a filter for the diagnosis
column (in json_examples/filter2.json
), we can achieve the desired result:
{
"output_feature_name": "filtered_count",
"transformation_type": "count",
"source_table": "input_table",
"grouping_column": "id",
"absent_default_value": 0,
"filter": {
"column": "diagnosis",
"type": "in",
"value": [
"B"
]
}
}
results <- run_pipeline(
data_sources = list(input_table = input_table),
feature_filenames = "json_examples/filter2.json"
)
results$features
#> id filtered_count
#> 1 1 1
#> 2 2 1
#> 3 3 0
Note how patient 3 has a count of 0, because they have never been diagnosed with “B”. This is specified by the absent_default_value
key in the JSON.
If we would like to find diagnoses of “B” only in 2024, then we can add a date filter to the mix, and combine it with the diagnosis filter with an AND filter (json_examples/filter3.json
):
{
"output_feature_name": "doubly_filtered_count",
"transformation_type": "count",
"source_table": "input_table",
"grouping_column": "id",
"absent_default_value": 0,
"filter": {
"type": "and",
"subfilter": {
"diagnosis_filter": {
"column": "diagnosis",
"type": "in",
"value": [
"B"
]
},
"date_filter": {
"column": "date",
"type": "date_gt_eq",
"value": "2024-01-01"
}
}
}
}
results <- run_pipeline(
data_sources = list(input_table = input_table),
feature_filenames = "json_examples/filter3.json"
)
results$features
#> id doubly_filtered_count
#> 1 1 1
#> 2 2 0
#> 3 3 0
Patient 2’s diagnosis has been excluded this time because the corresponding entry came in 2023.
See also
Filtering is used widely in the Gallery examples. See, for example:
- PIS feature 2 and SMR04 features 1 and 2 both use basic filters;
-
The LTC examples use filters to remove rows with missing data (
NA
values); - A&E feature 2 has a doubly nested compound filter.