eider is a lightweight package for processing tabular
data in a declarative fashion. Users may specify a set of operations to
be performed on a table using JSON, which are then executed by the
package. The primary use case of eider is for extraction of
machine learning features from health data, but eider can
in principle be used for any kind of data.
The usage of eider in R source code itself is
straightforward, and consists of a single call to
run_pipeline().
Data
To illustrate this, we will construct some very simplistic data, which may be, for example, a record of patients who attended their GP and their associated complaints.
example_table <- data.frame(
patient_id = c(1, 1, 1, 2, 2, 3, 3, 3),
attendance_reason = c(6, 6, 7, 6, 6, 7, 7, 7)
)
data_sources <- list(attendances = example_table)In practice, it is more likely that you will be reading in data from
a file instead. For example, if you had a CSV file called
attendances.csv in the current working directory, you could
just do:
data_sources_2 <- list(attendances = "attendances.csv")eider allows you to mix and match data sources, so you
could have some data in a CSV file and some in an R data frame:
data_sources_3 <- list(
attendances = example_table, # A variable which has already been constructed
other_data = "other_data.csv" # A file to be read in
)This allows the user to, for example, perform preprocessing on a portion of their data if so needed.
Feature specification
Suppose we want to extract a feature corresponding to the total
number of times a patient attended for reason 6. eider
requires that the feature is specified as JSON, which looks like
this:
{
"transformation_type": "count",
"source_table": "attendances",
"grouping_column": "patient_id",
"absent_default_value": 0,
"output_feature_name": "total_attendances",
"filter": {
"column": "attendance_reason",
"type": "in",
"value": [
6
]
}
}
-
transformation_typetells you what kind of overall operation is being performed. This determines which other fields are required in the JSON. -
source_tablespecifies the name of the data source to be used in the list of data sources. -
grouping_columnspecifies the columns to group by. -
absent_default_valuespecifies what to do if there is no data for a particular patient ID. -
output_feature_namespecifies the name of the column to be created in the output table. -
filteris a filter object which is used to select rows from the input table which match particular conditions.
Subsequent vignettes will go into more detail about the different types of transformations and the required JSON fields for each of them.
Performing the transformation
To obtain the desired feature, we can place the JSON above in a file
(here json_examples/eider.json)
and simply do:
run_pipeline(
data_sources = data_sources,
feature_filenames = "json_examples/eider.json"
)
#> $features
#> id total_attendances
#> 1 1 2
#> 2 2 2
#> 3 3 0
#>
#> $responses
#> id
#> 1 1
#> 2 2
#> 3 3As expected, both patients 1 and 2 have attended for reason 6 twice, and patient 3 has not.
run_pipeline() returns a list of two data frames, called
features and responses respectively. These refer to
data used for training machine learning models: features are
the independent variables (i.e. X), and responses
are the dependent variables (i.e. y). For consistency,
eider always returns both of these data frames and ensures
that both of them have the same list of IDs. Responses may be specified
in exactly the same way as features, but using the
response_filenames argument instead of
feature_filenames.
As an alternative to placing the JSON in a file and providing the
filename to run_pipeline(), you can also provide the JSON
directly as a string:
json_string <- '{
"transformation_type": "count",
"source_table": "attendances",
"grouping_column": "patient_id",
"absent_default_value": 0,
"output_feature_name": "total_attendances",
"filter": {
"column": "attendance_reason",
"type": "in",
"value": [6]
}
}'
run_pipeline(
data_sources = data_sources,
feature_filenames = json_string
)
#> $features
#> id total_attendances
#> 1 1 2
#> 2 2 2
#> 3 3 0
#>
#> $responses
#> id
#> 1 1
#> 2 2
#> 3 3Next steps
If you want to read a more step-by-step guide to the features of
eider, move on to the next
vignette, which covers the different types of features that
eider lets you define.
Alternatively, jump ahead to the gallery
section to see some examples of features that you might use
eider to calculate.