Daedalus is a novel dynamic spatial microsimulation pipeline that allows users to produce (custom) population projections for policy intervention analysis. Currently, it provides simulation utilities for the whole of the United Kingdom at the local authority (LA) level.
Daedalus builds on code written by Andrew Smith, available at https://github.com/nismod/microsimulation/
Daedalus is being developed in collaboration between Leeds Institute for Data Analytics and the Alan Turing Institute as part of the SPENSER (Synthetic Population Estimation and Scenario Projection Model) project.
We strongly recommend installation via Anaconda.
Create and activate a new environment for Daedalus, then clone the repository and install it:
conda create -n daedalus python=3.7
conda activate daedalus
git clone https://github.com/alan-turing-institute/daedalus.git
cd /path/to/my/daedalus
pip install -v -e .
Daedalus can be run via command line. The following command displays all available options:
python scripts/run.py --help
Output:
usage: run.py [-h] -c config-file [--location LOCATION]
              [--input_data_dir INPUT_DATA_DIR]
              [--persistent_data_dir PERSISTENT_DATA_DIR]
              [--output_dir OUTPUT_DIR]

Dynamic Microsimulation

optional arguments:
  -h, --help            show this help message and exit
  -c config-file, --config config-file
                        the model config file (YAML)
  --location LOCATION   LAD code
  --input_data_dir INPUT_DATA_DIR
                        directory where the input data is
  --persistent_data_dir PERSISTENT_DATA_DIR
                        directory where the persistent data is
  --output_dir OUTPUT_DIR
                        directory where the output data is saved
For example, to run a simulation for LAD E08000032:
:warning: This takes around 2 to 3 hours (depending on your machine) to finish.
python scripts/run.py -c config/default_config.yaml --location E08000032 --input_data_dir data --persistent_data_dir persistent_data --output_dir output
In the above command:
--location E08000032 is the LAD code. Note that Daedalus can be run in parallel for several LADs; refer to section: Speeding up simulations over several LADs by parallelization.
--input_data_dir data is the directory where ssm_E08000032_MSOA11_ppp_2011.csv is located.
As an example, when running the above command, Daedalus stores the results in the following directory structure:
output
└── E08000032
    ├── config_file_E08000032.yml
    ├── ssm_E08000032_MSOA11_ppp_2011_processed.csv
    ├── ssm_E08000032_MSOA11_ppp_2011_simulation.csv
    ├── year_1
    │   └── ssm_E08000032_MSOA11_ppp_2011_simulation_year_1.csv
    └── year_2
        └── ssm_E08000032_MSOA11_ppp_2011_simulation_year_2.csv
with the following messages on the terminal:
❯ python scripts/run.py -c config/default_config.yaml --location E08000032 --input_data_dir data --persistent_data_dir persistent_data --output_dir output
Start Population Size: 524213
Write config file successful
Write the dataset at: output_single/E08000032/ssm_E08000032_MSOA11_ppp_2011_processed.csv
Computing immigration OD matrices...
Computing internal migration rate table...
Caching rate table...
Cached to persistent_data/internal_migration_rate_table_1.csv
Computing mortality rate table...
Caching rate table...
Cached to persistent_data/mortality_rate_table_1.csv
Computing fertility rate table...
Caching rate table...
Cached to persistent_data/fertility_rate_table_1.csv
Computing emigration rate table...
Caching rate table...
Cached to persistent_data/emigration_rate_table_1.csv
Computing immigration rate table...
Caching rate table...
Cached to persistent_data/immigration_rate_table_E08000032_1.csv
Computing total immigration number for location E08000032
Start simulation setup
2020-10-30 10:18:26
2020-10-30 10:18:26.363 | DEBUG | vivarium.framework.values:register_value_modifier:373 - Registering metrics.1.population_manager.metrics as modifier to metrics
2020-10-30 11:05:54.951 | DEBUG | vivarium.framework.values:_register_value_producer:323 - Registering value pipeline int_outmigration_rate
.
.
.
2020-10-30 13:58:07.094 | DEBUG | vivarium.framework.engine:step:140 - 2012-12-01 01:30:00
Finished running simulation for year: 2
2020-10-30 14:01:03
In year: 2013
alive 539883
dead 8797
emigrated 2984
internal migration 551664
New children 18547
Immigrants 8904
Finished running the full simulation
alive 539883
dead 8797
emigrated 2984
internal migration 551664
New children 18547
Immigrants 8904
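The single-LAD invocation shown earlier can also be assembled from Python, which may be convenient when scripting runs for different LAD codes. The following is a minimal sketch and not part of Daedalus: the helper name build_run_command is hypothetical, it only uses the flags listed in the help output above, and it assumes the command is launched from the repository root.

```python
import subprocess
import sys


def build_run_command(location,
                      config="config/default_config.yaml",
                      input_data_dir="data",
                      persistent_data_dir="persistent_data",
                      output_dir="output"):
    """Return the argument list for one scripts/run.py invocation.

    Hypothetical helper: the defaults mirror the example command in this
    document and are assumptions, not values enforced by Daedalus.
    """
    return [
        sys.executable, "scripts/run.py",
        "-c", config,
        "--location", location,
        "--input_data_dir", input_data_dir,
        "--persistent_data_dir", persistent_data_dir,
        "--output_dir", output_dir,
    ]


cmd = build_run_command("E08000032")
print(" ".join(cmd))
# To actually launch the (multi-hour) simulation, uncomment the next line:
# subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids shell quoting issues, for instance with combined codes that contain a plus sign.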
In the previous section, we ran the simulation over one LAD (specified by --location E08000032).
The simulation took around 2 to 3 hours to finish.
To speed up the simulations over several LADs, Daedalus can be run in parallel.
For example, the following command runs various LAD codes (specified by --path_pop_files "data/ssm_*ppp*csv", wildcard accepted) on five processes in parallel (specified by --process_np 5):
python scripts/parallel_run.py -c config/default_config.yaml --path_pop_files "data/ssm_*ppp*csv" --input_data_dir data --persistent_data_dir persistent_data --output_dir output --process_np 5
In this command:
--path_pop_files "data/ssm_*ppp*csv": LAD codes of all files ssm_*ppp*csv will be used.
--input_data_dir data is the directory where the ssm_*ppp*csv files are located.
The following command displays all available options:
python scripts/parallel_run.py --help
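To preview which LAD codes a wildcard such as "data/ssm_*ppp*csv" would select, the codes can be read off the file names. This is a rough sketch under an assumption: the names LAD_PATTERN and lad_codes_from_paths are hypothetical, and the pattern is inferred from the example file ssm_E08000032_MSOA11_ppp_2011.csv, so it may not match how parallel_run.py derives the codes internally.

```python
import glob
import os
import re

# Assumed file-name layout, inferred from ssm_E08000032_MSOA11_ppp_2011.csv.
# The "+" allows combined codes such as E09000001+E09000033.
LAD_PATTERN = re.compile(r"^ssm_(?P<lad>[A-Z0-9+]+)_MSOA11_ppp_\d{4}\.csv$")


def lad_codes_from_paths(paths):
    """Return the sorted, de-duplicated LAD codes found in the file names."""
    codes = set()
    for path in paths:
        match = LAD_PATTERN.match(os.path.basename(path))
        if match:
            codes.add(match.group("lad"))
    return sorted(codes)


# Lists the codes matched by the wildcard (empty if no such files exist).
print(lad_codes_from_paths(glob.glob("data/ssm_*ppp*csv")))
```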
After running the simulation in section: Run Daedalus via command line, the results are stored in the directory specified by --output_dir, e.g., output in the command above.
In our example, it contains the following directories/files because we ran the simulation for 2 years:
output
└── E08000032
    ├── config_file_E08000032.yml
    ├── ssm_E08000032_MSOA11_ppp_2011_processed.csv
    ├── ssm_E08000032_MSOA11_ppp_2011_simulation.csv
    ├── year_1
    │   └── ssm_E08000032_MSOA11_ppp_2011_simulation_year_1.csv
    └── year_2
        └── ssm_E08000032_MSOA11_ppp_2011_simulation_year_2.csv
To evaluate the results, we need to:
Individuals who moved LAD_code_1 ---> LAD_code_2 should be added to the population file of LAD_code_2. This step is required since Daedalus works and stores the results at LAD level.
:warning: Note that in the example of section: Run Daedalus via command line, we simulated only one LAD. However, in realistic examples, more than one LAD will be stored in the output directory.
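The idea of moving individuals from the file of LAD_code_1 to the file of LAD_code_2 can be illustrated with a small regrouping sketch. It is not the logic of scripts/validation.py: the function name reassign_migrants and the field name current_lad are hypothetical, and the real output files may record an individual's location differently.

```python
from collections import defaultdict


def reassign_migrants(populations):
    """Regroup individuals by the LAD they currently live in.

    populations maps the LAD code of each output file to a list of person
    records (dicts). Each record is assumed to carry a hypothetical
    'current_lad' field; records without it stay in their source LAD.
    """
    regrouped = defaultdict(list)
    for source_lad, people in populations.items():
        for person in people:
            regrouped[person.get("current_lad", source_lad)].append(person)
    return dict(regrouped)


# Toy data: person 2 was simulated in LAD_code_1 but moved to LAD_code_2.
populations = {
    "LAD_code_1": [{"id": 1, "current_lad": "LAD_code_1"},
                   {"id": 2, "current_lad": "LAD_code_2"}],
    "LAD_code_2": [{"id": 3, "current_lad": "LAD_code_2"}],
}
for lad, people in reassign_migrants(populations).items():
    print(lad, [person["id"] for person in people])
```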
The above two steps can be run via one command line:
python scripts/validation.py --simulation_dir output --persistent_data_dir persistent_data
:warning: Note that the above command requires the directory structure created by the Daedalus command line in section: Run Daedalus via command line. If for any reason the simulation of an LAD did not complete, this directory structure might be different (e.g., only one year may have been saved) and the validation code will break when it tries to find files in this structure. To solve this, remove the LAD directories where the simulation did not fully finish. The required structure is:
output
└── E08000032
    ├── config_file_E08000032.yml
    ├── ssm_E08000032_MSOA11_ppp_2011_processed.csv
    ├── ssm_E08000032_MSOA11_ppp_2011_simulation.csv
    ├── year_1
    │   └── ssm_E08000032_MSOA11_ppp_2011_simulation_year_1.csv
    └── year_2
        └── ssm_E08000032_MSOA11_ppp_2011_simulation_year_2.csv
The following command displays all available options:
python scripts/validation.py --help
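Since the validation step expects every LAD directory to be complete, it can help to look for unfinished runs before launching it. The sketch below is an assumption-laden illustration, not part of Daedalus: the function names are hypothetical, and the expected file names are copied from the example tree above (including the fixed ssm_..._MSOA11_ppp_2011 stem), so they may need adjusting for other inputs.

```python
import os


def expected_files(lad, num_years, stem_template="ssm_{lad}_MSOA11_ppp_2011"):
    """List the relative paths a finished run is assumed to produce."""
    stem = stem_template.format(lad=lad)
    files = [
        f"config_file_{lad}.yml",
        f"{stem}_processed.csv",
        f"{stem}_simulation.csv",
    ]
    for year in range(1, num_years + 1):
        files.append(os.path.join(f"year_{year}",
                                  f"{stem}_simulation_year_{year}.csv"))
    return files


def incomplete_lads(simulation_dir, num_years):
    """Return (lad, missing_files) pairs for LAD directories lacking files."""
    incomplete = []
    for lad in sorted(os.listdir(simulation_dir)):
        lad_dir = os.path.join(simulation_dir, lad)
        if not os.path.isdir(lad_dir):
            continue
        missing = [rel for rel in expected_files(lad, num_years)
                   if not os.path.isfile(os.path.join(lad_dir, rel))]
        if missing:
            incomplete.append((lad, missing))
    return incomplete


# Report unfinished LADs in the example output directory, if it exists.
if os.path.isdir("output"):
    for lad, missing in incomplete_lads("output", num_years=2):
        print(f"{lad}: missing {len(missing)} file(s)")
```

The sketch only reports candidates; deleting the directories is left as a manual step so that partial results are not removed by accident.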
Next, we plot the results in this notebook.
In another notebook, the results are plotted on maps.
We use the cartopy library to plot maps in this notebook. cartopy is not installed by default. Please follow the instructions here:
https://scitools.org.uk/cartopy/docs/latest/installing.html
The directories/files needed to run a Daedalus simulation are provided in the persistent_data and data directories on the repo.
If you are planning to run the microsimulation pipeline on the LADs
E09000001, E09000033, E06000052 and E06000053, beware that the rates of these LADs are merged in the following way:
E09000001+E09000033
E06000052+E06000053
(For all other LADs, the rates are at individual level.)
It is still possible to run the simulations for E09000001, E09000033, E06000052 and E06000053 individually, but the pipeline will use the combined rates and immigrated values as specified above.
The most appropriate way to deal with this is to run the microsimulation from a combined LAD starting file, instead of individually. For example, to run the simulation for the LADs E09000001+E09000033, use a starting file ssm_E09000001+E09000033_MSOA11_ppp_2011.csv that contains the starting population from both E09000001 and E09000033:
python scripts/run.py -c config/default_config.yaml --location E09000001+E09000033 --input_data_dir data --persistent_data_dir persistent_data --output_dir output
Similarly, this should be done for E06000052+E06000053.
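One possible way to build such a combined starting file is to concatenate the individual starting population files. The following is a sketch under stated assumptions: combine_population_files is a hypothetical helper, the individual file names are assumed to follow the ssm_<LAD>_MSOA11_ppp_2011.csv pattern, both files are assumed to share the same header, and person identifiers are assumed not to clash across files.

```python
import csv
import os


def combine_population_files(input_paths, output_path):
    """Concatenate CSV files sharing one header; return the data-row count."""
    header = None
    rows_written = 0
    with open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in input_paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Assumed input names; run only if both individual files are present.
inputs = ["data/ssm_E09000001_MSOA11_ppp_2011.csv",
          "data/ssm_E09000033_MSOA11_ppp_2011.csv"]
if all(os.path.isfile(path) for path in inputs):
    total = combine_population_files(
        inputs, "data/ssm_E09000001+E09000033_MSOA11_ppp_2011.csv")
    print(f"Wrote {total} individuals")
```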
Daedalus reads a config file specified by the -c flag (see section: Run Daedalus via command line).
An example config file is provided on the repo, see: config/default_config.yaml
This config file contains the following options:
randomness:
    key_columns: ['entrance_time', 'age']
input_data:
    location: 'UK'
time:
    start: {year: 2011, month: 1, day: 1}
    end: {year: 2012, month: 1, day: 1}
    step_size: 30.4375 # Days
    num_years: 2
population:
    age_start: 0
    age_end: 100
mortality_file: 'Mortality2011_LEEDS1_2.csv'
fertility_file: 'Fertility2011_LEEDS1_2.csv'
emigration_file: 'Emig_2011_2012_LEEDS2.csv'
immigration_file: 'Immig_2011_2012_LEEDS2.csv'
total_population_file: 'MY2011AGEN.csv'
msoa_to_lad: 'Middle_Layer_Super_Output_Area__2011__to_Ward__2016__Lookup_in_England_and_Wales.csv'
OD_matrix_dir: 'od_matrices'
OD_matrix_index_file: 'MSOA_to_OD_index.csv'
internal_outmigration_file: 'InternalOutmig2011_LEEDS2.csv'
immigration_MSOA : 'Immigration_MSOA_M_F.csv'
ethnic_lookup: 'ethnic_lookup.csv'
components : [TestPopulation(),InternalMigration(), Mortality(), Emigration(), FertilityAgeSpecificRates(),Immigration()]
scale_rates:
    # methods:
    # constant: all rates regardless of age/sex/... will be multiplied by the specified factor
    #           if 1, the original rates will be used
    method: "constant"
    constant:
        mortality: 1
        fertility: 1
        emigration: 1
        immigration: 1
        internal_migration: 1
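The effect of the "constant" scaling method described in the comments can be illustrated as follows. This is a sketch of the stated behaviour only (every rate multiplied by one factor), not Daedalus's implementation: scale_rates_constant is a hypothetical name, and the toy rates keyed by (age, sex) are made up for illustration.

```python
def scale_rates_constant(rates, factor):
    """Multiply every rate by the same factor, regardless of age, sex, etc."""
    if factor < 0:
        raise ValueError("factor must be non-negative")
    return {key: rate * factor for key, rate in rates.items()}


# Factors as they appear under scale_rates -> constant in the config.
scale_factors = {
    "mortality": 1,
    "fertility": 1,
    "emigration": 1,
    "immigration": 1,
    "internal_migration": 1,
}

# Hypothetical mortality rates keyed by (age, sex); real tables differ.
mortality_rates = {(30, "M"): 0.001, (30, "F"): 0.0008}

scaled = scale_rates_constant(mortality_rates, scale_factors["mortality"])
# With a factor of 1 the original rates are returned unchanged.
print(scaled == mortality_rates)
```

Setting, for example, the mortality factor to 0.5 in this sketch would halve every mortality rate, which is the kind of intervention scenario the scale_rates option is meant to express.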