Analyzer Guides

Analyze reported outputs from a collection of a simulations.

Most of EMOD generated output files are in json format, and the remainder are csv’s. In emodpy, Analyzer functions facilitate extracting information from EMOD’s raw output files to produce results in csv’s or as figures.

Required modules

import os
import pandas as pd
import numpy as np
from idmtools.entities import IAnalyzer 
from idmtools.entities.simulation import Simulation
from idmtools.analysis.analyze_manager import AnalyzeManager
from idmtools.core import ItemType
from idmtools.core.platform_factory import Platform

Other modules such as datetime may also be helpful, depending on the type of output and desired manipulations.

Basic structure of an Analyzer

The Analyzer Class should be defined using a meaningful name, i.e. InsetChartAnalyzer for an analyzer that processes InsetChart.json outputs. A class in Python allows constructing custom objects with associated attributes and functions. Each class starts with a definition of custom parameters and objects attributes via __init__ and self. This requires a few edits across simulation experiments but tends to stay relatively unchanged within the same experiment setup.

The filter function is optional. It allows the analyzer to only analyze a subset of simulations in an experiment by filtering based on simulation tags. For example, the user can request the analyzer only analyze simulations where the tag SMC_Coverage has the value 1 (if this is a tag the user specified when building their experiment), or simulations that succeeded. This functionality can be useful when debugging large experiments. If the filter function is not specified, then all simulations are targeted for analysis.

The map is a custom function applied to extract data from the EMOD output file and needs to be modified the most across different EMOD output file types and according to the user’s needs.

Finally, the reduce checks the extracted data, aggregates data from multiple simulations in the same experiment, and then saves or plots the data. The checking of the simulation stays mostly the same across simulations while the part of processing the simulation data is highly variable across projects and depends on the desired results.

class InsetChartAnalyzer(IAnalyzer): ...
    # 1 - Definition of custom parameters and object attributes
    def __init__(self, ...)
    # Optional
    def filter(self, simulation):
        ...
    # 2 - Extract and select data from the json output files
    def map(self, data, simulation):
        ...
    # 3 - Check the extracted data and then save or plot the data
    def reduce(self, all_data):
        ...

More information on the general structure and functioning of analyzers that work with EMOD output sis available in the idmtools documentation

Analyze InsetChart

The InsetChartAnalyzer is used to explain the Analyzer structure in detail.

Setup analyzer class & define variables

You don’t need to understand the python fundamentals in depth, just what each line does and what to modify. - The first two lines including the __init__(self,...) and super is required in each analyzer class. Be sure to update the analyzer name in super. - The second line filenames=["output/InsetChart.json"] defines the EMOD output file to be analyzed. It is written in a list so that analyzers have the capability of combining data from multiple files. Generally we use 1 output file per analyzer. - The next lines that start with self attach each argument that the user has passed to the analyzer (expt_name, etc) to the analyzer class via self. This allows easy access to any of these values from any analyzer function via the self object. - This analyzer allows the user to specify the parameters expt_name, sweep_variables , channels, and start_year. Generally we use expt_name and sweep_variables across all analyzers we write, while the others are specific to this particular analyzer. All these requested parameters can be modified or extended with additional parameters if needed, according to the user’s needs. - These parameters allow the analyzer to take in experiment specific values, for instance simulation start_year is used to convert timesteps into date-time values, as we generally run EMOD in simulation time instead of calendar time. - The expt_name parameter lets the user specify the name of the experiment. We often use the experiment name in the file names of outputs from the analyzer, for example aggregated csv’s and figures. - The sweep_variables parameter is a list of simulation tags from the experiment that the user would like attached to each simulation. For example, Run_Number to track the random seed, or SMC_Coverage if the experiment sweeps over SMC coverage. - The channels is an optional parameter as it takes default values if not specified. It is included in this analyzer so the user has the flexibility to extract different channels from InsetChart.json if needed. If the same channels are always used, one could instead hard code the desired channel names into self.channels and remove the optional argument.

class MonthlyInsetChartAnalyzer(IAnalyzer):
    def __init__(self, expt_name, sweep_variables=None, channels=None, working_dir=".", start_year=2022):
        super(MonthlyInsetChartAnalyzer, self).__init__(working_dir=working_dir, filenames=["output/InsetChart.json"])
        self.sweep_variables = sweep_variables or ["Run_Number"]
        self.channels = channels or  ['Statistical Population', 'New Clinical Cases', 'New Severe Cases', 'PfHRP2 Prevalence']
        self.expt_name = expt_name
        self.start_year = start_year

Map simulation data

The map is a custom function that will change the most when adapting an analyzer to different EMOD outputs. Do not change the function definition (the line beginning def map()).

EMOD output from the requested output file(s) is stored in data. The first activity of map() is therefore to extract the desired data out of data. data is a dictionary where the keys are the filenames stored in self.filenames and the values are the content of each file.

In this example using InsetChart.json, we read in the data from the json file, keeping only channels that have been specified in self.channels, and convert into a pandas dataframe. The dataframe will have one column for each channel, and each row is the channel value for each timestep in the simulation.

Next, we want to convert the timesteps (row number) into calendar dates. Those are the next 5 lines. We copy simdata.index into simdata['Time'] and create additional variables for Day, Month and Year that are easier to work with, as well as a date column that is a datetime.date object.

Finally, the sweep variables corresponding to the simulation tags of the experiment are attached, the dataframe is returned, and the returned dataframe is automatically passed on to the next and final step of the analyzer. It is not required to return a dataframe but it is required to return something: the data of interest from the simulation.

    def map(self, data, simulation):
        simdata = pd.DataFrame({x: data[self.filenames[0]]['Channels'][x]['Data'] for x in self.inset_channels})
        simdata['Time'] = simdata.index
        simdata['Day'] = simdata['Time'] % 365
        simdata['Month'] = simdata['Day'].apply(lambda x: self.monthparser((x + 1) % 365))
        simdata['Year'] = simdata['Time'].apply(lambda x: int(x / 365) + self.start_year)
        simdata['date'] = simdata.apply(lambda x: datetime.date(int(x['Year']), int(x['Month']), 1), axis=1)
        for sweep_var in self.sweep_variables:
            if sweep_var in simulation.tags.keys():
                simdata[sweep_var] = simulation.tags[sweep_var]
        return simdata

Reduce

This part checks the simulation data returned by map() and aggregates data across all simulations in the experiment into the adf dataframe. In this example, the analyzer saves results into the specified working_dir/expt_name subfolder. Other analyzers may use the reduce() function to plot and save a figure.

We typically do not modify the first 4 lines of reduce() (creation of selected and checking that it contains data). If map() returns a dataframe, then the adf = ... line can stay the same as well. Everything after that should be customized to the user’s needs.

    def map(self, all_data):
        selected = [data for sim, data in all_data.items()]
        if len(selected) == 0:
            print("No data have been returned... Exiting...")
            return
        adf = pd.concat(selected).reset_index(drop=True)
        if not os.path.exists(os.path.join(self.working_dir, self.expt_name)):
            os.mkdir(os.path.join(self.working_dir, self.expt_name))
        adf.to_csv(os.path.join(self.working_dir, self.expt_name, 'All_Age_Monthly_Cases.csv'), index=False)

Optional analyzer extensions and helper functions

For instance selecting only simulations with SMC_Coverage at 0.5:

    def filter(self, simulation):
        return simulation.tags["SMC_Coverage"] == 0.5

Helper function to convert months.

    @classmethod
    def monthparser(self, x):
        if x == 0:
            return 12
        else:
            return datetime.datetime.strptime(str(x), '%j').month