Bayesian Network Approach
1. Background & Introduction
Infectious diseases are a constant threat to public health. Being able to estimate the burden and quantify the risk with respect to location, timing, and intensity is crucial for effective disease management. Understanding the role of risk factors such as sociodemographic and environmental characteristics can contribute significantly to intervention planning and resource allocation. Predictive modeling techniques have the potential to leverage existing real-world data to overcome deficiencies in programmatic data and to extract causal dependencies and influencing factors. The outputs thus generated can predict the risk of disease emergence in areas where risk factors are prevalent, before the actual onset of an outbreak, enabling local authorities to mobilize efforts and resources in the right direction.
The EPCON model revolves around the creation of a digital twin, a digital representation of the real world. By combining real-world context and evidence with programmatic data, the platform generates an Epidemiological Twin Model that provides insights into the current situation of the disease and its progression in time and space in a real-time, continuously evolving world.
The Epidemiological Twin Model designed by EPCON uses the Mapping and Analysis for Tailored disease Control and Health system strengthening (MATCH) framework, developed by the Centre for Applied Spatial Epidemiology of KIT Royal Tropical Institute, Amsterdam, as its reference. MATCH is a well-recognized framework outlining the use of Geographic Information Systems (GIS) to integrate routine health surveillance data with other data sources to inform health policy and planning.
2. Technology
The EPCON team developed a platform and technology framework enabling the generation of environmental and epidemiological twin models and the production of outputs in the desired format. The platform makes use of open-source components where possible to allow ease of adoption, sustainability, and a low cost of ownership, and is built according to the Principles for Digital Development. The Artificial Intelligence and Machine Learning components are designed using Bayes' theorem, are adaptive in nature, and support the distribution of intelligence at the edge, making the platform future-proof, agile, interoperable, and easy to integrate within an existing environment.
2.1 Artificial Intelligence and Bayesian Network Models
Infectious diseases are often multifactorial and difficult to predict. Being able to detect a minor change in a small population group and predict the disease trajectory for a larger population can help to plan effectively and execute a timely response. Artificial intelligence (AI) allows machines to react to the inputs they receive by performing cognitive functions. Thus, AI can allow us to analyze and interpret large amounts of health-related data to predict disease progression and constantly adapt the outputs based on new data or learnings.
2.2 Justification for using Bayesian Network Modelling
Bayesian networks provide a powerful Artificial Intelligence (AI) technology for probabilistic reasoning and statistical inference that can be used to reason under uncertainty in complex environments - for example, in this project, where the underlying data (programmatic, socio-economic, and spatial-environmental datasets) contain multiple variables with many hidden relationships, large amounts of natural variation, measurement errors, or missing values2-5. We used a machine learning (ML) engine to perform Bayesian learning (section 2.3) and Bayesian reasoning (section 2.4), which allows inference and predictive what-if queries on newly observed variables based on prior learning. A Bayesian network contains a set of predictor variables (represented as nodes), regardless of previously known associations with an outcome variable2-5. The links between the nodes represent informational or causal dependencies among the variables. These dependencies are given in terms of the conditional probabilities of the states a node can take given the values of its parent nodes2-5. Each probability reflects a degree of belief, and the degrees of belief encode causal dependencies: the degree of belief in any cause of a given effect increases when the effect is observed, but decreases when some other cause is found to be responsible for the observed effect. Causal dependencies can be derived from the knowledge of domain experts, or by mining the structure of the model from data using unsupervised learning2-5.
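To make the structure concrete, the snippet below is a minimal sketch of a small discrete Bayesian network, assuming the open-source pgmpy library (the platform's actual ML engine is not shown here); the node names and probabilities are purely illustrative, not the model variables used by EPCON.

```python
# A minimal sketch, assuming the open-source pgmpy library; the node names
# (e.g. "Poverty", "PopulationDensity", "TBRisk") and probabilities are
# hypothetical illustrations, not the actual EPCON model variables.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Two predictor nodes with informational/causal links to one outcome node.
model = BayesianNetwork([("Poverty", "TBRisk"), ("PopulationDensity", "TBRisk")])

# Conditional probability tables encode degrees of belief in each state,
# given the states of the parent nodes.
cpd_pov = TabularCPD("Poverty", 2, [[0.7], [0.3]])            # P(low), P(high)
cpd_den = TabularCPD("PopulationDensity", 2, [[0.6], [0.4]])  # P(low), P(high)
cpd_tb = TabularCPD(
    "TBRisk", 2,
    # P(TBRisk = low | parents) and P(TBRisk = high | parents)
    # for the four combinations of parent states.
    [[0.95, 0.80, 0.70, 0.40],
     [0.05, 0.20, 0.30, 0.60]],
    evidence=["Poverty", "PopulationDensity"], evidence_card=[2, 2],
)
model.add_cpds(cpd_pov, cpd_den, cpd_tb)
assert model.check_model()

# Predictive "what-if" query: belief in the outcome given observed evidence.
posterior = VariableElimination(model).query(["TBRisk"], evidence={"Poverty": 1})
print(posterior)
```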
2.3 Bayesian Learning
Bayesian learning can be viewed as finding the local maxima on the likelihood surface defined by the Bayesian network variables. Assume that the network must be trained from D, a set of data cases D1, ..., Dm generated independently from some underlying distribution. In each data case, values are given for some subset of the variables; this subset may differ from case to case – in other words, the data can be incomplete. During Bayesian Learning, the parameters ω of the conditional probability matrices that best model the data are calculated. Our ML engine uses unsupervised Bayesian Learning to mine hidden relationships from the training sets using a Hybrid Genetic Algorithm (HGA)6,7.
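In standard notation (restating the paragraph above, not the proprietary HGA itself), with $\omega$ denoting the conditional-probability parameters, the learning step searches the log-likelihood surface

$\log L(\omega; D) = \sum_{j=1}^{m} \log P(D_j \mid \omega), \qquad \omega^{*} = \arg\max_{\omega} \log L(\omega; D),$

where, for an incomplete data case $D_j$, $P(D_j \mid \omega)$ is obtained by marginalizing over the unobserved variables.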
2.4 Bayesian Reasoning
Bayesian Belief Propagation is the process of finding the most probable explanation (MPE) in the presence of evidence. Following network structure generation, Bayesian inference can be performed to predict unknown variables in the network based on the states of observed nodes using Bayesian reasoning techniques. An algorithm, patented in South Africa and published by the World Intellectual Property Organization (WIPO), is used to resolve queries in cyclical networks8.
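As an illustration of an MPE-style query, the fragment below continues the hypothetical pgmpy sketch from section 2.2; it uses standard variable elimination on an acyclic network and does not reproduce the patented algorithm for cyclical networks.

```python
# Continuing the hypothetical pgmpy sketch above: a most-probable-explanation
# style query using standard variable elimination on an acyclic network.
from pgmpy.inference import VariableElimination

infer = VariableElimination(model)
# Most probable joint assignment of the unobserved nodes, given the evidence.
mpe = infer.map_query(variables=["TBRisk", "Poverty"],
                      evidence={"PopulationDensity": 1})
print(mpe)  # e.g. {'TBRisk': 1, 'Poverty': 0}
```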
2.5 Geospatial Prediction Stack Manager and data pipelines
The Geospatial Prediction Stack Manager (GPSM) is an interoperable data management and workflow tool designed to manage the collection and processing of temporal-geospatial data as well as the configuration and training of Bayes model(s). The GPSM schedules and performs temporal-geospatial data collection; it enriches the collected data (using self-organizing maps (SOM), naive Bayes classification, gap-filling, etc.) and compiles it into training and input sets. The GPSM facilitates the training and querying of Bayes, naive Bayes, linear regression, and SOM models. Scheduling and collection are multi-threaded via RabbitMQ. The GPSM stores the data as PostGIS tables and triggers Bayesian training and inference. The GPSM can collect any type of temporal-geospatial data provided that it is identified by a geometry (point or polygon) and, if there is a temporal aspect, a date. The GPSM most efficiently collects data from existing PostGIS tables but can also be configured to use custom Python collector scripts, as sketched below.
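The fragment below is a minimal sketch of what a custom temporal-geospatial collector could look like, assuming psycopg2 and a PostGIS backend; the table, columns, and connection string are hypothetical and do not describe the actual GPSM collector interface.

```python
# Minimal sketch of a custom temporal-geospatial collector, assuming psycopg2
# and PostGIS. The table name, columns and connection string are hypothetical;
# the actual GPSM collector interface is not documented here.
import psycopg2

def collect_and_store(records, dsn="dbname=gpsm user=gpsm"):
    """records: iterable of (wkt_geometry, observation_date, value) tuples."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for wkt, obs_date, value in records:
            cur.execute(
                """
                INSERT INTO collected_observations (geom, obs_date, value)
                VALUES (ST_GeomFromText(%s, 4326), %s, %s)
                """,
                (wkt, obs_date, value),
            )

# Example: a point observation identified by a geometry and a date,
# as required for GPSM collection.
collect_and_store([("POINT(67.01 24.86)", "2021-06-01", 12.5)])
```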
2.6 Spatial-Temporal Environmental Features
TB transmission often occurs person-to-person within a household or small community, because prolonged duration of contact is typically required for infection to occur9, as well as through reactivation of latent infection in groups of people with shared risk factors9,10,11. Spatial-environmental analysis has been promoted for targeted TB control and the intensified use of existing TB control tools12,13.
EPCON uses self-organizing maps (SOM) to extract patterns or features from multiple sources of spatial-environmental data that have been shown to influence the incidence of tuberculosis. SOMs use unsupervised, simple competitive learning based on the Euclidean distance between the nodes and the input vector to incrementally adjust the weights of the nearest matching node and its neighborhood. In this way, nodes in proximity represent frequent, similar yet distinctive patterns, while unusual patterns are still accounted for14,15.
In essence, SOMs help to create a contextual profile value for a given location using multiple sources that may influence the desired output (economic indicators, population movement, disease prevalence, and others). EPCON uses the SOM output value as an input variable for the Bayesian learning and reasoning process, as sketched below.
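The snippet below is a minimal numpy sketch of the competitive-learning update described above and of deriving a per-location profile value from the best-matching node; the grid size, learning rate, neighborhood width, and input features are illustrative assumptions, not the EPCON configuration.

```python
# Minimal numpy sketch of a SOM training step. Grid size, learning rate,
# neighborhood width and the input features are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, n_features = 5, 5, 4              # 5x5 map, 4 spatial-environmental features
weights = rng.random((grid_h, grid_w, n_features))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

def train(weights, data, epochs=20, lr=0.5, sigma=1.5):
    for _ in range(epochs):
        for x in data:
            # Best-matching unit: node whose weight vector is closest (Euclidean) to x.
            bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)),
                                   (grid_h, grid_w))
            # Gaussian neighborhood on the map grid around the best-matching unit.
            h = np.exp(-np.sum((coords - np.array(bmu)) ** 2, axis=-1) / (2 * sigma ** 2))[..., None]
            # Move the best-matching node and its neighbors incrementally toward the input vector.
            weights += lr * h * (x - weights)
    return weights

def profile_value(weights, x):
    """Contextual profile value for a location: index of its best-matching node."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=-1)))

locations = rng.random((100, n_features))         # e.g. normalized economic, movement and prevalence indicators
weights = train(weights, locations)
print(profile_value(weights, locations[0]))       # value fed to the Bayesian model as an input variable
```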
2.7 Thiessen Polygons
Thiessen polygons have multiple applications, and the technique has been used in many different fields, such as health, engineering, informatics, the natural sciences, and geometry, to name a few. Thiessen polygons are technically part of a Voronoi diagram, which is a partition of a plane into regions close to each of a given set of objects. In the simplest case, these objects are finitely many points in the plane (called seeds, sites, or generators). For each seed there is a corresponding region, called a Voronoi cell, consisting of all points of the plane closer to that seed than to any other. Voronoi cells are also known as Thiessen polygons. In epidemiology, Voronoi diagrams can be used to correlate sources of infection in epidemics; they were famously used by John Snow to study the 1854 Broad Street cholera outbreak in Soho, England. The Thiessen polygons created for the purpose of this project are generated around population clusters, to contain an approximate population of 10,000 individuals where possible. Usual administrative units such as the tehsil are too large for mobilizing field workers for case-finding activities. Thiessen polygons provide finer detail about the demographics and other characteristics of an area, which can easily be tracked and monitored for case-finding activities.
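The snippet below is a minimal sketch of generating Thiessen (Voronoi) polygons around hypothetical population-cluster centroids, assuming the shapely library; the clustering step that targets approximately 10,000 people per polygon is not shown.

```python
# Minimal sketch of generating Thiessen (Voronoi) polygons around
# population-cluster centroids, assuming the shapely library. The points and
# bounding box are hypothetical; the population-clustering step is not shown.
from shapely.geometry import MultiPoint, box
from shapely.ops import voronoi_diagram

# Hypothetical cluster centroids (lon, lat) and a bounding study area.
seeds = MultiPoint([(67.00, 24.80), (67.10, 24.90), (67.05, 24.95), (66.95, 24.85)])
study_area = box(66.90, 24.75, 67.20, 25.00)

cells = voronoi_diagram(seeds, envelope=study_area)
# Clip each Thiessen polygon to the study area for mapping and monitoring.
thiessen = [cell.intersection(study_area) for cell in cells.geoms]
for poly in thiessen:
    print(poly.wkt)
```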
3. Validation and QA
3.1 Credible Intervals
Model output is obtained as inter-percentile ranges with an associated probability for each range, summing to 1 over all ranges. The probability density distribution is estimated by calculating the height of the probability distribution within each range. By discretizing the probability values over all ranges, we estimate the positions of the 2.5th and 97.5th percentiles to obtain the 95% Credible Interval for model predictions. We measure the percentage of observed values falling within the model-predicted credible interval.
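The snippet below is a minimal numpy sketch of this procedure, using hypothetical range boundaries, probabilities, and observations: it interpolates the 2.5th and 97.5th percentiles from the discretized distribution and measures the share of observed values falling inside the resulting interval.

```python
# Minimal sketch of deriving a 95% credible interval from discretized model output
# (inter-percentile ranges with probabilities summing to 1) and measuring coverage.
# The range boundaries, probabilities and observations below are hypothetical.
import numpy as np

edges = np.array([0, 5, 10, 20, 40, 80])            # boundaries of the output ranges (e.g. case counts)
probs = np.array([0.05, 0.20, 0.40, 0.25, 0.10])    # probability mass per range; sums to 1

cdf = np.concatenate([[0.0], np.cumsum(probs)])     # cumulative probability at each boundary
# Interpolate the 2.5th and 97.5th percentiles within the discretized distribution.
lower, upper = np.interp([0.025, 0.975], cdf, edges)
print(f"95% credible interval: [{lower:.2f}, {upper:.2f}]")

observed = np.array([7, 12, 1, 35, 85, 18])         # hypothetical observed values
coverage = np.mean((observed >= lower) & (observed <= upper))
print(f"Share of observed values inside the credible interval: {coverage:.0%}")
```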