
Bayesian Network Approach

1. Background & Introduction


Infectious diseases are a constant threat to public health. Estimating the disease burden and quantifying risk with respect to location, timing, and intensity is crucial for effective disease management. Understanding the role of risk factors such as sociodemographic and environmental characteristics can contribute significantly to intervention planning and resource allocation. Predictive modeling techniques can leverage existing real-world data to overcome deficiencies in programmatic data and to extract causal dependencies and influencing factors. The resulting outputs can predict the risk of disease emergence in areas where risk factors are prevalent, before the actual onset of an outbreak, enabling local authorities to mobilize efforts and resources in the right direction. 


The EPCON model revolves around the creation of a digital twin, a digital representation of the real world. By combining real-world context and evidence with programmatic data, the platform generates an Epidemiological Twin Model that provides insights into the present situation of the disease and its progression in time and space in a real-time, continuously evolving world. 


The Epidemiological Twin Model designed by EPCON uses the Mapping and Analysis for Tailored disease Control and Health system strengthening (MATCH) framework, developed by the Centre for Applied Spatial Epidemiology of KIT Royal Tropical Institute, Amsterdam, as its reference. MATCH is a well-recognized framework outlining the use of Geographic Information Systems (GIS) to integrate routine health surveillance data with other data sources to inform health policy and planning. 


2. Technology


The EPCON team developed a platform and technology framework enabling the generation of environmental and epidemiological twin models and the generation of output in the desired format. The platform makes use of open-source components where possible to allow ease of adoption, sustainability, and low cost of ownership, and is built using the Principles for Digital Development. The Artificial Intelligence and Machine Learning components are designed using Bayes' theorem, are adaptive in nature, and support the distribution of intelligence at the edge, making the platform future proof, agile, interoperable, and able to integrate easily within an existing environment.


2.1 Artificial Intelligence and Bayesian Network Models 


Infectious diseases are often multifactorial and difficult to predict. Being able to detect a minor change in a small population group and to predict the disease trajectory for a larger population can help to plan effectively and execute a timely response. Artificial intelligence (AI) allows machines to react to the inputs they receive by performing cognitive functions. Thus, AI can allow us to analyze and interpret large amounts of health-related data to predict disease progression and continuously adapt the outputs based on new data or learnings.


2.2 Justification for using Bayesian Network Modelling 


Bayesian networks provide a powerful Artificial Intelligence (AI) technology for probabilistic reasoning and statistical inference that can be used to reason with uncertainty in complex environments - for example in this project, where the underlying programmatic, socio-economic, and spatial-environmental datasets contain multiple variables with many hidden relationships, large amounts of natural variation, measurement errors, and missing values2-5. We used a machine learning (ML) engine to perform Bayesian learning (section 2.3) and Bayesian reasoning (section 2.4), which allows inference and predictive what-if queries on newly observed variables based on prior learning. A Bayesian network contains a set of predictor variables (represented as nodes), regardless of previously known associations with an outcome variable2-5. The links between the nodes represent informational or causal dependencies among the variables. These dependencies are expressed as conditional probabilities of the states a node can take given the values of its parent nodes2-5. Each probability reflects a degree of belief, and degrees of belief encode causal dependencies: the degree of belief in any cause of a given effect increases when the effect is observed, but decreases when some other cause is found to be responsible for the observed effect. Causal dependencies can be derived from the knowledge of domain experts, or by mining the structure of the model from data using unsupervised learning2-5. 
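The belief update described above can be sketched on a minimal two-node network. The node names and probabilities below are purely illustrative, not taken from the EPCON model; the sketch only shows how observing an effect raises the degree of belief in a cause via Bayes' theorem.

```python
# Minimal two-node network: RiskFactor -> Outbreak.
# Prior belief over the parent node (illustrative values).
p_risk = {"high": 0.3, "low": 0.7}

# Conditional probability table: P(outbreak | risk)
p_outbreak_given_risk = {
    "high": {"yes": 0.6, "no": 0.4},
    "low":  {"yes": 0.1, "no": 0.9},
}

# Marginal probability of an outbreak (belief before any evidence)
p_outbreak = sum(p_risk[r] * p_outbreak_given_risk[r]["yes"] for r in p_risk)

# Observing an outbreak updates the degree of belief in each cause:
# P(risk | outbreak) = P(outbreak | risk) * P(risk) / P(outbreak)
posterior = {
    r: p_risk[r] * p_outbreak_given_risk[r]["yes"] / p_outbreak
    for r in p_risk
}
```

Here observing the outbreak raises the belief that the risk factor is "high" from the prior 0.3 to a posterior of 0.72, exactly the degree-of-belief update the text describes.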


2.3 Bayesian Learning


Bayesian learning can be viewed as finding the local maxima on the likelihood surface defined by the Bayesian network variables. Assume that the network must be trained from D, a set of data cases D1, ..., Dm generated independently from some underlying distribution. In each data case, values are given for some subset of the variables; this subset may differ from case to case – in other words, the data can be incomplete. During Bayesian Learning, the parameters ω of the conditional probability matrices that best model the data are calculated. Our ML engine uses unsupervised Bayesian Learning to mine hidden relationships from the training sets using a Hybrid Genetic Algorithm (HGA)6,7.  
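The parameter-estimation step can be illustrated in its simplest form: given complete data cases, the conditional probability table P(child | parent) that best models the data is obtained by counting. This is a simplified sketch under the assumption of complete cases; EPCON's engine uses a Hybrid Genetic Algorithm and handles incomplete data, which this sketch does not show.

```python
# Estimate P(outbreak | risk) by maximum likelihood from complete
# data cases (illustrative records, not project data).
from collections import Counter

cases = [
    {"risk": "high", "outbreak": "yes"},
    {"risk": "high", "outbreak": "no"},
    {"risk": "high", "outbreak": "yes"},
    {"risk": "low",  "outbreak": "no"},
    {"risk": "low",  "outbreak": "no"},
    {"risk": "low",  "outbreak": "yes"},
]

pair_counts = Counter((c["risk"], c["outbreak"]) for c in cases)
parent_counts = Counter(c["risk"] for c in cases)

# The learned parameters of the conditional probability matrix
cpt = {
    (r, o): pair_counts[(r, o)] / parent_counts[r]
    for (r, o) in pair_counts
}
```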


2.4 Bayesian Reasoning


Bayesian Belief Propagation is the process of finding the most probable explanation (MPE) in the presence of evidence. Following network structure generation, Bayesian inference can be performed to predict unknown variables in the network based on the states of observed nodes using Bayesian reasoning techniques. An algorithm, patented in South Africa and published by the World Intellectual Property Organization (WIPO), is used to resolve queries in cyclical networks.8
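The MPE idea can be sketched by exhaustive enumeration over a tiny two-node network with illustrative probabilities: fix the observed evidence and pick the joint assignment with the highest probability. This is only a toy illustration; the patented algorithm referenced above is what handles queries in cyclical networks.

```python
# Most probable explanation (MPE) by enumeration on a two-node
# network (RiskFactor -> Outbreak); probabilities are illustrative.
p_risk = {"high": 0.3, "low": 0.7}
p_outbreak_given_risk = {
    "high": {"yes": 0.6, "no": 0.4},
    "low":  {"yes": 0.1, "no": 0.9},
}

def joint(risk, outbreak):
    """Joint probability of one full assignment of the network."""
    return p_risk[risk] * p_outbreak_given_risk[risk][outbreak]

# Evidence: an outbreak is observed. Enumerate the assignments
# consistent with the evidence and keep the most probable one.
candidates = [(r, "yes") for r in p_risk]
mpe = max(candidates, key=lambda assign: joint(*assign))
```

With these numbers the MPE is ("high", "yes"): a high risk factor is the most probable explanation for the observed outbreak.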

2.5 Geospatial Prediction Stack Manager and data pipelines 


The Geospatial Prediction Stack Manager (GPSM) is an interoperable data management and workflow tool designed to manage the collection and processing of temporal-geospatial data as well as the configuration and training of Bayes model(s). The GPSM schedules and performs temporal-geospatial data collection; it enriches the collected data (using self-organizing maps (SOM), naive Bayes classification, gap-filling, etc.) and compiles it into training and input sets. The GPSM facilitates the training and querying of Bayes, naive Bayes, linear regression, and SOM models. Scheduling and collection are multi-threaded via RabbitMQ. The GPSM stores the data as PostGIS tables and triggers Bayesian training and inference. The GPSM can collect any type of temporal-geospatial data provided that it is identified by a geometry (point or polygon) and a date (if there is a temporal aspect). The GPSM most efficiently collects data from existing PostGIS tables but can also be configured to use custom Python collector scripts.
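The shape of the data a collector script produces can be sketched as follows. The function name, arguments, and record layout here are hypothetical (the actual GPSM collector interface is not documented in this section); the sketch only illustrates the constraint stated above, that every record is identified by a geometry and, where relevant, a date.

```python
# Hypothetical collector sketch: each record carries a geometry
# (WKT string), a date, and a value. Not the real GPSM interface.
from datetime import date

def collect_rainfall(region_polygons):
    """Return temporal-geospatial records as (geometry_wkt, day, value)."""
    records = []
    for wkt in region_polygons:
        # A real collector would query an external source (an API,
        # a PostGIS table, a raster) for each polygon here; we emit
        # a placeholder value to show the record shape.
        records.append((wkt, date(2021, 1, 1), 0.0))
    return records

rows = collect_rainfall(["POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))"])
```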


2.6 Spatial-Temporal Environmental Features 


TB transmission often occurs person-to-person within a household or small community, because prolonged contact is typically required for infection to occur9, as well as through reactivation of latent infection in groups of people with shared risk factors9,10,11. Spatial-environmental analysis has been promoted for targeted TB control and intensified use of existing TB control tools12,13. 


EPCON uses self-organizing maps (SOM) to extract patterns or features from multiple sources of spatial-environmental data that have been shown to influence the incidence of tuberculosis. SOMs use unsupervised, simple competitive learning based on the Euclidean distance between nodes and the input vector to incrementally adjust the weights of the nearest matching node and its neighborhood. In this way, nodes in proximity represent frequent, similar yet distinctive patterns while allowing for unusual patterns to be accounted for14,15. 
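One training step of this competitive learning scheme can be sketched in a few lines: find the best-matching node by Euclidean distance, then nudge that node and its grid neighborhood toward the input vector. Grid size, learning rate, and radius below are illustrative.

```python
# One SOM competitive-learning step on a small 4x4 map.
import math
import random

random.seed(0)
GRID = 4                      # 4x4 map of nodes
DIM = 3                       # input vector dimensionality
weights = [[[random.random() for _ in range(DIM)]
            for _ in range(GRID)] for _ in range(GRID)]

def train_step(x, lr=0.5, radius=1):
    # 1. Best-matching unit: node with smallest Euclidean distance to x
    bmu = min(((i, j) for i in range(GRID) for j in range(GRID)),
              key=lambda ij: math.dist(weights[ij[0]][ij[1]], x))
    # 2. Move the BMU and its grid neighborhood toward the input
    for i in range(GRID):
        for j in range(GRID):
            if abs(i - bmu[0]) <= radius and abs(j - bmu[1]) <= radius:
                w = weights[i][j]
                for d in range(DIM):
                    w[d] += lr * (x[d] - w[d])
    return bmu

bmu = train_step([1.0, 1.0, 1.0])  # one update toward an input vector
```

Repeated presentation of input vectors pulls nearby nodes toward frequent patterns, which is what lets the map encode similar-yet-distinctive profiles while still leaving nodes free to represent unusual ones.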


In essence, SOMs help to create a contextual profile value for a given location using multiple sources that may influence the desired output (economic indicators, population movement, disease prevalence, and others). EPCON uses the SOM output value as an input variable for the Bayesian learning and reasoning process. 


2.7 Thiessen Polygons


Thiessen polygons have multiple applications, and the technique has been used in many different fields, such as health, engineering, informatics, the natural sciences, and geometry, to name a few. They are technically part of a Voronoi diagram, which is a partition of a plane into regions close to each of a given set of objects. In the simplest case, these objects are finitely many points in the plane (called seeds, sites, or generators). For each seed there is a corresponding region, called a Voronoi cell, consisting of all points of the plane closer to that seed than to any other. Voronoi cells are also known as Thiessen polygons. In epidemiology, Voronoi diagrams can be used to correlate sources of infection in epidemics; John Snow famously used one to study the 1854 Broad Street cholera outbreak in Soho, England. The Thiessen polygons created for this project are generated around population clusters so as to contain an approximate population of 10,000 individuals each where possible. Usual administrative units such as the Tehsil are too large for mobilizing field workers for case finding activities. Thiessen polygons provide finer detail about the demographics and other characteristics of an area, which can easily be tracked and monitored for case finding activities.
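The defining property of a Thiessen polygon, that it contains every point closer to its seed than to any other seed, can be sketched directly. The seed names and coordinates below are illustrative, standing in for population-cluster centroids.

```python
# Assign a location to the Thiessen polygon of its nearest seed.
import math

# Illustrative population-cluster centroids (the seeds)
seeds = {
    "cluster_a": (0.0, 0.0),
    "cluster_b": (4.0, 0.0),
    "cluster_c": (2.0, 3.0),
}

def thiessen_cell(point):
    """Return the seed whose Thiessen polygon contains `point`."""
    return min(seeds, key=lambda s: math.dist(seeds[s], point))

# A household at (1, 0.5) is closest to cluster_a, so it falls in
# cluster_a's polygon
cell = thiessen_cell((1.0, 0.5))
```

In practice the polygon boundaries themselves would be computed with a GIS or computational-geometry library; the nearest-seed rule above is the property those boundaries encode.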


3. Validation and QA


3.1 Credible Intervals


Model output is obtained as inter-percentile ranges, each with an associated probability, summing to 1 over all ranges. The probability density distribution is estimated by calculating the height of the probability distribution within each range. By discretizing the probability values over all ranges, we estimate the positions of the 2.5th and 97.5th percentiles to obtain the 95% credible interval for model predictions. We measure the percentage of observed values falling within the model's predicted credible interval.
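Recovering the interval endpoints from such discretized output can be sketched as follows; the ranges and probabilities are illustrative, and linear interpolation within each range is an assumption of this sketch.

```python
# 95% credible interval from discretized inter-percentile ranges.
# Each entry is (low, high, probability); probabilities sum to 1.
ranges = [(0, 10, 0.05), (10, 20, 0.25), (20, 30, 0.40),
          (30, 40, 0.25), (40, 50, 0.05)]

def percentile(p):
    """Locate percentile p (0..1) by accumulating range probabilities
    and interpolating linearly within the range where p falls."""
    cum = 0.0
    for low, high, prob in ranges:
        if cum + prob >= p:
            return low + (high - low) * (p - cum) / prob
        cum += prob
    return ranges[-1][1]

# Endpoints of the 95% credible interval
ci_95 = (percentile(0.025), percentile(0.975))
```

Validation then counts how often the observed value lands inside `ci_95`; for a well-calibrated model this should approach 95% of cases.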


3.2 Mean Squared Error (MSE), Root Mean Squared Error (RMSE)


MSE measures the average squared difference between predicted and observed values and indicates how biased the model is, with values close to zero providing an indication that the model is unbiased. RMSE, the square root of MSE, expresses this error in the same units as the predicted quantity.
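The computation itself is a one-liner over paired observed and predicted values; the numbers below are illustrative.

```python
# MSE and RMSE between observed indicators and model predictions.
import math

observed  = [12.0, 15.0, 9.0, 20.0]   # illustrative values
predicted = [10.0, 16.0, 9.0, 18.0]

mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
rmse = math.sqrt(mse)   # same units as the observed indicator
```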


3.3 Retrospective Validation


Retrospective validation is done by comparing model outputs from two weeks prior to the indicators calculated from new chest camp data. The purpose of this is to validate the model’s output on data it has not been trained on. A retrospective error that decreases over time indicates that the model represents the factors which influence TB in that area well. Conversely, a retrospective error that increases over time indicates insufficient training data to date or poor model performance. 
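The comparison described above can be sketched as an error series: each new batch of chest-camp indicators is compared with the prediction issued two weeks earlier, and the series is inspected for a downward trend. All values below are illustrative.

```python
# Retrospective validation sketch: compare two-week-prior predictions
# with newly observed chest-camp indicators (illustrative values).
predictions_two_weeks_prior = [30.0, 28.0, 26.0]
new_chest_camp_indicators   = [36.0, 31.0, 27.0]

# Absolute retrospective error per validation round
errors = [abs(p - o) for p, o in
          zip(predictions_two_weeks_prior, new_chest_camp_indicators)]

# A strictly decreasing error series suggests the model captures the
# factors influencing TB in the area increasingly well
improving = all(a > b for a, b in zip(errors, errors[1:]))
```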

4. Strengths


Variable Selection: One of the benefits of using Bayesian networks is that the model identifies variables likely to contribute to an increased risk. These insights provide a deep understanding of the gaps in the digital and population-level data and usually result in a roadmap for improving and streamlining data and the use of real-world evidence. EPCON's engine uses collector agents that can easily consume data from a variety of sources (data points) and use them as contextual variables for model training, or simply for visualization of usage levels and other indicators. Including such variables and indicators over time will increase the dynamic aspect of the platform and stimulate usage across the cascade of care: each stakeholder can be provided with information relevant to the execution of his or her daily routine, ultimately improving the overall response. In addition, the GPSM, as outlined earlier, allows for ease of data modeling to fit the spatial and temporal resolutions defined by the program and model. This component allows for the inclusion of new variables and the scheduling of model training by in-country data scientists.


Closed Loop of Data: By design, the platform was developed to cater for a dynamic, adaptive, and incrementally learning environment. The idea behind this process is to use evidence generated in the field as input to generate, validate, and calibrate the model and its predictions. As interventions and the supply of services are executed successfully, the engine continuously learns what works and which areas are considered low risk. Ultimately, this model will help the NTP and its partners find the last missing patient and improve service levels across the different regions as a function of risk, capacity, and desired yield. 



  1. Global tuberculosis report 2020. Geneva: World Health Organization; 2020. Licence: CC BY-NC-SA 3.0 IGO.

  2. Pearl, J. Probabilistic reasoning in intelligent systems: networks of plausible inference.  (Elsevier, 2014).

  3. Popescul, A., Pennock, D. M. & Lawrence, S. in Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence.  437-444 (Morgan Kaufmann Publishers Inc.).

  4. Russell, S., Binder, J., Koller, D. & Kanazawa, K. Local learning in probabilistic networks with hidden variables. in IJCAI.  1146-1152.

  5. Potgieter, A. & Bishop, J. in Proceedings of the 2001 International Conference on Intelligent Agents, Web Technologies and Internet Commerce (IAWTIC.

  6. Osunmakinde, I. O. & Potgieter, A. in Proceedings of Southern African Telecommunications Networks and Applications Conference.  8-9.

  7. Osunmakinde, I. O. & Potgieter, A. in Proceedings of the 10th Southern African Telecommunications Networks and Applications International Conference (SATNAC), Mauritius.

  8. World Intellectual Property Organization. Patent (US20040158815): Complex adaptive systems.

  9. Verma, A. et al. Accuracy of prospective space–time surveillance in detecting tuberculosis transmission. Spatial and spatio-temporal epidemiology 8, 47-54 (2014).

  10. Haase, I. et al. Use of geographic and genotyping tools to characterise tuberculosis transmission in Montreal. The International Journal of Tuberculosis and Lung Disease 11, 632-638 (2007).

  11. Keshavjee, S., Dowdy, D. & Swaminathan, S. Stopping the body count: a comprehensive approach to move towards zero tuberculosis deaths. The Lancet 386, e46-e47 (2015).

  12. Yates, T. A. et al. The transmission of Mycobacterium tuberculosis in high burden settings. The Lancet infectious diseases 16, 227-238 (2016).

  13. Theron, G. et al. Data for action: collection and use of local data to end tuberculosis. The Lancet 386, 2324-2333 (2015).

  14. Kohonen, T. The self-organizing map. Proceedings of the IEEE 78, 1464-1480 (1990).

  15. Kohonen, T. Self-organized formation of topologically correct feature maps. Biological cybernetics 43, 59-69 (1982).

  16. Guo, C. et al. Spatiotemporal analysis of tuberculosis incidence and its associated factors in mainland China. Epidemiology & Infection 145, 2510-2519 (2017).

  17. Li, Q. et al. The spatio-temporal analysis of the incidence of tuberculosis and the associated factors in mainland China, 2009-2015. Infection, Genetics and Evolution 75, 103949 (2019).

  18. Li, X.-X. et al. Seasonal variations in notification of active tuberculosis cases in China, 2005–2012. PloS one 8, e68102 (2013).

  19. Xiao, Y. et al. The influence of meteorological factors on tuberculosis incidence in Southwest China from 2006 to 2015. Scientific reports 8 (2018).

  20. Glaziou, P., Sismanidis, C., Zignol, M. & Floyd, K. Methods used by WHO to estimate the global burden of TB disease. WHO: Geneva, Switzerland (2016).
