Understanding Bayesian Networks: A Comprehensive Guide

    By CamelEdge

    Updated on Thu Jul 25 2024


    A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). Each random variable encodes some aspect of the world for which we may have uncertainty. These random variables are usually denoted using capital letters such as:

    • T: Is it raining?
    • H: Heads or tails?
    • Q: How long will it take to reach the front of the queue?

    Random variables can be discrete or continuous. Each random variable is assigned a value from its domain. For example, H \in \{\text{heads, tails}\} and Q \in [0, \infty).


    Probability Basics

    We assign a probability value to each value a random variable takes. For example, we can assign a value of 0.5 to heads and 0.5 to tails. There are three types of probability values:

    Marginal Probability

    Marginal probability P(x) is the probability of a single event occurring without any consideration of other events. It is derived by summing or integrating over the possible values of the other random variables.

    Joint Probability

    Joint probability P(x, y) is the probability of two or more events occurring simultaneously. It is the probability of the intersection of events, representing the combined outcome of these events.

    Conditional Probability

    Conditional probability P(x|y) is the probability of an event occurring given that another event has already occurred.

    Probability Rules

    Probability rules allow us to establish relationships between different probabilities. For instance, we can relate marginal probabilities to joint probabilities. 🎲

    Law of Total Probability The law of total probability relates marginal and joint probabilities. It’s given by:

    P(X) = \sum_{i} P(X, Y_i)

    It states that the marginal probability of an event can be found by considering all possible ways that the event can occur jointly with the other variable.
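
    To make this concrete, here is a minimal Python sketch (with a made-up joint distribution over two variables) that recovers a marginal probability by summing the joint over the other variable:

```python
# A made-up joint distribution P(Weather, Temp) over two discrete variables.
joint = {
    ('sunny', 'hot'): 0.40,
    ('sunny', 'cold'): 0.10,
    ('rainy', 'hot'): 0.10,
    ('rainy', 'cold'): 0.40,
}

def marginal_weather(w):
    """Law of total probability: P(Weather = w) = sum over t of P(Weather = w, Temp = t)."""
    return sum(p for (weather, temp), p in joint.items() if weather == w)

print(marginal_weather('sunny'))  # 0.40 + 0.10 = 0.5
```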

    Product Rule The product rule relates joint probability to conditional probability. It states that the joint probability of two variables X and Y can be expressed as:

    P(X, Y) = P(X|Y)P(Y) = P(Y|X)P(X)

    Similarly, the conditional probability can be expressed as:

    P(X \mid Y) = \frac{P(X,Y)}{P(Y)}

    Bayes' Rule The product rule allows us to compute the conditional probability from the joint probability. But what if the joint is not available? In that case Bayes' rule comes to the rescue. It allows us to compute the conditional probability P(X \mid Y) from the inverse conditional P(Y \mid X). It is derived from the product rule and is expressed as:

    P(X \mid Y) = \frac{P(Y \mid X) P(X)}{P(Y)}
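
    As a quick illustration, here is a small sketch that applies Bayes' rule, obtaining P(Y) from the law of total probability. The numbers (a 1% prior, a 90% true-positive rate, a 5% false-positive rate) are made up for this example:

```python
# Hypothetical numbers: P(X) prior, P(Y|X), and P(Y|not X) for a diagnostic test.
p_x = 0.01              # P(X): prior probability of the condition
p_y_given_x = 0.90      # P(Y | X): test is positive given the condition
p_y_given_not_x = 0.05  # P(Y | not X): test is positive without the condition

# P(Y) via the law of total probability, then Bayes' rule for P(X | Y).
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))  # 0.154
```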

    Chain Rule The chain rule of probability allows us to express the joint probability of a sequence of events as the product of conditional probabilities. For a sequence of events X_1, X_2, \ldots, X_n, the chain rule is given by:

    P(X_1, X_2, \ldots, X_n) = P(X_1)P(X_2|X_1)P(X_3|X_1, X_2)\cdots P(X_n|X_1, X_2, \ldots, X_{n-1})

    It is obtained by applying the product rule repeatedly.

    Conditional Independence

    Conditional independence is a fundamental concept in probability theory and Bayesian networks. Two random variables X and Y are conditionally independent given a third random variable Z if the conditional probability distribution of X given Z is the same regardless of the value of Y. Formally, X and Y are conditionally independent given Z if:

    P(X, Y \mid Z) = P(X \mid Z) \cdot P(Y \mid Z)

    or equivalently,

    P(X \mid Y, Z) = P(X \mid Z)

    This implies that knowing the value of Y provides no additional information about X once we know the value of Z.
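
    The definition is easy to check numerically. The sketch below (with made-up numbers) builds a joint distribution in which X and Y are conditionally independent given Z, and verifies that P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z) holds for every combination of values:

```python
from itertools import product

# Made-up conditional probabilities: P(Z), P(X=true | Z), P(Y=true | Z).
p_z = {True: 0.3, False: 0.7}
p_x_given_z = {True: 0.9, False: 0.2}
p_y_given_z = {True: 0.6, False: 0.1}

def joint(x, y, z):
    """P(X=x, Y=y, Z=z) built under the conditional independence assumption."""
    px = p_x_given_z[z] if x else 1 - p_x_given_z[z]
    py = p_y_given_z[z] if y else 1 - p_y_given_z[z]
    return p_z[z] * px * py

for x, y, z in product([True, False], repeat=3):
    lhs = joint(x, y, z) / p_z[z]                       # P(X, Y | Z)
    rhs = ((p_x_given_z[z] if x else 1 - p_x_given_z[z])
           * (p_y_given_z[z] if y else 1 - p_y_given_z[z]))  # P(X | Z) * P(Y | Z)
    assert abs(lhs - rhs) < 1e-12
print("P(X, Y | Z) = P(X | Z) P(Y | Z) for all values")
```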


    A Simple Bayes' Net

    Let's create our first Bayesian network. Consider a simple example where we have three random variables: D, T_1, and T_2. Assume D represents having some disease, and T_1 and T_2 are two test results.

    The joint distribution of these three variables is given by the chain rule:

    P(D, T_1, T_2) = P(D) P(T_1|D) P(T_2|D, T_1)

    To simplify our model, we will make a conditional independence assumption that T_2 and T_1 are independent given D. This means:

    P(T_2|D, T_1) = P(T_2|D)
    [Figure: Bayesian network with D as the parent of both T_1 and T_2]

    Using this assumption, we can express the joint distribution as:

    P(D, T_1, T_2) = P(D) P(T_1|D) P(T_2|D)

    The Bayesian network corresponding to this assumption is shown in the figure above.
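
    Written as code, the factored model is just three small tables multiplied together. The probability values in this sketch are made up for illustration:

```python
# P(D), P(T1=positive | D), P(T2=positive | D); all numbers are illustrative.
p_d = {True: 0.01, False: 0.99}
p_t1_given_d = {True: 0.90, False: 0.05}
p_t2_given_d = {True: 0.80, False: 0.10}

def joint(d, t1, t2):
    """P(D, T1, T2) = P(D) * P(T1 | D) * P(T2 | D)."""
    pt1 = p_t1_given_d[d] if t1 else 1 - p_t1_given_d[d]
    pt2 = p_t2_given_d[d] if t2 else 1 - p_t2_given_d[d]
    return p_d[d] * pt1 * pt2

# Probability of having the disease and both tests coming back positive.
print(joint(True, True, True))  # 0.01 * 0.90 * 0.80 = 0.0072
```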


    Another Example

    Now let's look at an example that consists of five random variables: Burglary (B), Earthquake (E), Alarm (A), John Calls (J), and Mary Calls (M). Each of these variables represents an event that can either happen or not happen, and their relationships can be described using conditional dependencies.

    Structure of the Bayesian Network

    [Figure: Bayesian network with B and E as parents of A, and A as the parent of J and M]

    The Bayesian network for this scenario can be structured as follows:

    Burglary (B) and Earthquake (E): These two events are considered independent of each other. A burglary occurring does not influence the probability of an earthquake and vice versa.

    Alarm (A): The alarm depends on both the Burglary and Earthquake events. If either a burglary or an earthquake occurs, it can trigger the alarm.

    John Calls (J) and Mary Calls (M): John and Mary calling are dependent on whether the alarm has gone off. If the alarm rings, it increases the likelihood that John or Mary will call.

    The joint distribution of these variables can be represented as P(B, E, A, J, M). Using the chain rule for Bayesian networks (which incorporates the conditional independence assumptions implied by the Bayes net structure), we can express this joint distribution in terms of the conditional probabilities:

    P(B, E, A, J, M) = P(B) \cdot P(E) \cdot P(A \mid B, E) \cdot P(J \mid A) \cdot P(M \mid A)

    In a general Bayes Net, the joint probabilities are expressed as:

    P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))

    where \text{Parents}(X_i) denotes the set of parent nodes of X_i.
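
    As a sketch, this factorization is straightforward to evaluate in code: store each node's parents and CPT, then multiply the corresponding entries. The two-node rain/wet-grass network and its numbers below are made up for illustration:

```python
# Each node maps to (list of parents, CPT keyed by (value, parent_values)).
nodes = {
    'Rain':     ([], {(True, ()): 0.2, (False, ()): 0.8}),
    'WetGrass': (['Rain'], {(True, (True,)): 0.9, (False, (True,)): 0.1,
                            (True, (False,)): 0.1, (False, (False,)): 0.9}),
}

def joint_probability(assignment):
    """P(X1, ..., Xn) = product of P(Xi | Parents(Xi)) read from each node's CPT."""
    prob = 1.0
    for var, (parents, cpt) in nodes.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= cpt[(assignment[var], parent_values)]
    return prob

print(joint_probability({'Rain': True, 'WetGrass': True}))  # 0.2 * 0.9 = 0.18
```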

    Benefits of Using Bayesian Networks

    Bayesian networks offer significant advantages in probabilistic modeling and reasoning under uncertainty:

    Reduction in Parameter Complexity: The full joint distribution table of a set of variables requires d^n probability values, where d is the number of values in each variable's domain and n is the number of variables. In contrast, Bayesian networks significantly reduce the number of parameters required, depending on the network's structure and the implied conditional independence assumptions.

    Example Illustration: Let's break down the parameters in the example Bayesian network above:

    • Burglary (B): One parameter, P(b), representing the probability of a burglary occurring. The probability of a burglary not occurring follows as P(\neg b) = 1 - P(b).

    • Earthquake (E): One parameter, P(e), representing the probability of an earthquake occurring.

    • Alarm (A): P(A \mid B, E) requires 4 parameters because A depends on both B and E, one for each of the four combinations of their values. The other 4 probabilities can be obtained as complements of these 4.

    • John Calls (J): In the example, P(J \mid A) is described by 2 parameters, i.e. P(j \mid a) and P(j \mid \neg a). The other two probabilities can be obtained as follows: P(\neg j \mid a) = 1 - P(j \mid a) and P(\neg j \mid \neg a) = 1 - P(j \mid \neg a).

    • Mary Calls (M): Similarly, P(M \mid A) requires 2 parameters.

    Therefore, the total number of parameters in this Bayesian network example is 1 + 1 + 4 + 2 + 2 = 10. Each parameter captures a specific probability value that reflects the likelihood of events occurring based on the network structure and conditional dependencies. This structured approach reduces the complexity compared to a full joint probability table (which requires d^n values, i.e. 2^5 = 32 for five binary variables), making Bayesian networks more efficient and effective for probabilistic modeling and inference tasks.
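
    The count is easy to reproduce: with binary variables, each node contributes 2 raised to the number of its parents independent probabilities. A quick sketch for the alarm network:

```python
# Number of parents of each node in the burglary/alarm network.
num_parents = {'B': 0, 'E': 0, 'A': 2, 'J': 1, 'M': 1}

# Each binary node needs 2**(#parents) independent parameters.
bayes_net_params = sum(2 ** k for k in num_parents.values())
full_joint_entries = 2 ** len(num_parents)
print(bayes_net_params, full_joint_entries)  # 10 vs 32
```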

    Conditional Probability Tables (CPTs)

    Let's assume that we are provided with the conditional probabilities of all variables in the Bayesian network example above. The tables below specify, for each variable, the probability of the variable given its parent variables' states, adhering to the network structure and the provided parameters.

    B     | P(B)
    true  | 0.001
    false | 0.999

    E     | P(E)
    true  | 0.002
    false | 0.998

    B     | E     | P(a \mid B, E)
    true  | true  | 0.95
    true  | false | 0.94
    false | true  | 0.29
    false | false | 0.001

    A     | P(j \mid A)
    true  | 0.90
    false | 0.05

    A     | P(m \mid A)
    true  | 0.70
    false | 0.01

    Inference by Enumeration

    To find the posterior probability of, for example, burglary given that John and Mary have called, P(B \mid j, m), we can proceed as follows:

    1. Apply the Product Rule:

      P(B \mid j, m) = \frac{P(B, j, m)}{P(j, m)}
    2. Normalization: Since P(j, m) can be difficult to compute directly, we use the normalization trick:

      P(B \mid j, m) = \alpha P(B, j, m)

      where \alpha is a normalization constant, which will be computed later.

    3. Use the Law of Total Probability: We expand P(B, j, m) using the law of total probability over all possible values of the other variables in the network:

      P(B \mid j, m) = \alpha \sum_e \sum_a P(B, j, m, e, a)
    4. Apply the Bayesian Chain Rule: Next, we use the Bayesian chain rule to decompose P(B, j, m, e, a) into a product of conditional probabilities. Given the structure of the Bayesian network, we have:

      P(B, j, m, e, a) = P(B) \cdot P(e) \cdot P(a \mid B, e) \cdot P(j \mid a) \cdot P(m \mid a)

    Putting it all together:

    P(B \mid j, m) = \alpha \sum_e \sum_a P(B) \cdot P(e) \cdot P(a \mid B, e) \cdot P(j \mid a) \cdot P(m \mid a)

    By summing over all possible values of E (earthquake) and A (alarm), and then normalizing the result, we can compute P(B \mid j, m), the posterior probability of burglary given that John and Mary have called. The normalization constant \alpha ensures that the total probability sums to 1.

    Now, we can compute this using the given values. We'll calculate for B = \text{true} and B = \text{false} separately.

    P(b \mid j, m) = \alpha P(b) \cdot \sum_e \sum_a P(e) \cdot P(a \mid b, e) \cdot P(j \mid a) \cdot P(m \mid a)

    Plugging in the CPT values gives:

    P(b \mid j, m) = \alpha \cdot 0.000592

    P(\neg b \mid j, m) = \alpha \cdot 0.001492

    Normalize

    Now we need to normalize these probabilities to find \alpha, using the fact that P(b \mid j, m) + P(\neg b \mid j, m) = 1:

    \alpha = 1 / \left( P(b, j, m) + P(\neg b, j, m) \right)

    Substitute the values:

    \alpha = \frac{1}{0.000592 + 0.001492} = \frac{1}{0.002084} \approx 479.8

    Finally, compute P(Bj,m)P(B \mid j, m) for both cases:

    P(b \mid j, m) = \alpha \cdot P(b, j, m) = 479.8 \cdot 0.000592 \approx 0.284

    P(\neg b \mid j, m) = \alpha \cdot P(\neg b, j, m) = 479.8 \cdot 0.001492 \approx 0.716

    Thus, the posterior probability of a burglary given that John and Mary have called is:

    P(b \mid j, m) \approx 0.284
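
    As a check, here is a short, self-contained sketch of inference by enumeration that reproduces this calculation from the CPTs above (restating them for completeness); it prints roughly 0.284 and 0.716:

```python
# CPTs: probability that each variable is true, given its parents' values.
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
p_j = {True: 0.90, False: 0.05}                      # P(j | A)
p_m = {True: 0.70, False: 0.01}                      # P(m | A)

def bern(p_true, value):
    """P(variable = value), given the probability that it is true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)."""
    return (bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a)
            * bern(p_j[a], j) * bern(p_m[a], m))

# Sum out the hidden variables E and A with evidence j = m = true.
unnormalized = {b: sum(joint(b, e, a, True, True)
                       for e in (True, False) for a in (True, False))
                for b in (True, False)}

# Normalize so the two entries sum to 1.
alpha = 1.0 / sum(unnormalized.values())
posterior = {b: alpha * p for b, p in unnormalized.items()}
print(posterior[True], posterior[False])   # ~0.284, ~0.716
```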