Understanding Bayesian Networks: A Comprehensive Guide

    By CamelEdge

    Updated on Thu Jul 25 2024


    A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). Each random variable encodes some aspect of the world for which we may have uncertainty. These random variables are usually denoted using capital letters such as:

    • T: Is it raining?
    • H: Heads or tails?
    • Q: How long will it take to reach the front of the queue?

    Random variables can be discrete or continuous. Each random variable is assigned a value from its domain. For example, H \in \{\text{heads, tails}\} and Q \in [0, \infty).


    Probability Basics

    We assign a probability value to each value a random variable takes. For example, we can assign a value of 0.5 to heads and 0.5 to tails. There are three types of probability values:

    Marginal Probability

    Marginal probability P(x) is the probability of a single event occurring without any consideration of other events. It is derived by summing or integrating over the possible values of the other random variables.

    Joint Probability

    Joint probability P(x, y) is the probability of two or more events occurring simultaneously. It is the probability of the intersection of events, representing the combined outcome of these events.

    Conditional Probability

    Conditional probability P(x|y) is the probability of an event occurring given that another event has already occurred.

    Probability Rules

    Probability rules allow us to establish relationships between different probabilities. For instance, we can relate marginal probabilities to joint probabilities. 🎲

    Law of Total Probability The law of total probability relates marginal and joint probabilities. It’s given by:

    P(X) = \sum_{i} P(X, Y_i)

    It states that the marginal probability of an event can be found by considering all possible ways that the event can occur jointly with the other variable.
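
    To make this concrete, here is a minimal Python sketch (with a made-up joint distribution over two variables) that recovers a marginal probability by summing the joint over the other variable:

```python
# A made-up joint distribution P(Weather, Temp) over two discrete variables.
joint = {
    ('sunny', 'hot'): 0.40,
    ('sunny', 'cold'): 0.10,
    ('rainy', 'hot'): 0.10,
    ('rainy', 'cold'): 0.40,
}

def marginal_weather(w):
    """Law of total probability: P(Weather = w) = sum over t of P(Weather = w, Temp = t)."""
    return sum(p for (weather, temp), p in joint.items() if weather == w)

print(marginal_weather('sunny'))  # 0.40 + 0.10 = 0.5
```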

    Product Rule The product rule relates joint probability to conditional probability. It states that the joint probability of two variables X and Y can be expressed as:

    P(X, Y) = P(X|Y)P(Y) = P(Y|X)P(X)

    Similarly, the conditional probability can be expressed as:

    P(X \mid Y) = \frac{P(X,Y)}{P(Y)}

    Bayes' Rule The product rule allows us to compute the conditional probability from the joint probability. But what if the joint is not available? In that case Bayes' rule comes to the rescue. It allows us to compute the conditional probability P(X \mid Y) from the inverse conditional P(Y \mid X). It is derived from the product rule and is expressed as:

    P(X \mid Y) = \frac{P(Y \mid X) P(X)}{P(Y)}
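
    As a quick illustration, here is a small sketch that applies Bayes' rule, obtaining P(Y) from the law of total probability. The numbers (a 1% prior, a 90% true-positive rate, a 5% false-positive rate) are made up for this example:

```python
# Hypothetical numbers: P(X) prior, P(Y|X), and P(Y|not X) for a diagnostic test.
p_x = 0.01              # P(X): prior probability of the condition
p_y_given_x = 0.90      # P(Y | X): test is positive given the condition
p_y_given_not_x = 0.05  # P(Y | not X): test is positive without the condition

# P(Y) via the law of total probability, then Bayes' rule for P(X | Y).
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))  # 0.154
```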

    Chain Rule The chain rule of probability allows us to express the joint probability of a sequence of events as the product of conditional probabilities. For a sequence of events X_1, X_2, \ldots, X_n, the chain rule is given by:

    P(X_1, X_2, \ldots, X_n) = P(X_1)P(X_2|X_1)P(X_3|X_1, X_2)\cdots P(X_n|X_1, X_2, \ldots, X_{n-1})

    It is obtained by applying the product rule repeatedly.

    Conditional Independence

    Conditional independence is a fundamental concept in probability theory and Bayesian networks. Two random variables X and Y are conditionally independent given a third random variable Z if the conditional probability distribution of X given Z is the same regardless of the value of Y. Formally, X and Y are conditionally independent given Z if:

    P(X, Y \mid Z) = P(X \mid Z) \cdot P(Y \mid Z)

    or equivalently,

    P(X \mid Y, Z) = P(X \mid Z)

    This implies that knowing the value of Y provides no additional information about X once we know the value of Z.
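
    The definition is easy to check numerically. The sketch below (with made-up numbers) builds a joint distribution in which X and Y are conditionally independent given Z, and verifies that P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z) holds for every combination of values:

```python
from itertools import product

# Made-up conditional probabilities: P(Z), P(X=true | Z), P(Y=true | Z).
p_z = {True: 0.3, False: 0.7}
p_x_given_z = {True: 0.9, False: 0.2}
p_y_given_z = {True: 0.6, False: 0.1}

def joint(x, y, z):
    """P(X=x, Y=y, Z=z) built under the conditional independence assumption."""
    px = p_x_given_z[z] if x else 1 - p_x_given_z[z]
    py = p_y_given_z[z] if y else 1 - p_y_given_z[z]
    return p_z[z] * px * py

for x, y, z in product([True, False], repeat=3):
    lhs = joint(x, y, z) / p_z[z]                       # P(X, Y | Z)
    rhs = ((p_x_given_z[z] if x else 1 - p_x_given_z[z])
           * (p_y_given_z[z] if y else 1 - p_y_given_z[z]))  # P(X | Z) * P(Y | Z)
    assert abs(lhs - rhs) < 1e-12
print("P(X, Y | Z) = P(X | Z) P(Y | Z) for all values")
```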


    A Simple Bayes' Net

    Let's create our first Bayesian network. Consider a simple example where we have three random variables: D, T_1, and T_2. Assume D represents having some disease, and T_1 and T_2 are two test results.

    The joint distribution of these three variables is given by the chain rule:

    P(D, T_1, T_2) = P(D) P(T_1|D) P(T_2|D, T_1)

    To simplify our model, we will make a conditional independence assumption that T_2 and T_1 are independent given D. This means:

    P(T_2|D, T_1) = P(T_2|D)
    [Figure: Bayesian network with D as the parent of both T_1 and T_2]

    Using this assumption, we can express the joint distribution as:

    P(D, T_1, T_2) = P(D) P(T_1|D) P(T_2|D)

    The Bayesian network corresponding to this assumption is shown in the figure above.
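
    Written as code, the factored model is just three small tables multiplied together. The probability values in this sketch are made up for illustration:

```python
# P(D), P(T1=positive | D), P(T2=positive | D); all numbers are illustrative.
p_d = {True: 0.01, False: 0.99}
p_t1_given_d = {True: 0.90, False: 0.05}
p_t2_given_d = {True: 0.80, False: 0.10}

def joint(d, t1, t2):
    """P(D, T1, T2) = P(D) * P(T1 | D) * P(T2 | D)."""
    pt1 = p_t1_given_d[d] if t1 else 1 - p_t1_given_d[d]
    pt2 = p_t2_given_d[d] if t2 else 1 - p_t2_given_d[d]
    return p_d[d] * pt1 * pt2

# Probability of having the disease and both tests coming back positive.
print(joint(True, True, True))  # 0.01 * 0.90 * 0.80 = 0.0072
```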


    Another Example

    Now let's look at an example that consists of five random variables: Burglary (B), Earthquake (E), Alarm (A), John Calls (J), and Mary Calls (M). Each of these variables represents an event that can either happen or not happen, and their relationships can be described using conditional dependencies.

    Structure of the Bayesian Network

    [Figure: Bayesian network with B and E as parents of A, and A as the parent of J and M]

    The Bayesian network for this scenario can be structured as follows:

    Burglary (B) and Earthquake (E): These two events are considered independent of each other. A burglary occurring does not influence the probability of an earthquake and vice versa.

    Alarm (A): The alarm depends on both the Burglary and Earthquake events. If either a burglary or an earthquake occurs, it can trigger the alarm.

    John Calls (J) and Mary Calls (M): John and Mary calling are dependent on whether the alarm has gone off. If the alarm rings, it increases the likelihood that John or Mary will call.

    The joint distribution of these variables can be represented as P(B, E, A, J, M). Using the chain rule for Bayesian networks (which incorporates the conditional independence assumptions implied by the Bayes net structure), we can express this joint distribution in terms of the conditional probabilities:

    P(B, E, A, J, M) = P(B) \cdot P(E) \cdot P(A \mid B, E) \cdot P(J \mid A) \cdot P(M \mid A)

    In a general Bayes Net, the joint probabilities are expressed as:

    P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))

    where \text{Parents}(X_i) denotes the set of parent nodes of X_i.
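
    As a sketch, this factorization is straightforward to evaluate in code: store each node's parents and CPT, then multiply the corresponding entries. The two-node rain/wet-grass network and its numbers below are made up for illustration:

```python
# Each node maps to (list of parents, CPT keyed by (value, parent_values)).
nodes = {
    'Rain':     ([], {(True, ()): 0.2, (False, ()): 0.8}),
    'WetGrass': (['Rain'], {(True, (True,)): 0.9, (False, (True,)): 0.1,
                            (True, (False,)): 0.1, (False, (False,)): 0.9}),
}

def joint_probability(assignment):
    """P(X1, ..., Xn) = product of P(Xi | Parents(Xi)) read from each node's CPT."""
    prob = 1.0
    for var, (parents, cpt) in nodes.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= cpt[(assignment[var], parent_values)]
    return prob

print(joint_probability({'Rain': True, 'WetGrass': True}))  # 0.2 * 0.9 = 0.18
```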

    Benefits of Using Bayesian Networks

    Bayesian networks offer significant advantages in probabilistic modeling and reasoning under uncertainty:

    Reduction in Parameter Complexity: The full joint distribution table of a set of variables requires d^n probability values, where d is the number of values in each variable's domain and n is the number of variables. In contrast, Bayesian networks significantly reduce the number of parameters required, depending on the network's structure and the implied conditional independence assumptions.

    Example Illustration: Let's break down the parameters in the example Bayesian network above:

    • Burglary (B): One parameter, P(b), representing the probability of a burglary occurring. The probability of a burglary not occurring follows as P(\neg b) = 1 - P(b).

    • Earthquake (E): One parameter, P(e), representing the probability of an earthquake occurring.

    • Alarm (A): P(A \mid B, E) requires 4 parameters because A depends on both B and E, one for each of the four combinations of their values. The other 4 probabilities can be obtained as complements of these 4.

    • John Calls (J): In the example, P(J \mid A) is described by 2 parameters, i.e. P(j \mid a) and P(j \mid \neg a). The other two probabilities can be obtained as follows: P(\neg j \mid a) = 1 - P(j \mid a) and P(\neg j \mid \neg a) = 1 - P(j \mid \neg a).

    • Mary Calls (M): Similarly, P(M \mid A) requires 2 parameters.

    Therefore, the total number of parameters in this Bayesian network example is 1 + 1 + 4 + 2 + 2 = 10. Each parameter captures a specific probability value that reflects the likelihood of events occurring based on the network structure and conditional dependencies. This structured approach reduces the complexity compared to a full joint probability table (which requires d^n values, i.e. 2^5 = 32 for five binary variables), making Bayesian networks more efficient and effective for probabilistic modeling and inference tasks.
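
    The count is easy to reproduce: with binary variables, each node contributes 2 raised to the number of its parents independent probabilities. A quick sketch for the alarm network:

```python
# Number of parents of each node in the burglary/alarm network.
num_parents = {'B': 0, 'E': 0, 'A': 2, 'J': 1, 'M': 1}

# Each binary node needs 2**(#parents) independent parameters.
bayes_net_params = sum(2 ** k for k in num_parents.values())
full_joint_entries = 2 ** len(num_parents)
print(bayes_net_params, full_joint_entries)  # 10 vs 32
```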

    Conditional Probability Tables (CPTs)

    Let's assume that we are provided with the conditional probabilities of all variables in the Bayesian network example above. The tables below specify, for each variable, the probability of the variable given its parent variables' states, adhering to the network structure and the provided parameters.

    B     | P(B)
    true  | 0.001
    false | 0.999

    E     | P(E)
    true  | 0.002
    false | 0.998

    B     | E     | P(a \mid B, E)
    true  | true  | 0.95
    true  | false | 0.94
    false | true  | 0.29
    false | false | 0.001

    A     | P(j \mid A)
    true  | 0.90
    false | 0.05

    A     | P(m \mid A)
    true  | 0.70
    false | 0.01

    Inference by Enumeration

    To find the posterior probability of, for example, burglary given that John and Mary have called, P(B \mid j, m), we can proceed as follows:

    1. Apply the Product Rule:

      P(B \mid j, m) = \frac{P(B, j, m)}{P(j, m)}
    2. Normalization: Since P(j, m) can be difficult to compute directly, we use the normalization trick:

      P(B \mid j, m) = \alpha P(B, j, m)

      where \alpha is a normalization constant, which will be computed later.

    3. Use the Law of Total Probability: We expand P(B, j, m) using the law of total probability over all possible values of the other variables in the network:

      P(B \mid j, m) = \alpha \sum_e \sum_a P(B, j, m, e, a)
    4. Apply the Bayesian Chain Rule: Next, we use the Bayesian chain rule to decompose P(B, j, m, e, a) into a product of conditional probabilities. Given the structure of the Bayesian network, we have:

      P(B, j, m, e, a) = P(B) \cdot P(e) \cdot P(a \mid B, e) \cdot P(j \mid a) \cdot P(m \mid a)

    Putting it all together:

    P(B \mid j, m) = \alpha \sum_e \sum_a P(B) \cdot P(e) \cdot P(a \mid B, e) \cdot P(j \mid a) \cdot P(m \mid a)

    By summing over all possible values of E (earthquake) and A (alarm), and then normalizing the result, we can compute P(B \mid j, m), the posterior probability of burglary given that John and Mary have called. The normalization constant \alpha ensures that the total probability sums to 1.

    Now, we can compute this using the given values. We'll calculate for B = \text{true} and B = \text{false} separately.

    P(b \mid j, m) = \alpha P(b) \cdot \sum_e \sum_a P(e) \cdot P(a \mid b, e) \cdot P(j \mid a) \cdot P(m \mid a)

    Plugging in the CPT values gives:

    P(b \mid j, m) = \alpha \cdot 0.000592

    P(\neg b \mid j, m) = \alpha \cdot 0.001492

    Normalize

    Now we need to normalize these probabilities to find \alpha, using the fact that P(b \mid j, m) + P(\neg b \mid j, m) = 1:

    \alpha = 1 / \left( P(b, j, m) + P(\neg b, j, m) \right)

    Substitute the values:

    \alpha = \frac{1}{0.000592 + 0.001492} = \frac{1}{0.002084} \approx 479.8

    Finally, compute P(Bj,m)P(B \mid j, m) for both cases:

    P(b \mid j, m) = \alpha \cdot P(b, j, m) = 479.8 \cdot 0.000592 \approx 0.284

    P(\neg b \mid j, m) = \alpha \cdot P(\neg b, j, m) = 479.8 \cdot 0.001492 \approx 0.716

    Thus, the posterior probability of a burglary given that John and Mary have called is:

    P(b \mid j, m) \approx 0.284
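
    As a check, here is a short, self-contained sketch of inference by enumeration that reproduces this calculation from the CPTs above (restating them for completeness); it prints roughly 0.284 and 0.716:

```python
# CPTs: probability that each variable is true, given its parents' values.
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
p_j = {True: 0.90, False: 0.05}                      # P(j | A)
p_m = {True: 0.70, False: 0.01}                      # P(m | A)

def bern(p_true, value):
    """P(variable = value), given the probability that it is true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)."""
    return (bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a)
            * bern(p_j[a], j) * bern(p_m[a], m))

# Sum out the hidden variables E and A with evidence j = m = true.
unnormalized = {b: sum(joint(b, e, a, True, True)
                       for e in (True, False) for a in (True, False))
                for b in (True, False)}

# Normalize so the two entries sum to 1.
alpha = 1.0 / sum(unnormalized.values())
posterior = {b: alpha * p for b, p in unnormalized.items()}
print(posterior[True], posterior[False])   # ~0.284, ~0.716
```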