
Understanding Naive Bayes: A Simple and Effective Classification Algorithm


    By CamelEdge

    Updated on Fri Jul 26 2024


    Naive Bayes is one of the simplest yet most effective classification algorithms in machine learning. It is particularly useful for tasks with a large number of features, where the assumption of feature independence approximately holds. Despite its simplicity, Naive Bayes often performs surprisingly well in a wide range of real-world applications.

    The Naive Bayes Model

    [Figure: A Naive Bayes network with the class variable Y pointing to the features X_1, X_2, ..., X_n]

    At its core, Naive Bayes relies on Bayes' theorem, which relates the conditional and marginal probabilities of random events. A simple Naive Bayes network typically includes one hidden variable Y (the class label we want to predict) and multiple observed variables X_1, X_2, \ldots, X_n (the features). This model can be applied to various classification problems, such as:

    • Spam Classification: Given an email, predict whether it is spam or not.
    • Medical Diagnosis: Given a list of symptoms, predict whether a patient has a specific disease.
    • Weather Prediction: Based on temperature, humidity, etc., predict if it will rain tomorrow.

    Naive Bayes Assumptions

    Naive Bayes makes a strong independence assumption: the observed features are independent of each other given the class label Y. This simplifies the computation significantly and is mathematically expressed as:

    P(Y \mid X_1, \ldots, X_n) = \alpha P(Y) \prod_{i=1}^{n} P(X_i \mid Y)

    where \alpha is a normalization constant.
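    To make the role of the normalization constant concrete, here is a minimal sketch (in Python, with invented probability values; not code from the article) of how the unnormalized per-class scores are combined and then rescaled so they sum to one:

```python
import math

def naive_bayes_posterior(prior, likelihoods):
    """Combine a class prior with per-feature likelihoods for one observation.

    prior: dict mapping class label -> P(Y = label)
    likelihoods: dict mapping class label -> list of P(x_i | Y = label)
                 for the observed feature values x_1, ..., x_n
    """
    # Unnormalized score for each class: P(Y) * prod_i P(x_i | Y)
    scores = {y: prior[y] * math.prod(likelihoods[y]) for y in prior}
    # alpha is simply one over the sum of the unnormalized scores
    alpha = 1.0 / sum(scores.values())
    return {y: alpha * s for y, s in scores.items()}

# Hypothetical numbers, purely to show the mechanics
print(naive_bayes_posterior(
    prior={"spam": 0.3, "not spam": 0.7},
    likelihoods={"spam": [0.8, 0.6], "not spam": [0.1, 0.4]},
))
```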

    Learning and Inference

    To perform classification using Naive Bayes, we need to:

    1. Learn the Prior and Likelihood: Using the training data, we estimate the prior probabilities P(Y) and the likelihood probabilities P(X_i \mid Y).

    2. Compute the Posterior: For a new observation, we compute the posterior probability P(Y \mid X_1, \ldots, X_n) using the learned prior and likelihood.
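    In practice, a library implementation usually handles both steps. As a rough illustration (not part of the original article), the sketch below uses scikit-learn's BernoulliNB, which is designed for Boolean features, on a small invented dataset; note that its default Laplace smoothing makes the estimated probabilities differ slightly from raw counting:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Invented Boolean data: rows are examples, columns are features X1..X3
X = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 0],
              [0, 0, 1],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])   # class labels

model = BernoulliNB()              # step 1: learn prior and per-feature likelihoods
model.fit(X, y)

x_new = np.array([[1, 0, 1]])      # a new observation
print(model.predict_proba(x_new))  # step 2: posterior P(Y | X1, X2, X3)
print(model.predict(x_new))        # most probable class
```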

    Example

    Consider the following training dataset with three Boolean features X_1, X_2, X_3 and a Boolean classification variable Y:

    The objective is to compute the class of a new observation X_1=F, X_2=T, X_3=F, i.e., to find the posterior probability P(Y \mid X_1=F, X_2=T, X_3=F).

    X_1   X_2   X_3   Y
    T     T     T     T
    T     F     T     F
    T     F     F     F
    F     T     T     F
    F     F     F     T

    We will start by learning the parameters of the Bayes net from the training data. The table above shows the 5 training examples.

    1. Learning the Bayesian Network
    • Estimate the priors: P(Y=T) = \frac{2}{5}, since 2 of the 5 examples have Y=T, and P(Y=F) = \frac{3}{5}.

    • Estimate the likelihoods: P(X_1 \mid Y), P(X_2 \mid Y), P(X_3 \mid Y). The likelihoods are calculated by counting the occurrences of each observation given the class Y. For example, to compute P(X_1 = T \mid Y = T), we first count the number of training examples where Y = T. Then, among these examples, we count how many have X_1 = T. This gives us P(X_1 = T \mid Y = T) = \frac{1}{2}. The resulting likelihood tables are shown below.

    X_1   Y     P(X_1 \mid Y)
    T     T     1/2
    F     T     1/2
    T     F     2/3
    F     F     1/3

    X_2   Y     P(X_2 \mid Y)
    T     T     1/2
    F     T     1/2
    T     F     1/3
    F     F     2/3

    X_3   Y     P(X_3 \mid Y)
    T     T     1/2
    F     T     1/2
    T     F     2/3
    F     F     1/3
    2. Inference: To compute the class of the new observation X_1=F, X_2=T, X_3=F, i.e., to find the posterior probability P(Y \mid X_1=F, X_2=T, X_3=F), we use the Naive Bayes factorization:
    P(Y \mid X_1=F, X_2=T, X_3=F) = \alpha P(Y) P(X_1=F \mid Y) P(X_2=T \mid Y) P(X_3=F \mid Y)

    This gives us:

    P(Y = T \mid X_1=F, X_2=T, X_3=F) = \alpha \cdot \frac{2}{5} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \alpha \frac{1}{20}

    P(Y = F \mid X_1=F, X_2=T, X_3=F) = \alpha \cdot \frac{3}{5} \cdot \frac{1}{3} \cdot \frac{1}{3} \cdot \frac{1}{3} = \alpha \frac{1}{45}

    After normalization, we get:

    P(Y = T \mid X_1=F, X_2=T, X_3=F) \approx 0.69

    P(Y = F \mid X_1=F, X_2=T, X_3=F) \approx 0.31
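
    As a sanity check on the arithmetic above, here is a short self-contained sketch (Python, written for this walkthrough rather than taken from the article) that learns the priors and likelihoods from the five training rows by counting and reproduces the posterior of roughly 0.69 versus 0.31:

```python
from collections import Counter

# Training data from the example: (X1, X2, X3, Y)
data = [
    (True,  True,  True,  True),
    (True,  False, True,  False),
    (True,  False, False, False),
    (False, True,  True,  False),
    (False, False, False, True),
]

# Priors: P(Y=T) = 2/5, P(Y=F) = 3/5
class_counts = Counter(row[3] for row in data)
prior = {y: c / len(data) for y, c in class_counts.items()}

def likelihood(i, value, y):
    """P(X_i = value | Y = y), estimated by counting."""
    rows_y = [row for row in data if row[3] == y]
    return sum(row[i] == value for row in rows_y) / len(rows_y)

# Query observation: X1=F, X2=T, X3=F
query = (False, True, False)
scores = {}
for y in (True, False):
    s = prior[y]
    for i, value in enumerate(query):
        s *= likelihood(i, value, y)
    scores[y] = s                      # 1/20 for Y=T, 1/45 for Y=F

alpha = 1.0 / sum(scores.values())
posterior = {y: alpha * s for y, s in scores.items()}
print(posterior)                       # approximately {True: 0.69, False: 0.31}
```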