
Understanding Naive Bayes: A Simple and Effective Classification Algorithm


    By CamelEdge

    Updated on Fri Jul 26 2024


    Naive Bayes is one of the simplest yet most effective classification algorithms in machine learning. It is particularly useful for tasks with a large number of features, where the assumption of feature independence approximately holds. Despite its simplicity, Naive Bayes often performs surprisingly well in a wide range of real-world applications.

    The Naive Bayes Model

    [Figure: A Naive Bayes network with the class variable Y pointing to the features X_1, X_2, ..., X_n]

    At its core, Naive Bayes relies on Bayes' theorem, which relates the conditional and marginal probabilities of random events. A simple Naive Bayes network typically includes one hidden variable Y (the class label we want to predict) and multiple observed variables X_1, X_2, \ldots, X_n (the features). This model can be applied to various classification problems, such as:

    • Spam Classification: Given an email, predict whether it is spam or not.
    • Medical Diagnosis: Given a list of symptoms, predict whether a patient has a specific disease.
    • Weather Prediction: Based on temperature, humidity, etc., predict if it will rain tomorrow.

    Naive Bayes Assumptions

    Naive Bayes makes a strong independence assumption: the observed features are independent of each other given the class label Y. This simplifies the computation significantly and is mathematically expressed as:

    P(Y \mid X_1, \ldots, X_n) = \alpha P(Y) \prod_{i=1}^{n} P(X_i \mid Y)

    where \alpha is a normalization constant.
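    To make the role of the normalization constant concrete, here is a minimal sketch (in Python, with invented probability values; not code from the article) of how the unnormalized per-class scores are combined and then rescaled so they sum to one:

```python
import math

def naive_bayes_posterior(prior, likelihoods):
    """Combine a class prior with per-feature likelihoods for one observation.

    prior: dict mapping class label -> P(Y = label)
    likelihoods: dict mapping class label -> list of P(x_i | Y = label)
                 for the observed feature values x_1, ..., x_n
    """
    # Unnormalized score for each class: P(Y) * prod_i P(x_i | Y)
    scores = {y: prior[y] * math.prod(likelihoods[y]) for y in prior}
    # alpha is simply one over the sum of the unnormalized scores
    alpha = 1.0 / sum(scores.values())
    return {y: alpha * s for y, s in scores.items()}

# Hypothetical numbers, purely to show the mechanics
print(naive_bayes_posterior(
    prior={"spam": 0.3, "not spam": 0.7},
    likelihoods={"spam": [0.8, 0.6], "not spam": [0.1, 0.4]},
))
```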

    Learning and Inference

    To perform classification using Naive Bayes, we need to:

    1. Learn the Prior and Likelihood: Using the training data, we estimate the prior probabilities P(Y) and the likelihood probabilities P(X_i \mid Y).

    2. Compute the Posterior: For a new observation, we compute the posterior probability P(Y \mid X_1, \ldots, X_n) using the learned prior and likelihood.
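    In practice, a library implementation usually handles both steps. As a rough illustration (not part of the original article), the sketch below uses scikit-learn's BernoulliNB, which is designed for Boolean features, on a small invented dataset; note that its default Laplace smoothing makes the estimated probabilities differ slightly from raw counting:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Invented Boolean data: rows are examples, columns are features X1..X3
X = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 0],
              [0, 0, 1],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])   # class labels

model = BernoulliNB()              # step 1: learn prior and per-feature likelihoods
model.fit(X, y)

x_new = np.array([[1, 0, 1]])      # a new observation
print(model.predict_proba(x_new))  # step 2: posterior P(Y | X1, X2, X3)
print(model.predict(x_new))        # most probable class
```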

    Example

    Consider the following training dataset with three Boolean features X_1, X_2, X_3 and a Boolean classification variable Y:

    The objective is to compute the class of a new observation X_1=F, X_2=T, X_3=F, i.e., to find the posterior probability P(Y \mid X_1=F, X_2=T, X_3=F).

    X_1   X_2   X_3   Y
    T     T     T     T
    T     F     T     F
    T     F     F     F
    F     T     T     F
    F     F     F     T

    We will start by learning the parameters of the Bayes net from the training data. The table above shows the 5 training examples.

    1. Learning the Bayesian Network
    • Estimate the priors: P(Y=T) = \frac{2}{5}, since 2 of the 5 examples have Y=T, and P(Y=F) = \frac{3}{5}.

    • Estimate the likelihoods: P(X_1 \mid Y), P(X_2 \mid Y), P(X_3 \mid Y). The likelihoods are calculated by counting the occurrences of each observation given the class Y. For example, to compute P(X_1 = T \mid Y = T), we first count the number of training examples where Y = T. Then, among these examples, we count how many have X_1 = T. This gives us P(X_1 = T \mid Y = T) = \frac{1}{2}. The resulting likelihood tables are shown below.

    X_1   Y     P(X_1 \mid Y)
    T     T     1/2
    F     T     1/2
    T     F     2/3
    F     F     1/3

    X_2   Y     P(X_2 \mid Y)
    T     T     1/2
    F     T     1/2
    T     F     1/3
    F     F     2/3

    X_3   Y     P(X_3 \mid Y)
    T     T     1/2
    F     T     1/2
    T     F     2/3
    F     F     1/3
    2. Inference: To compute the class of the new observation X_1=F, X_2=T, X_3=F, i.e., to find the posterior probability P(Y \mid X_1=F, X_2=T, X_3=F), we use the Naive Bayes factorization:
    P(Y \mid X_1=F, X_2=T, X_3=F) = \alpha P(Y) P(X_1=F \mid Y) P(X_2=T \mid Y) P(X_3=F \mid Y)

    This gives us:

    P(Y = T \mid X_1=F, X_2=T, X_3=F) = \alpha \cdot \frac{2}{5} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \alpha \frac{1}{20}

    P(Y = F \mid X_1=F, X_2=T, X_3=F) = \alpha \cdot \frac{3}{5} \cdot \frac{1}{3} \cdot \frac{1}{3} \cdot \frac{1}{3} = \alpha \frac{1}{45}

    After normalization, we get:

    P(Y = T \mid X_1=F, X_2=T, X_3=F) \approx 0.69

    P(Y = F \mid X_1=F, X_2=T, X_3=F) \approx 0.31
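
    As a sanity check on the arithmetic above, here is a short self-contained sketch (Python, written for this walkthrough rather than taken from the article) that learns the priors and likelihoods from the five training rows by counting and reproduces the posterior of roughly 0.69 versus 0.31:

```python
from collections import Counter

# Training data from the example: (X1, X2, X3, Y)
data = [
    (True,  True,  True,  True),
    (True,  False, True,  False),
    (True,  False, False, False),
    (False, True,  True,  False),
    (False, False, False, True),
]

# Priors: P(Y=T) = 2/5, P(Y=F) = 3/5
class_counts = Counter(row[3] for row in data)
prior = {y: c / len(data) for y, c in class_counts.items()}

def likelihood(i, value, y):
    """P(X_i = value | Y = y), estimated by counting."""
    rows_y = [row for row in data if row[3] == y]
    return sum(row[i] == value for row in rows_y) / len(rows_y)

# Query observation: X1=F, X2=T, X3=F
query = (False, True, False)
scores = {}
for y in (True, False):
    s = prior[y]
    for i, value in enumerate(query):
        s *= likelihood(i, value, y)
    scores[y] = s                      # 1/20 for Y=T, 1/45 for Y=F

alpha = 1.0 / sum(scores.values())
posterior = {y: alpha * s for y, s in scores.items()}
print(posterior)                       # approximately {True: 0.69, False: 0.31}
```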