Demystifying Artificial Neural Networks: A Comprehensive Guide to AI's Building Blocks
By CamelEdge
Updated on Mon Aug 19 2024
Introduction
Artificial Neural Networks (ANNs) are a cornerstone of modern machine learning, inspired by biological neural networks. They are designed to recognize patterns, learn from data, and make decisions with minimal human intervention. This article provides a comprehensive review of the essential components and concepts of ANNs, including convolution, the perceptron, loss minimization, generalized backpropagation, activation functions, and the differences between regression and classification. We will also explore a practical classification example to illustrate these concepts in action.
Let's explore this concept with an image classification example where the goal is to categorize a given image into one of many possible classes. For instance, the ImageNet Large Scale Visual Recognition Challenge dataset contains 1,000 different classes, and our task is to classify the image into one of these categories. We'll refer to the input image as $x$ and the corresponding output class as $y$.
From Manual Feature Engineering to Automated Learning
In the early days of image classification, the process of extracting meaningful information from images was a labor-intensive task. Researchers and engineers had to manually design and select features—distinct characteristics or patterns within an image—that could be used to identify and classify objects. This approach, known as manual feature engineering, often involved the use of various filters and mathematical transformations to highlight edges, textures, shapes, or other relevant aspects of an image. These hand-crafted features were then fed into traditional machine learning algorithms, such as support vector machines (SVMs) or decision trees, to perform the classification.
While this method yielded significant advancements, it came with its own set of challenges. Designing effective features required deep domain expertise, and the features that worked well for one type of data or task might not be applicable to another. Moreover, this process was time-consuming and often involved trial and error to fine-tune the feature extraction techniques.
However, with the advent of deep learning, the landscape of image classification has drastically changed. Deep learning algorithms, particularly artificial neural networks (ANNs), have the remarkable ability to automatically learn and extract features directly from the data, bypassing the need for manual feature engineering.
Artificial neural networks, inspired by the structure and function of the human brain, consist of layers of interconnected nodes (or "neurons"). These networks can automatically learn hierarchical representations of the input data through a process called backpropagation. In the context of image classification, this means that rather than manually designing features, we can feed raw pixel data into a neural network, and the network will learn the most relevant features on its own.
This automatic feature learning is a game-changer. Instead of relying on hand-crafted filters, the network learns to recognize low-level features like edges and textures in the initial layers, and then combines these into more complex patterns like shapes and objects in deeper layers. This hierarchical feature learning allows the network to capture intricate details and abstract representations that are difficult, if not impossible, to manually engineer.
The shift from manual feature engineering to automated learning has led to unprecedented advances in the field of image classification. It has not only improved the accuracy of models but also broadened their applicability to a wide range of tasks and domains. This is evident in the success of deep learning models across various applications, from medical image analysis and autonomous driving to facial recognition and beyond.
In summary, while manual feature engineering laid the foundation for early image classification techniques, the advent of deep learning and artificial neural networks has revolutionized the field. By enabling automated feature learning directly from data, these algorithms have opened up new possibilities and set new standards for what is achievable in image classification and beyond.
Understanding Artificial Neural Networks: A Simple 1D Example
Artificial Neural Networks (ANNs) are a powerful tool in machine learning, designed to replicate the way the human brain processes information. Let's start with a simple example to understand how ANNs work in a 1D scenario where both the input ($x$) and output ($y$) variables are one-dimensional.
Example: Predicting House Prices
Consider a dataset that includes information on house sizes and their corresponding prices:
Size in feet² ($x$) | Price ($) in 1000's ($y$)
---|---
2104 | 460
1416 | 232
1534 | 315
852 | 178
... | ...
In this example:
- $m$ represents the number of training examples, or the total number of houses in your dataset.
- $x$'s are the input variables or features, in this case, the sizes of the houses.
- $y$'s are the output or target variables, which are the prices of the houses.
- $(x, y)$ denotes a single training example, and $(x^{(i)}, y^{(i)})$ refers to the $i$-th training example in your dataset.
The Perceptron: Building Block of ANNs
At the core of ANNs is the perceptron, the simplest form of a neural network. A perceptron consists of input values, weights, a bias, and an activation function that determines the output. This basic unit can be scaled up to form more complex networks capable of handling larger and more intricate datasets.
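To make this concrete, here is a minimal sketch (an illustration, not code from the original article) of a single perceptron in NumPy: it combines the inputs with the weights, adds the bias, and passes the result through an activation function (a sigmoid here; activation functions are discussed later in this article). The numeric values are made up for demonstration.

```python
import numpy as np

def perceptron(x, w, b, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """A single perceptron: weighted sum of inputs plus a bias, passed through an activation."""
    z = np.dot(w, x) + b     # weighted sum of the inputs plus the bias
    return activation(z)     # nonlinearity applied to the pre-activation value

# Illustrative (hypothetical) values
x = np.array([2.0, -1.0])   # inputs
w = np.array([0.5, 0.3])    # weights
b = 0.1                     # bias
print(perceptron(x, w, b))  # a single scalar output in (0, 1)
```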
The Hypothesis Function
Perceptron
To make predictions, the neural network uses a hypothesis function, which is a mathematical equation. For a 1D case, where there's only one input feature ($x$), the hypothesis function is:

$$h_w(x) = w_0 + w_1 x$$
Here, $w_0$ is the bias (the intercept), and $w_1$ is the weight (the slope) associated with the input feature $x$. This function allows the network to predict the output based on the input data.
If the problem involves more than one input feature, say two features ($x_1$ and $x_2$), the hypothesis function expands to:

$$h_w(x) = w_0 + w_1 x_1 + w_2 x_2$$
In matrix notation:

$$x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$
Here, $x$ is a column vector containing the input features and an additional 1 for the bias term, and $w$ is a column vector containing the weights and the bias.
The hypothesis function in vector form becomes:

$$h_w(x) = w^T x$$
where $w^T$ is the transpose of the weight vector.
This notation allows for a compact and generalizable representation of the hypothesis function, which is especially useful as the number of features increases.
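As a quick illustration (a sketch, not part of the original text), the vectorized hypothesis $h_w(x) = w^T x$ is a one-liner in NumPy. The feature values and weights below are hypothetical, chosen only to show the computation.

```python
import numpy as np

# Two input features, with a leading 1 appended for the bias term w0
x = np.array([1.0, 2104.0, 3.0])   # [1, x1, x2]
w = np.array([80.0, 0.15, 25.0])   # [w0, w1, w2] (hypothetical weights)

h = w.T @ x   # equivalently np.dot(w, x); transposing a 1-D array is a no-op
print(h)      # h_w(x) = w0 + w1*x1 + w2*x2
```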
The Goal: Minimizing Error
The primary goal of training an ANN is to learn the optimal parameters (weights and biases) that minimize the error between the predicted output $h_w(x)$ and the actual target value $y$. This is done through an iterative process where the network adjusts these parameters to improve the accuracy of its predictions.
Mean Squared Error (MSE) Loss Function
The Mean Squared Error (MSE) loss function is commonly used to compute the error between the predicted output $h_w(x)$ and the actual target value $y$. It measures the average squared difference between the predicted values and the actual values, providing a way to quantify how well the model's predictions match the ground truth.
For a dataset with $m$ training examples, where each training example consists of an input $x^{(i)}$ and an output $y^{(i)}$, the MSE loss function is defined as:

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2$$
where:
- $J(w)$ is the cost function or loss function.
- $h_w(x^{(i)})$ is the hypothesis function evaluated on the $i$-th training example.
- $y^{(i)}$ is the actual output value for the $i$-th training example.
- $m$ is the total number of training examples.
The goal is to find the values for $w_0$ and $w_1$ (in general, for $w$) that minimize this loss function. By minimizing the MSE, we are essentially adjusting the weights to make the predictions as close as possible to the actual target values.
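The loss above translates directly into code. The short sketch below (an illustration, not the article's code) computes the MSE on the house-price rows from the table earlier, for one hypothetical choice of $w_0$ and $w_1$.

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # sizes in feet^2
y = np.array([460.0, 232.0, 315.0, 178.0])      # prices in $1000's

def mse(w0, w1, x, y):
    """Mean squared error J(w) = (1/m) * sum((h_w(x_i) - y_i)^2)."""
    predictions = w0 + w1 * x
    return np.mean((predictions - y) ** 2)

print(mse(0.0, 0.2, x, y))   # loss for one (hypothetical) choice of w0, w1
```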
The figure below illustrates the shape of the hypothesis function for various values of $w_0$ and $w_1$, and its impact on the loss function. The blue points on the graph represent the data points corresponding to $x$ and $y$. The red line is drawn using the values of $w_0$ and $w_1$. The objective is to find the optimal weights that minimize the loss, as demonstrated in the middle figure.
Effect of varying $w_0$, $w_1$ on the loss function.
It is impractical to manually test different values and select the best one based on minimal loss. Instead, we use an iterative procedure called gradient descent to efficiently find the minimum loss.
Gradient Descent for Minimizing MSE
To minimize the MSE loss function, we typically use optimization techniques like Gradient Descent. Gradient Descent updates the weights iteratively based on the gradient (or derivative) of the loss function with respect to the weights.
When we plot the loss over various values of $w_0$ and $w_1$, we get a surface like the one shown below. We are interested in finding the $w_0$ and $w_1$ that give us the lowest point.
Landscape of the Loss function
Beginning with an initial estimate of the weights, we will utilize the directions provided by derivatives to systematically progress toward the minimum. As we approach the minimum, the gradient becomes zero, prompting the algorithm to halt.
- Compute the Gradient: Calculate the gradient of the loss function with respect to each weight:

$$\frac{\partial J(w)}{\partial w_j} = \frac{2}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

where $x_0^{(i)} = 1$, so that the same expression covers the bias term $w_0$.

- Update the Weights: Adjust the weights in the direction that reduces the loss function:

$$w_j := w_j - \alpha \frac{\partial J(w)}{\partial w_j}$$

Here, $\alpha$ is the learning rate, which controls the size of the step taken during each update.
By iterating this process, the weights $w_0$ and $w_1$ are adjusted to minimize the MSE. The blue points on the graph below represent the data points corresponding to $x$ and $y$. The red line is drawn using the values of $w_0$ and $w_1$. The right-side graph visually illustrates how gradient descent systematically minimizes the loss function.
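To make the update rule concrete before handing the gradients over to PyTorch's autograd in the next section, here is a hand-written gradient descent sketch for the 1D hypothesis (illustrative only; the data are the four house-price rows, with sizes scaled down for stable step sizes).

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0]) / 1000.0  # scaled sizes (hypothetical scaling)
y = np.array([460.0, 232.0, 315.0, 178.0])              # prices in $1000's

w0, w1 = 0.0, 0.0   # initial estimates of the parameters
alpha = 0.1         # learning rate
m = len(x)

for _ in range(1000):
    predictions = w0 + w1 * x
    error = predictions - y
    # Gradients of J(w) = (1/m) * sum(error^2) with respect to w0 and w1
    grad_w0 = (2.0 / m) * np.sum(error)
    grad_w1 = (2.0 / m) * np.sum(error * x)
    # Update step: move against the gradient
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1)   # parameters after gradient descent
```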
Implementation using PyTorch
Let's apply a basic perceptron to solve a simple 1-D regression problem using PyTorch.
import torch
from torch import nn
from torch import optim
import numpy as np
from matplotlib import pyplot as plt
# Generating random 1D data
np.random.seed(42)
x_data = np.sort(-2. + 4. * np.random.rand(20))
y_data = 5. * x_data + 2.5 + np.random.randn(20)
X = torch.tensor(x_data[:,np.newaxis], dtype=torch.float32)
Y = torch.tensor(y_data[:,np.newaxis], dtype=torch.float32)
# Linear regression hypothesis using Pytorch
h = nn.Linear(1, 1, bias=True) # h = w x + b
# Gradient Descent optimizer
optimizer = optim.SGD(h.parameters(), lr = .1)
Cost = nn.MSELoss() # mean squared error
# Run gradient descent for 50 iterations
for i in range(50):
    optimizer.zero_grad()   # reset gradients from the previous step
    out = h(X)              # forward pass: predictions for all training points
    loss = Cost(out, Y)     # mean squared error
    loss.backward()         # backward pass: compute gradients
    optimizer.step()        # update the parameters
# Evaluate the fitted model and plot the result
out = h(X)
loss = Cost(out, Y)
plt.plot(x_data, y_data, 'b.')
x = torch.tensor(np.linspace(-2.0, 2.0, 100).reshape(-1,1), dtype=torch.float32)
y = h(x).detach().numpy()
plt.plot(x, y, 'r')
plt.ylim([-15, 15]), plt.xlim([-2, 2])
plt.xlabel('x'), plt.ylabel('y')
# Note: nn.Linear stores the intercept in h.bias (w0) and the slope in h.weight (w1)
plt.title(f'$w_0$={h.bias.item():.1f}, $w_1$={h.weight.item():.1f}, loss={loss.item():.2f}')
plt.grid(True)
plt.show()
Artificial Neural Network
A single-layer artificial neural network is formed by stacking perceptrons; the example shown has a single hidden layer consisting of two perceptrons. This configuration is called a "single-layer" network because it has only one layer of perceptrons between the input and the output.
A single-layer, multi-output neural network can generate multiple outputs, as shown in the figure. For instance, in a multi-class classification problem, a single-layer network can have multiple output neurons, each corresponding to a different class.
As we move towards more complex problems, single-layer networks may not be sufficient to capture the intricate patterns and relationships within the data. This is where multi-layer artificial neural networks come into play.
A general artificial neural network comprises multiple hidden layers, each containing numerous neurons or perceptrons. These networks are often referred to as "deep" networks, where the term "deep" indicates the presence of multiple hidden layers.
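For illustration (a sketch, not a model from the original article), a small "deep" network with two hidden layers can be assembled in PyTorch with nn.Sequential. The layer sizes here are arbitrary, and ReLU (covered in the next section) supplies the nonlinearity between layers.

```python
import torch
from torch import nn

# A small deep network: input -> two hidden layers -> output.
# The layer sizes (4, 8, 8, 3) are arbitrary, chosen only for illustration.
deep_net = nn.Sequential(
    nn.Linear(4, 8),   # first hidden layer
    nn.ReLU(),
    nn.Linear(8, 8),   # second hidden layer
    nn.ReLU(),
    nn.Linear(8, 3),   # output layer with 3 outputs (e.g. 3 classes)
)

x = torch.randn(5, 4)      # a batch of 5 inputs with 4 features each
print(deep_net(x).shape)   # torch.Size([5, 3])
```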
Activation Function
In the discussion of artificial neural networks (ANNs), an essential component that we have yet to explore is the activation function. The activation function introduces nonlinearity into the network, which is crucial for enabling the network to learn and represent non-linear complex patterns and relationships within the data.
If all the neurons in an ANN use an identity activation function, also known as a linear activation function, the network behaves in a purely linear fashion. This means that, regardless of the number of layers or the network's architecture, the overall function that the network computes remains linear.
In mathematical terms, a linear activation function for a neuron is simply:

$$f(z) = z$$

where $z$ is the input to the neuron. If every neuron in the network uses this linear function, the output of the network is just a linear transformation of the input. Consequently, the network's capacity to capture and model complex, non-linear relationships in the data is severely limited. Below we discuss two activation functions.
Common Nonlinear Activation Functions
Sigmoid Function:
- The sigmoid function maps input values to a range between 0 and 1.
- It is often used in binary classification problems.
- The function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- While it is useful for certain applications, the sigmoid function can suffer from issues like vanishing gradients, especially in deep networks.
Rectified Linear Unit (ReLU):
- ReLU is one of the most widely used activation functions in deep learning.
- It introduces nonlinearity by outputting the input directly if it is positive and zero otherwise:

$$f(z) = \max(0, z)$$
- ReLU is computationally efficient and helps mitigate the vanishing gradient problem, making it particularly useful in deep networks. However, it can suffer from "dying ReLU" where neurons can become inactive for all inputs.
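Both functions can be evaluated directly with PyTorch. The short sketch below (for illustration, using made-up input values) shows how sigmoid squashes values into (0, 1) while ReLU zeroes out the negative part.

```python
import torch

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.sigmoid(z))   # values squashed into (0, 1)
print(torch.relu(z))      # negative inputs clipped to 0, positive inputs passed through
```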
Generalized Backpropagation Algorithm
The Generalized Backpropagation Algorithm is a cornerstone of deep learning, enabling artificial neural networks (ANNs) to learn from data by optimizing their weights and biases. This algorithm is based on gradient-based optimization that iteratively adjusts the parameters of the network to minimize the error between the predicted outputs and the actual target values.
Backpropagation works by propagating the error from the output layer back through the network, layer by layer, to update the weights. This process involves two main stages, the forward pass and the backward pass, followed by a weight update.
- Forward Pass: During this stage, the input data is passed through the network, layer by layer, until it reaches the output layer. At each layer, the input is transformed by the weights and biases, and then passed through an activation function to produce the output for that layer. By the end of the forward pass, the network has produced a set of predictions.
- Backward Pass: After computing the output, the backward pass begins by calculating the error between the network's predictions and the actual target values. This error is then propagated backward through the network. During this process, the algorithm calculates the gradient of the error with respect to each weight using the chain rule of calculus. These gradients indicate how much each weight contributes to the error, guiding how the weights should be adjusted to reduce the error.
- Update Weights: Once the gradients are computed, the weights of the network are updated.
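Deep learning frameworks implement this procedure automatically via automatic differentiation. The sketch below (illustrative, using PyTorch autograd rather than a hand-written chain rule, with made-up data) shows one forward pass, one backward pass, and one weight update for a tiny two-layer network.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))  # tiny two-layer network
loss_fn = nn.MSELoss()
lr = 0.01

x = torch.randn(8, 3)   # a small batch of inputs (made-up data)
y = torch.randn(8, 1)   # corresponding targets

# Forward pass: compute predictions and the error
predictions = net(x)
loss = loss_fn(predictions, y)

# Backward pass: propagate the error, filling each parameter's .grad with dLoss/dParam
loss.backward()

# Update weights: take one gradient descent step
with torch.no_grad():
    for param in net.parameters():
        param -= lr * param.grad
        param.grad.zero_()   # clear gradients before the next iteration
```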
Application
Let's apply Artificial Neural Networks (ANNs) to classify the MNIST handwritten digits dataset. This dataset contains images of handwritten digits, each sized 28x28 pixels. The goal is to accurately classify each digit image into one of the 10 possible categories. We will use PyTorch to do that.
# Imports
import torch
from torch import nn
from torch import optim
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision import transforms
from torchvision.transforms import ToTensor
import numpy as np
from matplotlib import pyplot as plt
We first download the MNIST dataset. The Dataset and DataLoader classes simplify this process by providing an easy-to-use API for downloading, batching, and shuffling data. For more details, refer to the PyTorch Data Loading Tutorial.
training_data = datasets.MNIST(
    root='~/Downloads/',
    train=True,
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor(),
        # transforms.Normalize((0.1307,), (0.3081,))
    ])
)
test_data = datasets.MNIST(
    root='~/Downloads/',
    train=False,
    download=True,
    transform=ToTensor()
)
batchsize = 64
train_dataloader = DataLoader(training_data, batch_size=batchsize, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=8, shuffle=True)
Create a multiclass logistic regression (softmax) model and train it.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

class mnistClassifier(nn.Module):
    def __init__(self):
        super(mnistClassifier, self).__init__()
        # Single linear layer mapping the 784 flattened pixels to 10 class scores
        self.layer1 = nn.Linear(784, 10, bias=True)

    def forward(self, x):
        # log-softmax produces log-probabilities, which pair with NLLLoss below
        x = torch.log_softmax(self.layer1(x), dim=1)
        return x

hypothesis = mnistClassifier().to(device)  # move the model to the same device as the data
optimizer = optim.SGD(hypothesis.parameters(), lr = .001)
Cost = nn.NLLLoss() # Negative log likelihood loss
J_history = []
for epoch in range(10):
    running_loss = 0
    for i, data in enumerate(train_dataloader):
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)
        # flatten each 28x28 image into a 784-dimensional vector
        inputs = inputs.reshape(inputs.shape[0], -1)
        optimizer.zero_grad()
        # forward pass
        out = hypothesis(inputs)
        loss = Cost(out, labels)
        # backward pass
        loss.backward()
        # update parameters
        optimizer.step()
        running_loss += loss.item()
        if i % 300 == 0:
            print(f'Epoch {epoch+1}:{i+1} Loss: {loss.item()}')
    J_history += [running_loss]
Plot the convergence of gradient descent with respect to the number of epochs.
from matplotlib import pyplot as plt
plt.plot(J_history)
plt.title('Convergence plot of gradient descent')
plt.xlabel('No of Epochs')
plt.ylabel('J')
plt.show()
Test on a batch of test images and display them with their predicted labels.
# test on one batch of test data
inputs_im, labels = next(iter(test_dataloader))
inputs = inputs_im.to(device)
labels = labels.to(device)
inputs = inputs.reshape(inputs.shape[0],-1)
out = hypothesis(inputs)
pr = torch.argmax(out, dim=1) # predicted labels
fig = plt.figure()
for i in range(6):
    plt.subplot(2, 3, i+1)
    plt.tight_layout()
    im = torch.squeeze(inputs_im[i].detach(), dim=0).numpy()
    plt.imshow(im, cmap='gray', interpolation='none')
    plt.title("Predicted: {}".format(pr[i].item()))
    plt.xticks([])
    plt.yticks([])
plt.show()
Conclusion
Artificial Neural Networks are a powerful tool in the arsenal of machine learning, capable of tackling a wide range of tasks from image classification to financial forecasting. By understanding the core concepts such as perceptrons, loss minimization, backpropagation, and activation functions, you can harness the full potential of ANNs for your specific applications. Whether you are dealing with regression or classification problems, ANNs offer a robust framework for building intelligent systems.