November 20th, 2021

For this post, I will consider a very simple neural network architecture with the following structure.
Our neural network contains one hidden layer with three nodes.
Every arrow in the diagram carries exactly one adjustable float value, its weight. In total there are 9 weights to tune (6 between the input layer and the hidden layer, and 3 between the hidden layer and the output), so that when the input is (1, 1), the output is as close to 0 as possible.
This is what we mean by training the neural network. For simplicity, we have not introduced a bias value yet; the underlying logic remains the same.
Now let's go over the basics of a Neural Network
A neuron holds a mathematical function known as an activation function, a set of inputs (x1 and x2 here), a vector of weights (w1 and w2 here) and a bias (b).
A neuron first computes the weighted sum of the inputs.
The activation function is simply a mathematical function that takes this weighted sum (plus the bias) as its input and produces the neuron's output, typically by squashing or clipping the value in a non-linear way. The output is then passed forward to the neurons in the subsequent layer.
Input layer: This layer takes the independent variables as input.
Hidden (intermediate) layers: These layers connect the input and output layers while transforming the input data. The hidden layers contain nodes (the circles in the above diagram) that map their input values into higher- or lower-dimensional representations. More complex representations are achieved by applying activation functions to the values of the nodes in these layers.
Output layer: This layer produces the values that the input variables are expected to result in.
The number of nodes (circles in the preceding diagram) in the output layer depends on the task at hand and whether we are trying to predict a continuous variable or a categorical variable. If the output is a continuous variable, the output layer has one node. If the output is categorical with m possible classes, there will be m nodes in the output layer.
Let’s zoom into one of the nodes/neurons and see what’s happening. A neuron transforms its inputs as follows:

a = f(b + w₁x₁ + w₂x₂ + … + wₙxₙ)

As you can see, the neuron computes the sum of the products of weight and input pairs, adds the bias term b, and passes the result through a function f. The function f is the activation function, which applies non-linearity on top of this sum of products.
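As a minimal sketch of what a single neuron does, the snippet below computes the weighted sum plus bias and applies a sigmoid activation. The weights, bias and the choice of sigmoid are hypothetical values picked for illustration, not taken from the diagram.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = np.dot(w, x) + b
    return activation(z)

x = np.array([1.0, 1.0])    # inputs x1, x2
w = np.array([0.5, -0.5])   # hypothetical weights w1, w2
b = 0.0                     # hypothetical bias

a = neuron_output(x, w, b)
print(a)  # 0.5, because the weighted sum is 0 and sigmoid(0) = 0.5
```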
Calculating the hidden layer unit values
We’ll now assign weights to all of the connections. As is standard practice, a neural network is initialized with random weights before training starts.
Let’s start with weights that are randomly initialized between 0 and 1, but note that the final weights after training don’t need to stay within any particular range. A formal representation of weights and values in the network is provided in the following diagram (left half) and the randomly initialized weights are provided in the network in the right half.

In the next step, we perform the multiplication of the input with weights to calculate the values of hidden units in the hidden layer. The hidden layer’s unit values before activation are obtained as follows:
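A minimal sketch of this step, assuming the six input-to-hidden weights are stored in a 2 x 3 matrix (one row per input, one column per hidden unit). The random values here are generated with a fixed seed for illustration and are not the ones from the diagram.

```python
import numpy as np

# Input (1, 1) as a single row vector.
x = np.array([[1.0, 1.0]])                  # shape (1, 2)

# Hypothetical random initialization in [0, 1); not the diagram's values.
rng = np.random.default_rng(seed=0)
W_hidden = rng.random((2, 3))               # shape (2, 3): inputs x hidden units

# Hidden unit values before activation: one weighted sum per hidden unit.
h = x @ W_hidden                            # shape (1, 3)
print(h)
```

Because both inputs are 1, each hidden value is simply the sum of the two weights feeding into that unit.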

Applying the activation function
Activation functions help in modeling complex relations between the input and the output. Some of the frequently used activation functions are calculated as follows (where x is the input):
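As a sketch, here are four widely used activation functions written with NumPy. Which functions to list is an assumption on my part (sigmoid, tanh, ReLU and linear are common choices); the definitions themselves are the standard ones.

```python
import numpy as np

def sigmoid(x):
    """Squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Squashes values into (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Keeps positive values, replaces negative values with 0."""
    return np.maximum(0.0, x)

def linear(x):
    """Identity: returns the input unchanged."""
    return x

x = np.array([-2.0, 0.0, 2.0])
for fn in (sigmoid, tanh, relu, linear):
    print(fn.__name__, fn(x))
```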


Loss values (computed by loss functions, alternatively called cost functions) are the values that we minimize when training a neural network. To understand how loss values get calculated, let’s look at two scenarios:
Categorical variable prediction
Continuous variable prediction
Calculating loss during continuous variable prediction
Typically, when the variable is continuous, the loss value is calculated as the mean of the squared differences between actual values and predictions. That is, we try to minimize the mean squared error by varying the weight values of the neural network. The mean squared error is calculated as follows:

MSE = (1/m) Σᵢ (yᵢ − pᵢ)²

where yᵢ is the actual value, pᵢ is the predicted value, and m is the total number of data points.
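A minimal sketch of the mean squared error in NumPy. The function name and the sample values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between actual values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical actual values and predictions.
loss = mean_squared_error([0.0, 1.0, 2.0], [0.5, 1.0, 1.0])
print(loss)  # (0.25 + 0 + 1) / 3
```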

Calculating loss during categorical variable prediction
When the variable to predict is discrete (that is, there are only a few categories in the variable), we typically use a categorical cross-entropy loss function. When the variable to predict has only two distinct values, the loss function is binary cross-entropy. Binary cross-entropy is calculated as follows:

BCE = −(1/m) Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

y is the actual value of the output, p is the predicted value of the output, and m is the total number of data points.
Categorical cross-entropy is calculated as follows:

CCE = −(1/m) Σᵢ Σⱼ yᵢⱼ log(pᵢⱼ), with j running over the C classes

y is the actual value of the output, p is the predicted value of the output, m is the total number of data points, and C is the total number of classes.
A simple way of visualizing cross-entropy loss is to look at the prediction matrix itself. Say you are predicting five classes (Dog, Cat, Rat, Cow, and Hen) in an image recognition problem. The neural network would have five neurons in the last layer with softmax activation. It is thus forced to predict a probability for every class, for every data point. Say there are five images and the prediction probabilities look like so (the highlighted cell in each row corresponds to the target class):

Note that each row sums to 1. In the first row, when the target is Dog and the prediction probability is 0.88, the corresponding loss is 0.128 (the negative of the natural log of 0.88). The other losses are computed in the same way. As you can see, the loss value is lower when the probability of the correct class is high. Since probabilities range between 0 and 1, the minimum possible loss is 0 (when the probability is 1) and the loss grows towards infinity as the probability approaches 0. The final loss within a dataset is the mean of all individual losses across all rows.
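The cross-entropy calculations can be sketched in NumPy as below. Only the 0.88 probability for Dog comes from the example above; the other four probabilities in the row are hypothetical values chosen so that the row sums to 1. The small `eps` clipping is my own addition to avoid taking the log of 0.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy over all data points."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Mean categorical cross-entropy; rows are data points, columns are classes."""
    y_onehot = np.asarray(y_onehot, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

# Classes: Dog, Cat, Rat, Cow, Hen. Target is Dog.
# 0.88 is from the text; the remaining probabilities are hypothetical.
p_row = np.array([[0.88, 0.05, 0.03, 0.02, 0.02]])
y_row = np.array([[1, 0, 0, 0, 0]])

loss = categorical_cross_entropy(y_row, p_row)
print(round(loss, 3))  # 0.128
```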
Feedforward propagation
A high-level strategy of coding feedforward propagation is as follows:
1. Compute the weighted sum of the inputs at every neuron.
2. Compute activation.
3. Repeat the first two steps at each neuron until the output layer.
4. Compute the loss by comparing the prediction with the actual output.
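The steps above can be sketched for the small network from this post (2 inputs, 3 hidden units, 1 output). Several details are assumptions for illustration: sigmoid activation in the hidden layer, a linear output unit, squared error as the loss, no biases, and seeded random weights rather than the diagram's values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W1, W2, y):
    """One forward pass; returns the prediction and the squared-error loss."""
    h = x @ W1                        # step 1: weighted sums at the hidden units
    a = sigmoid(h)                    # step 2: hidden activations
    y_hat = a @ W2                    # step 3: repeat for the output layer (linear output assumed)
    loss = np.mean((y - y_hat) ** 2)  # step 4: compare prediction with the actual output
    return y_hat, loss

rng = np.random.default_rng(seed=0)
x = np.array([[1.0, 1.0]])            # input (1, 1)
y = np.array([[0.0]])                 # target output 0
W1 = rng.random((2, 3))               # 6 input-to-hidden weights
W2 = rng.random((3, 1))               # 3 hidden-to-output weights

y_hat, loss = feedforward(x, W1, W2, y)
print("prediction:", y_hat.item(), "loss:", loss)
```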
In feedforward propagation, we connected the input layer to the hidden layer, which then was connected to the output layer. In the first iteration, we initialized weights randomly and then calculated the loss resulting from those weight values. In backpropagation, we take the reverse approach. We start with the loss value obtained in feedforward propagation and update the weights of the network in such a way that the loss value is minimized as much as possible.
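The loss value is reduced as we perform the following steps:
1. Change each weight in the network by a small amount, one at a time.
2. Measure the change in the loss value when the weight is changed.
3. Update the weight in the direction that reduces the loss, by an amount proportional to that change.
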
Note that the update made to a particular weight is proportional to the amount of loss that is reduced by changing it by a small amount. Intuitively, if changing a weight reduces the loss by a large value, then we can update the weight by a large amount. However, if the loss reduction is small by changing the weight, then we update it only by a small amount.
Backpropagation reduces the error by changing the values of the weights and biases. To do this, we calculate the rate of change of the error with respect to each weight.
First, consider the cost as a function of the weights alone, C = C(w). You number the weights w₁, w₂, …, and want to compute ∂C/∂wᵢ for some particular weight wᵢ. An obvious way of doing that is to use the approximation

∂C/∂wᵢ ≈ (C(w + ϵeᵢ) − C(w)) / ϵ

where ϵ > 0 is a small positive number, and eᵢ is the unit vector in the iᵗʰ direction. In other words, we can estimate ∂C/∂wᵢ by computing the cost C for two slightly different values of wᵢ, and then applying the above equation. The same idea can be used to compute the partial derivatives ∂C/∂b with respect to the biases. But there’s a problem with this approach: with millions of weights it becomes impossibly slow, to the point that it often cannot realistically be used at all.
To see why, imagine we have a million weights in our network. Then for each distinct weight wᵢ we need to compute C(w + ϵeᵢ) in order to compute ∂C/∂wᵢ. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example), plus one more pass to compute C(w) itself.
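A minimal sketch of this brute-force approach on the 9-weight network from this post, with a counter that shows how many forward passes it needs. The flattened weight layout, sigmoid hidden layer, linear output and squared-error cost are assumptions made for illustration.

```python
import numpy as np

counter = {"forward_passes": 0}   # tracks how often the cost is evaluated

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, x, y):
    """Cost C(w) for a 2-3-1 network whose 9 weights are stored in one flat vector."""
    counter["forward_passes"] += 1
    W1 = w[:6].reshape(2, 3)
    W2 = w[6:].reshape(3, 1)
    y_hat = sigmoid(x @ W1) @ W2
    return float(np.mean((y - y_hat) ** 2))

def numerical_gradient(w, x, y, eps=1e-6):
    """Estimate dC/dw_i as (C(w + eps * e_i) - C(w)) / eps for every weight."""
    base = cost(w, x, y)                      # one pass for C(w)
    grad = np.zeros_like(w)
    for i in range(w.size):                   # one extra pass per weight
        w_shift = w.copy()
        w_shift[i] += eps
        grad[i] = (cost(w_shift, x, y) - base) / eps
    return grad

rng = np.random.default_rng(seed=0)
w = rng.random(9)
x = np.array([[1.0, 1.0]])
y = np.array([[0.0]])

grad = numerical_gradient(w, x, y)
print("forward passes used:", counter["forward_passes"])  # 10 = 9 weights + 1
```

With 9 weights this costs 10 forward passes; with a million weights the same loop would need a million and one.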
That's where the chain rule in backpropagation comes to the rescue, with a specific order of operations that is highly efficient.
What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives ∂C/∂wᵢ using just one forward pass through the network, followed by one backward pass through the network.
And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes of the previous method.
First, take a look at our Neural Network again.


The hidden layer activation value (sigmoid activation) is calculated as follows:


Note that this change in the loss value C with respect to a change in the weight w_11 is the most important quantity to calculate during the backpropagation part of neural network training. Hence we need a way to calculate it as efficiently as possible.

Note that, in the preceding equation, we have built a chain of partial derivatives in such a way that we can differentiate each of the four components individually and ultimately calculate the derivative of the loss value with respect to the weight w_11.
Why we calculate this partial derivative using the chain rule
We create a chain so that each individual partial derivative is easy to calculate, and so that we can obtain the derivative between two variables that are not directly connected, i.e. the inner layer weights and the loss.
Now, the individual partial derivatives in the preceding equation are computed as follows:
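A worked version of the chain and its four factors is sketched here. The notation is an assumption, since only w_11 is named in the text: x_1 and x_2 are the inputs, h_11 is the value of the first hidden unit before activation, a_11 is its sigmoid activation, w_31 is the weight from that unit to the output, ŷ is the prediction (linear output assumed), and the loss is taken to be the squared error.

```latex
% Assumed forward pass (illustrative notation):
%   h_{11} = w_{11} x_1 + w_{21} x_2
%   a_{11} = \sigma(h_{11}) = \frac{1}{1 + e^{-h_{11}}}
%   \hat{y} = w_{31} a_{11} + w_{32} a_{12} + w_{33} a_{13}
%   C = (y - \hat{y})^2

\frac{\partial C}{\partial w_{11}}
  = \frac{\partial C}{\partial \hat{y}}
    \cdot \frac{\partial \hat{y}}{\partial a_{11}}
    \cdot \frac{\partial a_{11}}{\partial h_{11}}
    \cdot \frac{\partial h_{11}}{\partial w_{11}}

\frac{\partial C}{\partial \hat{y}} = -2\,(y - \hat{y})          \tag{1}
\frac{\partial \hat{y}}{\partial a_{11}} = w_{31}                \tag{2}
\frac{\partial a_{11}}{\partial h_{11}} = a_{11}\,(1 - a_{11})   \tag{3}
\frac{\partial h_{11}}{\partial w_{11}} = x_1                    \tag{4}

\frac{\partial C}{\partial w_{11}}
  = -2\,(y - \hat{y}) \cdot w_{31} \cdot a_{11}\,(1 - a_{11}) \cdot x_1
```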



Note that the preceding equation (3) comes from the fact that the derivative of the sigmoid function σ(x) is σ(x)(1 − σ(x)).



Finally, with all four component-wise partial derivatives in place, the gradient of the loss value with respect to w_11 is calculated by replacing each partial derivative in the chain with the corresponding value from the previous steps, as follows:

From the preceding formula, we can see that we are now able to calculate the impact on the loss value of a small change in the weight value (the gradient of the loss with respect to weight) without brute-forcing our way by recomputing the feedforward propagation again.
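One way to sanity-check a chain-rule gradient is to compare it against the brute-force finite-difference estimate for the same weight. The sketch below does that for w_11, under assumptions of my own: sigmoid hidden units, a linear output, squared-error loss, no biases, seeded random weights, and w_11 stored at `W1[0, 0]`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Returns hidden pre-activations, hidden activations and the prediction."""
    h = x @ W1
    a = sigmoid(h)
    y_hat = a @ W2
    return h, a, y_hat

rng = np.random.default_rng(seed=0)
x = np.array([[1.0, 1.0]])
y = np.array([[0.0]])
W1 = rng.random((2, 3))
W2 = rng.random((3, 1))

h, a, y_hat = forward(x, W1, W2)

# Chain rule: multiply the four individual partial derivatives.
dC_dyhat = -2.0 * (y - y_hat).item()        # (1) loss w.r.t. prediction
dyhat_da11 = W2[0, 0]                       # (2) prediction w.r.t. hidden activation
da11_dh11 = a[0, 0] * (1.0 - a[0, 0])       # (3) sigmoid derivative
dh11_dw11 = x[0, 0]                         # (4) pre-activation w.r.t. w_11
grad_analytic = dC_dyhat * dyhat_da11 * da11_dh11 * dh11_dw11

# Finite-difference estimate for the same weight (needs an extra forward pass).
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
C0 = ((y - y_hat) ** 2).item()
C1 = ((y - forward(x, W1_shift, W2)[2]) ** 2).item()
grad_numeric = (C1 - C0) / eps

print(grad_analytic, grad_numeric)
```

The two numbers should agree to several decimal places, while the analytic version reuses values already computed in the forward pass.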
And now finally, we will go ahead and update the weight value as follows:
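A minimal sketch of the update step, applied to all nine weights at once and repeated for a number of iterations. The rule used is plain gradient descent (weight minus learning rate times gradient); the learning rate of 0.1, the 100 iterations, the sigmoid hidden layer, the linear output and the squared-error loss are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(seed=0)
x = np.array([[1.0, 1.0]])    # input (1, 1)
y = np.array([[0.0]])         # target output 0
W1 = rng.random((2, 3))       # input-to-hidden weights
W2 = rng.random((3, 1))       # hidden-to-output weights

learning_rate = 0.1           # hypothetical value
losses = []

for step in range(100):
    # Forward pass.
    h = x @ W1
    a = sigmoid(h)
    y_hat = a @ W2
    losses.append(((y - y_hat) ** 2).item())

    # Backward pass: chain rule applied to every weight at once.
    d_yhat = -2.0 * (y - y_hat)              # dC/dy_hat, shape (1, 1)
    grad_W2 = a.T @ d_yhat                   # dC/dW2, shape (3, 1)
    d_h = (d_yhat @ W2.T) * a * (1.0 - a)    # dC/dh, shape (1, 3)
    grad_W1 = x.T @ d_h                      # dC/dW1, shape (2, 3)

    # Update: move each weight against its gradient.
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2

print("first loss:", losses[0], "last loss:", losses[-1])
```

The loss for the input (1, 1) shrinks towards 0 as the updates are repeated, which is the training objective stated at the start of the post.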
