Vectorizing Gradient Descent — Multivariate Linear Regression and Python implementation

November 1st, 2021

In this article, I shall go over how to arrive at the Vectorized Gradient-Descent formula for the Cost function when the training data is expressed in Matrix form, along with the fundamentals of Calculus (especially Partial Derivatives) and Matrix Derivatives necessary to understand the process.

So the target of this article is to understand the full Mathematics and the flow behind arriving at the formula below, which is the Vectorized Gradient over the training-data Matrix.

First a Refresher on basic Matrix Algebra

A matrix A over a field K or, simply, a matrix A (when K is implicit) is a rectangular array of scalars usually presented in the following form:

The rows of such a matrix A are the m horizontal lists of scalars:

and the columns of A are the n vertical lists of scalars:

A matrix with m rows and n columns is called an m by n matrix, written m × n. The pair of numbers m and n is called the size of the matrix. Two matrices A and B are equal, written A = B, if they have the same size and if corresponding elements are equal. Thus, the equality of two m × n matrices is equivalent to a system of mn equalities, one for each corresponding pair of elements. A matrix with only one row is called a row matrix or row vector, and a matrix with only one column is called a column matrix or column vector. A matrix whose entries are all zero is called a zero matrix and will usually be denoted by 0.

Matrix Multiplication

The below image is taken from Khan Academy’s excellent linear algebra course.

In the above, each entry in the product matrix is the dot product of a row in the first matrix and a column in the second matrix.

More explanation for the higher-dimension case: if the product AB = C is defined, where C is denoted by [cij], then the element cij is obtained by multiplying the elements in the ith row of A by the corresponding elements in the jth column of B and adding. Thus, if A has order k × n and B has order n × p, then

There are four simple rules that will help us in multiplying matrices, listed here

  1. Firstly, we can only multiply two matrices when the number of columns in matrix A is equal to the number of rows in matrix B.

  2. Secondly, the first row of matrix A multiplied by the first column of matrix B gives us the first element in the matrix AB, and so on.

  3. Thirdly, when multiplying, order matters: in general, AB ≠ BA.

  4. Lastly, the element at row i, column j of AB is the dot product of the ith row of matrix A and the jth column of matrix B.

The order in which we multiply matters. We must keep the matrices in order, but we do have some flexibility: matrix multiplication is associative, so the parentheses can be moved, i.e. (AB)C = A(BC).
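We can sanity-check both properties quickly with NumPy; this is just an illustrative sketch with arbitrary random matrices:

import numpy as np

# Arbitrary example matrices with compatible shapes
A = np.random.randint(10, size=(2, 3))
B = np.random.randint(10, size=(3, 4))
C = np.random.randint(10, size=(4, 2))

# Associativity: the parentheses can be moved, (AB)C == A(BC)
print(np.allclose((A @ B) @ C, A @ (B @ C)))   # True

# Order still matters: A @ B has shape (2, 4), while B @ A is not even defined here
print((A @ B).shape)                           # (2, 4)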

Implementing a dot product with NumPy

This will create integers as the array elements:

import numpy as np

A = np.random.randint(10, size=(2, 3))
B = np.random.randint(10, size=(3, 2))

This will create 2 matrices, e.g.

[[9 0 8]
 [3 5 5]]

[[5 2]
 [0 7]
 [0 7]]

print(np.dot(A, B))

# Output
[[45 74]
 [15 76]]

The Hadamard Product — Other Kinds of Matrix Multiplication

Hadamard multiplication is defined for matrices of the same shape as the multiplication of each element of one matrix by the corresponding element of the other matrix. Hadamard multiplication is often denoted by ∘ (or ⊙); for two matrices A (n × m) and B (n × m) we have
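In NumPy, the `*` operator on two arrays of the same shape performs exactly this element-wise product, while `@` / `np.dot` performs the ordinary matrix product. A small illustration with made-up matrices:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise (Hadamard) product: each entry multiplied by its counterpart
print(A * B)
# [[ 5 12]
#  [21 32]]

# Contrast with the matrix (dot) product, which sums over an inner dimension
print(A @ B)
# [[19 22]
#  [43 50]]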

So, in general mathematical form, for the single independent variable case we have

Multiple Linear Regression (MLR)

Suppose that the response variable Y and at least one predictor variable xi are quantitative. Then the equation for a specific Y value under the MLR model is

And now in matrix notation, these n sets of equations become

where Y is the n × 1 vector of the response (dependent) variable, X is the n × p matrix of the k independent/explanatory variables (usually the first column is a column of ones for the constant term), β is a p × 1 vector of unknown coefficients, and e is an n × 1 vector of unknown errors. Equivalently,

where Y is the output vector for the n training examples, X is the n × p matrix in which the ith row is the ith training example, and β is the weight vector of size p, one weight per feature.
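As a concrete (toy) illustration of this matrix form, the sketch below builds a design matrix X with a leading column of ones and generates y = Xβ + e with placeholder numbers; it only mirrors the notation above, not any particular dataset:

import numpy as np

n, p = 5, 3                                   # 5 training examples, 3 coefficients (incl. the constant)
rng = np.random.default_rng(0)

features = rng.random((n, p - 1))             # raw predictor columns
X = np.hstack([np.ones((n, 1)), features])    # first column of ones for the constant term
beta = rng.random(p)                          # placeholder coefficient vector (p x 1)
e = rng.normal(0, 0.1, n)                     # error term (n x 1)

y = X @ beta + e                              # the MLR model in matrix form: Y = Xβ + e
print(X.shape, beta.shape, y.shape)           # (5, 3) (3,) (5,)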

Different types of Matrix Differentiation

1-Differentiation with Respect to a Scalar

Differentiation of a structure (vector or matrix, for example) with respect to a scalar is quite simple; it just yields the ordinary derivative of each element of the structure in the same structure. Thus, the derivative of a vector or a matrix with respect to a scalar variable is a vector or a matrix, respectively, of the derivatives of the individual elements.

1.A-Derivatives of Vectors with Respect to Scalars

The derivative of the vector y(x) = (y1, ..., yn) with respect to the scalar x is the vector

The second or higher derivative of a vector with respect to a scalar is likewise a vector of the derivatives of the individual elements; that is, it is an array of higher rank.

1.B-Derivatives of Matrices with Respect to Scalars

The derivative of the matrix Y(x) with respect to the scalar x is defined as below

The second or higher derivative of a matrix with respect to a scalar is likewise a matrix of the derivatives of the individual elements.

1.C — Derivatives of Functions with Respect to Scalars

Differentiation of a function of a vector or matrix that is linear in the elements of the vector or matrix involves just the differentiation of the elements, followed by application of the function. For example, the derivative of a trace of a matrix is just the trace of the derivative of the matrix. On the other hand, the derivative of the determinant of a matrix is not the determinant of the derivative of the matrix.

1.D — Higher-Order Derivatives with Respect to Scalars

Because differentiation with respect to a scalar does not change the rank of the object (“rank” here means rank of an array, i.e. its “shape”), higher-order derivatives with respect to a scalar are simply objects of the same rank whose elements are the higher-order derivatives of the individual elements.

2.A — Derivatives of Scalars with Respect to Vectors; The Gradient

The derivative of a scalar-valued function with respect to an n-vector is the n-vector of the partial derivatives with respect to each element of the argument, if those derivatives exist. This vector is called the gradient of the scalar-valued function, and is sometimes denoted by ∇f(x).

2.B — Derivatives of Vectors with Respect to Vectors; The Jacobian

The derivative of an m-vector-valued function of an n-vector argument consists of nm scalar derivatives. These derivatives could be put into various structures. Two obvious structures are an n × m matrix and an m × n matrix.

For a function,

if those derivatives exist. This derivative is called the matrix gradient and is denoted by ∇f for the vector-valued function f . (Note that the ∇ symbol can denote either a vector or a matrix, depending on whether the function being differentiated is scalar-valued or vector-valued.)

The m × n matrix is called the Jacobian of f and is denoted by Jf as below

So we have m = n functions and parameters, in this case. Generally speaking, though, the Jacobian matrix is the collection of all m × n possible partial derivatives (m rows and n columns), which is the stack of m gradients with respect to x:
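To make the Jacobian concrete, here is a small numerical sketch: a toy function f: R² → R³ and a finite-difference approximation of its 3 × 2 Jacobian. Both the toy function and the helper `jacobian_fd` are made up purely for illustration:

import numpy as np

def f(x):
    # Toy vector-valued function f: R^2 -> R^3
    return np.array([x[0] ** 2, x[0] * x[1], np.sin(x[1])])

def jacobian_fd(f, x, eps=1e-6):
    # Approximate the m x n Jacobian by finite differences, one column per input variable
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - fx) / eps
    return J

x = np.array([1.0, 2.0])
print(jacobian_fd(f, x))
# Analytic Jacobian at (1, 2): [[2, 0], [2, 1], [0, cos(2)]]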

So if we are predicting house price with the above MLR equation, then θ0 will be the basic/base price of a house, θ1 the price per room, and θ2 the price per KM of distance from the nearest Airport.

Using the definition of matrix multiplication, our multivariate hypothesis function can be concisely represented as:

So now the Cost function takes the following form

where the thetas θ are the weights, and the above partial derivative for any weight θj will be as below

So the Gradient-Descent process for Multivariate case becomes

And that's why we take the transpose of θ to multiply with column-vector x to get the hypothesis (as earlier mentioned in this article)
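Putting the hypothesis and the cost together in NumPy, with made-up house-price numbers for θ (base price, price per room, price per km), might look like this sketch:

import numpy as np

# Hypothetical numbers for the house-price example above:
# theta = [base price, price per room, price per km from the nearest airport]
theta = np.array([50.0, 10.0, -0.5])

x = np.array([1.0, 3.0, 12.0])        # one example: [1, rooms, km], leading 1 for the intercept term
print(theta.T @ x)                     # hypothesis for one example: theta^T x = 50 + 30 - 6 = 74.0

X = np.array([[1.0, 3.0, 12.0],
              [1.0, 2.0,  5.0]])       # the training matrix, one row per example
y = np.array([70.0, 68.0])             # made-up target prices

h = X @ theta                          # vectorized hypotheses for every example at once
cost = np.sum((h - y) ** 2) / (2 * len(y))   # squared-error cost, 1/(2m) * sum((h - y)^2)
print(h, cost)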

Refresher — Matrix-Derivative Identities required for the Mathematical Derivation of the Gradient of a Matrix w.r.t. Vectors

The derivative of a matrix is usually referred to as the gradient and denoted as ∇. Consider a function


Matrix Transpose Identities

Because of the associativity of matrix multiplication, this relation can be extended as
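These transpose identities, (AB)ᵀ = BᵀAᵀ and its three-factor extension (ABC)ᵀ = CᵀBᵀAᵀ, are easy to sanity-check numerically with random matrices:

import numpy as np

rng = np.random.default_rng(1)
A, B, C = rng.random((2, 3)), rng.random((3, 4)), rng.random((4, 5))

print(np.allclose((A @ B).T, B.T @ A.T))            # (AB)^T = B^T A^T
print(np.allclose((A @ B @ C).T, C.T @ B.T @ A.T))  # (ABC)^T = C^T B^T A^T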

Now let's proceed to find the Gradient of the Cost function.

First a Refresher on the Gradient Descent Algorithm

To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θj. In other words, you need to calculate how much the cost function will change if you change θj just a little bit. This is called a partial derivative. It is like asking “What is the slope of the mountain under my feet if I face east?” and then asking the same question facing north.

Now you would recognize the very well-known cost function

Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting ∇θMSE(θ) from θ. This is where the learning rate η comes into play: multiply the gradient vector by η to determine the size of the downhill step.
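In NumPy, one such Batch Gradient Descent step might look like the sketch below, assuming the usual MSE gradient (2/m) Xᵀ(Xθ − y); the function name is mine, not the notebook's:

import numpy as np

def batch_gd_step(theta, X, y, eta):
    # Gradient of MSE over the full training set: (2/m) * X^T (X theta - y)
    m = len(y)
    gradient = (2 / m) * X.T @ (X @ theta - y)
    # Go downhill: subtract eta times the gradient from theta
    return theta - eta * gradient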

where Y is the output vector for the n training examples, X is the n × p matrix in which the ith row is the ith training example, and β is the weight vector of size p, one weight per feature.

Note that β in the above is not a scalar, but a vector.

Now we have the RSS defined as

Then for the following assumption

Then for the whole matrix (i.e. the whole training data set, or the whole set of hypothesis equations), we will get

which, in the final vector form, is

Note that in the last equality I had to take the Transpose of X, because matrix multiplication is a dot product of the rows of the first matrix with the columns of the second matrix, and the number of columns of the 1st matrix must equal the number of rows of the 2nd matrix.

So, by transposing, the p-th column of X ends up being the p-th row of X-Transposed. Thus, when taking the dot product of the p-th row of X-Transposed with (y − Xβ), the dimensions match perfectly, as I am using all of the entries of the p-th column of X.
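Assuming the final vector form is ∇β RSS = −2 Xᵀ(y − Xβ), as the transpose argument above indicates, a quick finite-difference check confirms the formula on random data:

import numpy as np

rng = np.random.default_rng(2)
X = rng.random((6, 3))
y = rng.random(6)
beta = rng.random(3)

def rss(b):
    # Residual sum of squares ||y - Xb||^2
    return np.sum((y - X @ b) ** 2)

analytic = -2 * X.T @ (y - X @ beta)          # the vectorized gradient derived above

# Finite-difference approximation of each component of the gradient
eps = 1e-6
numeric = np.array([(rss(beta + eps * np.eye(3)[j]) - rss(beta)) / eps for j in range(3)])
print(np.allclose(analytic, numeric, atol=1e-3))   # True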

Alternative-2 — Below is an alternative calculation for arriving at the same Vectorized Gradient formula for training data in Matrix form.

Here, I am denoting the coefficients with θ or Theta (instead of the β that we used above in our Alternative-1 Gradient Calculation, only to keep the two presentations distinguishable).

Again assume we have our Training Set of data as below

So we can say the below

And now again, we need to use the same vector identity mentioned above, that for a vector z, we have

We have already introduced the trace operator of a Matrix, written as “tr.” Now we need to use a couple more matrix derivative identities (which I am just stating below; they all have robust mathematical proofs, the details of which I am not including here).

So the below 2 Matrix Derivative Identities hold true, and we need to use them to arrive at the Gradient Calculation.

So now Final Gradient Calculation will be as below

And also the below Matrix Identity

And the above is the exact formula that we will implement in Python/NumPy very soon below.

So now let's go back to the original Cost Function

Compare the above with the Gradient-Descent formula for the Numerical case

A manual example of the Gradient-Descent implementation

Let's say for a simple single-variable training dataset we have the following values

x   y
1   1
2   2
3   3
4   4

Further, assume,

α (learning rate) = 1

m (number of training examples) = 4

Setting θ0 to 0 and θ1 to 1

So we have the linear equation

at the i-th row

Further I denote,

So, regardless of how many times I apply the GD algorithm, the value of θ1 will remain constantly equal to 1, since at every iteration we have θ0 = 0 and θ1 = 1.
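The same conclusion can be verified numerically; this small sketch runs a few vectorized GD iterations on the toy dataset above and shows that θ does not move:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])    # prepend the bias column of ones
theta = np.array([0.0, 1.0])                 # theta0 = 0, theta1 = 1
alpha, m = 1.0, len(y)

for _ in range(5):
    gradient = X.T @ (X @ theta - y) / m     # every prediction equals y, so the gradient is zero
    theta = theta - alpha * gradient

print(theta)   # still [0. 1.], exactly as the manual calculation shows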

Now let's extend the above to a multivariable case.

Assume theta values have been picked at random as below

And the GD-Algorithm is,


Python/Numpy Implementation

Please refer to the Jupyter notebook.

First, generate a training dataset in Matrix form

NumPy zeros() function: with it you can create an array that contains only zeros, using a specific shape. The shape is in (rows, columns) format. Its syntax is as below.

For example, the code to generate a Matrix of 2 by 3 (2 rows and 3 columns):

import numpy as np

A = np.zeros(shape=(2, 3))
print(A)

Output below

[[0. 0. 0.]
 [0. 0. 0.]]
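The gen_data() helper used next lives in the accompanying notebook; the sketch below is my guess at what such a generator looks like, matching the output shown (a column of ones plus one feature, and a noisy linear target). The exact noise model in the notebook may differ:

import numpy as np

def gen_data(num_points, bias, variance):
    # Sketch of a training-data generator: x gets a leading column of ones,
    # and y grows linearly with the example index plus bias and uniform noise.
    x = np.zeros(shape=(num_points, 2))
    y = np.zeros(shape=num_points)
    for i in range(num_points):
        x[i][0] = 1                                   # bias/intercept column
        x[i][1] = i                                   # the single feature
        y[i] = (i + bias) + np.random.uniform(0, 1) * variance
    return x, y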

If I run the above gen_data() function for a set of 5 training examples, with bias and variance of 20 and 10 respectively, as below

gen_data(5, 20, 10)

I will have the following form of output

(array([[1., 0.],
        [1., 1.],
        [1., 2.],
        [1., 3.],
        [1., 4.]]),
 array([22.38023816, 24.18406356, 28.01360908, 26.80051617, 29.30101971]))

And now, the function for Gradient Descent implementing the Gradient formula for a Matrix that we derived above.
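The notebook's implementation is not reproduced here, so the following is a minimal sketch of such a vectorized gradient descent function; the (1/m) scaling, learning rate, and iteration count are my assumptions:

import numpy as np

def gradient_descent(x, y, theta, alpha, m, num_iterations):
    # Vectorized batch gradient descent using the matrix gradient derived above:
    # gradient = (1/m) * X^T (X theta - y), then the update theta := theta - alpha * gradient
    x_trans = x.transpose()
    for i in range(num_iterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        cost = np.sum(loss ** 2) / (2 * m)        # tracked only for monitoring convergence
        gradient = np.dot(x_trans, loss) / m
        theta = theta - alpha * gradient
        if i % 10000 == 0:
            print(f"Iteration {i} | cost: {cost:.4f}")
    return theta

# Illustrative usage with the data generated above (learning rate and iteration count are guesses):
# x, y = gen_data(5, 20, 10)
# theta = gradient_descent(x, y, np.ones(x.shape[1]), alpha=0.0005, m=y.size, num_iterations=100000)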

The full notebook is here

Reference and Further Reading

Matrix Multiplication — https://en.wikipedia.org/wiki/Matrix_multiplication

Matrix-Calculus — https://en.wikipedia.org/wiki/Matrix_calculus

Vector_Field — https://en.wikipedia.org/wiki/Vector_field

Matrix Transpose Properties — https://en.wikipedia.org/wiki/Transpose#Properties

Matrix Cookbook — https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Online Calculation of Matrix Derivative — http://www.matrixcalculus.org/
