October 9th, 2021

First, let's define some terms for clarity.
The sample space Ω
The sample space is the set of all possible outcomes of the experiment, usually denoted by Ω. For example, two successive coin tosses have a sample space of {hh, tt, ht, th}, where “h” denotes “heads” and “t” denotes “tails”.
The event space 𝒜
The event space is the space of potential results of the experiment. A subset A of the sample space Ω is in the event space 𝒜 if at the end of the experiment we can observe whether a particular outcome ω ∈ Ω is in A. The event space 𝒜 is obtained by considering the collection of subsets of Ω, and for discrete probability distributions 𝒜 is often the power set of Ω.
The probability P
With each event A ∈ 𝒜, we associate a number P(A) that measures the probability or degree of belief that the event will occur. P(A) is called the probability of A.
The probability of a single event must lie in the interval [0, 1], and the total probability over all outcomes in the sample space Ω must be 1, i.e., P(Ω) = 1. Given a probability space (Ω, 𝒜, P), we want to use it to model some real-world phenomenon. In machine learning, we often avoid explicitly referring to the probability space, and instead refer to probabilities on quantities of interest, which we denote by T. We refer to T as the target space and to elements of T as states.
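As a rough illustration of these three ingredients, the sketch below builds the two-coin-toss sample space, takes the power set as the event space, and assigns a uniform probability to each event. The uniform measure (a fair coin) and the helper names `power_set` and `prob` are assumptions made for this example, not something the definitions above require.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for two successive coin tosses.
omega = frozenset({"hh", "ht", "th", "tt"})

def power_set(s):
    """Return every subset of s as a frozenset."""
    items = sorted(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(c) for c in subsets]

# Event space: for a finite sample space, the power set of omega.
events = power_set(omega)

def prob(event):
    """Uniform probability measure (assumes a fair coin)."""
    return Fraction(len(event), len(omega))

at_least_one_head = frozenset({"hh", "ht", "th"})
print(len(events))              # 16 events in the power set
print(prob(at_least_one_head))  # 3/4
print(prob(omega))              # 1
```

Every event gets a probability in [0, 1], and the whole sample space gets probability 1, as required.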
The term probability relates to an event, while the term probability distribution relates to a random variable.
By convention, the term probability mass function refers to the probability distribution of a discrete random variable, and the term probability density function refers to the probability distribution of a continuous random variable.
First a quick reference on PMF, PDF and CDF

In order to understand the heart of modern probability, we need to extend the concept of integration from basic calculus.
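To begin, let us consider the following piecewise function:
$$f(x) = \begin{cases} 1 & \text{if } 0 < x \le 1 \\ 2 & \text{if } 1 < x \le 2 \\ 0 & \text{otherwise} \end{cases}$$
Applying the fundamental Riemann integration of calculus, we get
$$\int_0^2 f(x)\,dx = 1 + 2 = 3,$$
which has the usual interpretation as the area of the two rectangles that make up f(x).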
The question is: given f(x) = 1, what is the set of x values for which this is true? For our example, this is true whenever x ∈ (0, 1]. So now we have a correspondence between the values of the function (namely, 1 and 2) and the sets of x values for which each value is attained, namely, (0, 1] and (1, 2], respectively. To compute the integral, we simply multiply each function value (i.e., 1 and 2) by some measure of the size of the corresponding interval, and add the results.
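The idea of "function value times the size of the set where it is attained" can be sketched in a few lines. The snippet assumes, as the text describes, that f takes the value 1 on (0, 1] and the value 2 on (1, 2]. The names `level_sets` and `interval_length` are made up for this illustration, and the midpoint Riemann sum is included only as a cross-check.

```python
# Each entry pairs a function value with the interval (lo, hi] where f takes it.
# Assumption from the text: f(x) = 1 on (0, 1] and f(x) = 2 on (1, 2].
level_sets = [(1.0, (0.0, 1.0)), (2.0, (1.0, 2.0))]

def interval_length(interval):
    """Measure of an interval: its length."""
    lo, hi = interval
    return hi - lo

# Sum of (value) x (size of the set where that value is attained).
integral_by_level_sets = sum(v * interval_length(i) for v, i in level_sets)

def f(x):
    if 0.0 < x <= 1.0:
        return 1.0
    if 1.0 < x <= 2.0:
        return 2.0
    return 0.0

# Midpoint Riemann sum over [0, 2] for comparison.
n = 1000
dx = 2.0 / n
riemann_sum = sum(f((k + 0.5) * dx) * dx for k in range(n))

print(integral_by_level_sets)  # 3.0
print(riemann_sum)             # approximately 3.0
```

Both routes give the same area. The first one is the viewpoint that generalizes to probability, because it only needs a way to measure sets.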
Since areas can be defined by definite integrals, we can also define the probability of an event occurring within an interval [a, b] by the definite integral
$$P(a \le X \le b) = \int_a^b f(x)\,dx,$$
where f(x) is called the probability density function (pdf).
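A function f(x) is called a probability density function if
1. $f(x) \ge 0$ for all $x$,
2. the total area under the graph of f(x) is 1, i.e., $\int_{-\infty}^{\infty} f(x)\,dx = 1$,
3. the probability that X lies in [a, b] is $\int_a^b f(x)\,dx$,
i.e. the area under the graph of f(x) from a to b.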
In the problem above, the probability density function f(x) is called a uniform (flat) probability density function (pdf).
So fundamentally, what does a probability density at point 𝒙 mean?
The value of a probability density function at a specific point is not a probability; it is a measure of how dense the distribution is around that value. It tells us how much probability is concentrated per unit length (d𝒙) near 𝒙, or how dense the probability is near 𝒙.
For discrete random variables, we look up the value of a PMF at a single point to find the probability P(𝐗=𝒙). For continuous random variables, we integrate a PDF over an interval to find the probability that X will fall in that interval.
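One quick way to see that a density value is not a probability is to pick a distribution whose density exceeds 1. The sketch below uses a normal distribution with a small standard deviation from Python's standard `statistics` module. The choice of distribution and of the interval [-0.05, 0.05] is an assumption made purely for illustration.

```python
from statistics import NormalDist

# Hypothetical example: a normal distribution with a small standard deviation.
narrow = NormalDist(mu=0.0, sigma=0.1)

# Height of the density curve at 0: a density, not a probability.
density_at_zero = narrow.pdf(0.0)

# Probability of the interval [-0.05, 0.05]: area under the curve,
# computed here as a difference of CDF values.
prob_near_zero = narrow.cdf(0.05) - narrow.cdf(-0.05)

print(density_at_zero)  # greater than 1
print(prob_near_zero)   # between 0 and 1
```

The density at 0 is larger than 1, which no probability can be, while the area over the interval is a proper probability.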
Given a random experiment with sample space S, a random variable X is a set function that assigns one and only one real number to each element s that belongs to the sample space S.
The set of all possible values x of the random variable X is called the support, or space, of X.
Note that capital letters at the end of the alphabet, such as W, X, Y, and Z, typically represent the random variable itself. The corresponding lowercase letters, such as w, x, y, and z, represent the random variable’s possible values.
By a discrete random variable, we mean a function (or a mapping), say X, from a sample space Ω into the set of real numbers. Symbolically, if ω ∈ Ω, then X(ω) = x, where x is a real number.
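A random variable X is a discrete random variable if it has either a finite number of possible outcomes or a countably infinite number of possible outcomes.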
A countably infinite number of possible outcomes means that there is a one-to-one correspondence between the outcomes and the set of integers.
No such one-to-one correspondence exists for an uncountably infinite number of possible outcomes.
For a value x of the set of possible outcomes of the random variable X, i.e., x ∈ T, p(x) denotes the probability that random variable X has the outcome x.
For discrete random variables, this is written as P(X = x), which is known as the probability mass function. The pmf is often referred to as the “distribution”. For continuous variables, p(x) is called the probability density function (often referred to as a density).
When we say probability distribution it may pertain to a discrete random variable or a continuous random variable, depending on the context.
When the random variable is discrete, the probability distribution describes how the total probability is distributed over the possible values of the random variable. Consider the experiment of tossing two unbiased coins simultaneously. The sample space S associated with this experiment is:
S = {HH,HT,TH,TT}
If we define a random variable X as: the number of heads on this sample space S, then we will have
X(HH)=2,
X(HT)=X(TH)=1,
X(TT)=0
The probability distribution of X is then given by
P(X=0)=1/4,
P(X=1)=1/2,
P(X=2)=1/4.
For a discrete random variable, we consider events of the type {X=x} and compute probabilities of such events to describe the distribution of the random variable.
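The two-coin example can be reproduced by enumerating the sample space and counting outcomes for each value of X. This is a minimal sketch assuming both coins are unbiased, so that all four outcomes are equally likely; the variable names are chosen for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for tossing two unbiased coins: HH, HT, TH, TT.
sample_space = ["".join(t) for t in product("HT", repeat=2)]

def X(outcome):
    """Random variable: number of heads in the outcome."""
    return outcome.count("H")

# Outcomes are equally likely, so P(X = x) = (#outcomes with X = x) / 4.
counts = Counter(X(s) for s in sample_space)
pmf = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```

The events {X=0}, {X=1}, and {X=2} receive probabilities 1/4, 1/2, and 1/4, which sum to 1.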
The Probability Mass Function of a discrete random variable expresses the probability of the variable being equal to each specific value in the range of all potential discrete values defined. The sum of these probabilities over all possible values equals 1 (100%).
In mathematical form, the probability that a discrete random variable X takes on a particular value x, that is, P(X=x), is frequently denoted f(x). The function f(x) is typically called the probability mass function.
Let X be a discrete random variable with possible values denoted $x_1, x_2, \ldots, x_i, \ldots$. The probability mass function of X, denoted p, is
$$p(x_i) = P(X = x_i).$$
Stated more generally, the probability mass function, P(X=x)=f(x), of a discrete random variable X is a function that satisfies the following properties:
1. $f(x) > 0$ for every $x$ in the support $S$
2. $\sum_{x \in S} f(x) = 1$
3. $P(X \in A) = \sum_{x \in A} f(x)$
The first item says that, for every element x in the support S, the probability must be positive. Note that if x does not belong to the support S, then f(x)=0. The second item says that if you add up the probabilities for all of the possible x values in the support S, the sum must equal 1. The third item says that to determine the probability associated with the event A, you just sum up the probabilities of the x values in A.
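These three properties translate directly into a couple of small checks. The sketch below stores a PMF as a dictionary from values to exact fractions. The helper names `is_valid_pmf` and `prob_of_event` are hypothetical, and the example PMF is the number of heads in two fair coin tosses.

```python
from fractions import Fraction

def is_valid_pmf(f):
    """Properties 1 and 2: positive on the support, and summing to 1."""
    return all(p > 0 for p in f.values()) and sum(f.values()) == 1

def prob_of_event(f, A):
    """Property 3: P(X in A) is the sum of f(x) over x in A.

    Values outside the support contribute 0.
    """
    return sum(f.get(x, Fraction(0)) for x in A)

# PMF of the number of heads in two fair coin tosses (support S = {0, 1, 2}).
f = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

print(is_valid_pmf(f))           # True
print(prob_of_event(f, {1, 2}))  # 3/4, the probability of at least one head
print(prob_of_event(f, {5}))     # 0, since 5 is outside the support
```

Using `Fraction` keeps the sum-to-one check exact, which would be fragile with floating-point probabilities.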
Since f(x) is a function, it can be presented:
If a random variable can take only a finite number of discrete values, then it is discrete.
A fair die is a small cube with a natural number from 1 to 6 engraved on each side, without repetition. Fairness means that the die is made so that its weight is evenly spread and, thus, all six faces are equally likely to land face up when rolled. So, if the die is rolled, the set of numbers {1, 2, 3, 4, 5, 6} is the sample space of this experiment.
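A quick simulation can make the "equally likely" claim concrete: over many rolls, each face should appear about one sixth of the time. This is only a sketch; the seed and the number of rolls are arbitrary choices, and Python's pseudo-random generator stands in for a physical die.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rolls = 60_000

# Each roll picks one face from the sample space {1, ..., 6} uniformly.
rolls = [rng.randint(1, 6) for _ in range(n_rolls)]

counts = Counter(rolls)
relative_freq = {face: counts[face] / n_rolls for face in range(1, 7)}

for face, freq in relative_freq.items():
    print(face, round(freq, 4))  # each value should be close to 1/6
```

The relative frequencies are estimates of the PMF f(x) = 1/6 for x in {1, ..., 6}, and they get closer to it as the number of rolls grows.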
A continuous random variable differs from a discrete random variable in that it takes on an uncountably infinite number of possible outcomes.
For a discrete random variable X that takes on a finite or countably infinite number of possible values, we determined P(X=x) for all of the possible values of X, and called it the probability mass function (“p.m.f.”). For continuous random variables, the probability that X takes on any particular value x is 0. That is, finding P(X=x) for a continuous random variable X is not going to work. Instead, we’ll need to find the probability that X falls in some interval (a,b), that is, we’ll need to find P(a<X<b). We’ll do that using a probability density function (“p.d.f.”).
The Probability Density Function of a continuous random variable expresses the rate of change of the cumulative probability over the range of potential continuous values defined, and expresses the relative likelihood of getting one value in comparison with another.
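A nondiscrete random variable X is said to be absolutely continuous, or simply continuous, if its distribution function may be represented as
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du,$$
where the function f(x) has the properties
1. $f(x) \ge 0$
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$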
It follows from the above that if X is a continuous random variable, then the probability that X takes on any one particular value is zero, whereas the interval probability that X lies between two different values, say, a and b, is given by
$$P(a < X < b) = \int_a^b f(x)\,dx.$$
A function that satisfies the above requirements is called a probability function or probability distribution for a continuous random variable, but it is more often called a probability density function or simply density function. Any function f (x) satisfying Properties 1 and 2 above will automatically be a density function, and required probabilities can then be obtained from the more general form below
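To make the two properties and the interval probability concrete, the sketch below checks them numerically for one particular density. The density f(x) = exp(-x) for x ≥ 0 is an assumption chosen for illustration because its integrals have a simple closed form. The `integrate` helper is a plain midpoint rule, and the upper limit 50 stands in for infinity.

```python
import math

def f(x):
    """Hypothetical density: f(x) = exp(-x) for x >= 0, and 0 otherwise."""
    return math.exp(-x) if x >= 0 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    dx = (b - a) / n
    return sum(g(a + (k + 0.5) * dx) for k in range(n)) * dx

# Property 2: total area is 1 (the tail beyond 50 is negligible).
total_area = integrate(f, 0.0, 50.0)

# Interval probability P(1 < X < 2) as the area under f between 1 and 2.
p_interval = integrate(f, 1.0, 2.0)

# Closed form for this particular density, for comparison.
exact = math.exp(-1.0) - math.exp(-2.0)

print(total_area)
print(p_interval, exact)
```

The numerical area over (1, 2) agrees with the closed form, and the total area comes out as 1 up to numerical error.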
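A function $f : \mathbb{R}^D \to \mathbb{R}$ is called a probability density function (pdf) if
1. $f(x) \ge 0$ for all $x \in \mathbb{R}^D$
2. its integral exists and $\int_{\mathbb{R}^D} f(x)\,dx = 1$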
As you can see, the definition for the p.d.f. of a continuous random variable differs from the definition for the p.m.f. of a discrete random variable by simply changing the summations that appeared in the discrete case to integrals in the continuous case.
Recall that a density histogram (representing frequency) is defined so that the area of each rectangle equals the relative frequency of the corresponding class, and the area of the entire histogram equals 1. This suggests that finding the probability that a continuous random variable X falls in some interval of values involves finding the area under the curve f(x) between the endpoints of the interval.
So, for a large population of pizzas, the probability that a randomly selected pizza weighs between 0.20 and 0.30 pounds is this area (which is what the definite integral formula above calculates):

Some examples of well known discrete probability distributions include:
Some examples of common domains with well-known discrete probability distributions include:

Now let's see a simple, concrete example of a discrete probability distribution. First, a quick reminder of the definition:
The probability distribution of a discrete random variable X is a list of each possible value of X together with the probability that X takes that value in one trial of the experiment.
I start with a simple experiment: tossing a fair coin 10 times and measuring how many successes (heads) I observe. The number of heads observed can be used in many ways to understand the basics of probability. For example, I could repeat the experiment and count how many times I see 0 heads, 1 head, 2 heads, and so on. Here, I simply denote the outcome of each toss with ‘H’ or ‘T’.
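A minimal sketch of this experiment is shown below. It simulates one run of 10 tosses and, as an addition not spelled out in the text, tabulates the standard binomial formula for the probability of seeing exactly k heads in 10 fair tosses. The seed and the helper name `binomial_pmf` are arbitrary choices for this illustration.

```python
import random
from math import comb

rng = random.Random(42)  # fixed seed for repeatability

# One experiment: 10 tosses of a fair coin, recorded as 'H' or 'T'.
tosses = [rng.choice("HT") for _ in range(10)]
num_heads = tosses.count("H")
print("".join(tosses), "->", num_heads, "heads")

def binomial_pmf(k, n=10, p=0.5):
    """P(exactly k heads in n independent tosses with P(heads) = p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# The list of each possible value of X together with its probability.
distribution = {k: binomial_pmf(k) for k in range(11)}
for k, prob in distribution.items():
    print(k, round(prob, 4))
```

The table printed at the end is exactly the kind of list the definition describes: every possible number of heads paired with its probability, with the probabilities summing to 1.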
Now a quick and simple Math example of PDF
Let X be a continuous random variable whose probability density function is:

which is clearly not a probability! In the continuous case, f(x) is instead the height of the curve at X=x, so that the total area under the curve is 1. In the continuous case, it is areas under the curve that define the probabilities.
What is P(X=1/2)?
It is a straightforward integration to see that the probability is 0:
$$P\left(X = \tfrac{1}{2}\right) = \int_{1/2}^{1/2} f(x)\,dx = 0.$$
In general, if X is continuous, the probability that X takes on any specific value x is 0. That is, when X is continuous, P(X=x)=0 for all x in the support.
An implication of the fact that P(X=x)=0 for all x when X is continuous is that you can be less precise about the endpoints of intervals when finding probabilities of continuous random variables. That is:
$$P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b)$$
for any constants a and b.
Further explanation of the above principle
The probability of observing any single value of the continuous random variable is 0, since the number of possible outcomes of a continuous random variable is uncountably infinite. That is, for a continuous random variable, we must calculate a probability over an interval rather than at a particular point. This is why the probability for a continuous random variable can be interpreted as an area under the curve on an interval. In other words, we cannot describe the probability distribution of a continuous random variable by giving the probability of single values of the random variable, as we did for a discrete random variable. This property can also be seen from the fact that
$$P(X = c) = \int_c^c f(x)\,dx = 0$$
for any real c.
In the case of a continuous random variable, we should not ask for the probability that X is exactly a single number (since that probability is zero). Instead, we need to think about the probability that X is close to a single number.
We capture the notion of being close to a number with a probability density function which is normally denoted by P(x). If the probability density around a point x is large, that means the random variable X is likely to be close to x. If, on the other hand, P(x)=0 in some interval, then X won’t be in that interval.
So building on the Integration concept of Calculus
If the probability of X being exactly at point 𝒙 is zero, how about an extremely small interval around the point 𝒙? Say, [𝒙, 𝒙+d𝒙]?
Let’s assume d𝒙 is extremely small, say 0.00000000001.
Then the probability that X will fall in [𝒙, 𝒙+d𝒙] is the Area under the curve f(𝒙) sandwiched by [𝒙, 𝒙+d𝒙].
The Area Under a Curve — Integral Calculus Basics
The area under a curve between two points can be found by doing a definite integral between the two points. To find the area under the curve y = f(x) between x = a and x = b, integrate y = f(x) between the limits of a and b.
To translate the probability density P(x) into a probability, imagine that Ix is some small interval around the point x. Then, assuming P is continuous, the probability that X is in that interval will depend both on the density P(x) and the length of the interval
P ( X ∈ Ix) ≈ P (x) × Length of Ix
We don’t have a true equality here, because the density P may vary over the interval Ix. But the approximation becomes better and better as the interval Ix shrinks around the point x, as P will become closer and closer to a constant inside that small interval. The probability P(X ∈ Ix) approaches zero as Ix shrinks down to the point x (consistent with our above result for single numbers), but the information about X is contained in the rate at which this probability goes to zero as Ix shrinks.
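The approximation P(X ∈ Ix) ≈ P(x) × length can be checked numerically by shrinking the interval and watching the ratio of probability to length. The sketch below assumes a standard normal distribution and the point x = 1, both chosen only for illustration. It takes Ix = [x, x + length] and computes the exact interval probability as a difference of CDF values.

```python
from statistics import NormalDist

X = NormalDist(mu=0.0, sigma=1.0)  # hypothetical choice of distribution
x = 1.0

ratios = []
for length in (1.0, 0.1, 0.01, 0.001):
    # Exact probability of the interval [x, x + length] from the CDF.
    p_interval = X.cdf(x + length) - X.cdf(x)
    ratios.append(p_interval / length)
    print(f"length={length:<6} P={p_interval:.6f}  P/length={ratios[-1]:.6f}")

print("density at x:", X.pdf(x))
```

The interval probability itself goes to zero, but the ratio of probability to length settles toward the density at x, which is the sense in which the density is "probability per unit length".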
So, to determine the probability that X is in any subset A of the real numbers, we simply add up the values of P(x) in the subset. By “add up,” we mean integrate the function P(x) over the set A.
The Cumulative Distribution Function of a Discrete Random Variable expresses the theoretical or observed probability of that variable being less than or equal to any given value. It equates to the sum of the probabilities of achieving that value and each successive lower value.
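For a discrete random variable, the CDF is just a running sum of the PMF. A minimal sketch using the fair die as the example distribution (the variable names are chosen for this illustration):

```python
from fractions import Fraction
from itertools import accumulate

# PMF of a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6

# CDF: F(x) = P(X <= x) is the running sum of the PMF up to and including x.
cdf = dict(zip(values, accumulate(pmf)))

for x, F in cdf.items():
    print(f"P(X <= {x}) = {F}")
```

For instance, P(X ≤ 3) is 1/2, and the CDF reaches 1 at the largest value in the support.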

And now the same for Continuous Random Variable
The Cumulative Distribution Function of a Continuous Random Variable expresses the theoretical or observed probability of that variable being less than or equal to any given value. It equates to the area under the Probability Density Function curve to the left of the value in question.
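The "area to the left" description can be checked by integrating a PDF numerically and comparing against a library CDF. The sketch below assumes a standard normal distribution, uses a simple midpoint rule, and lets -10 stand in for minus infinity; all of these are choices made for illustration.

```python
from statistics import NormalDist

X = NormalDist(mu=0.0, sigma=1.0)  # standard normal, chosen for illustration

def area_to_left(x, lower=-10.0, n=100_000):
    """Midpoint-rule area under the PDF from `lower` up to x.

    `lower` stands in for minus infinity; the normal density
    below -10 is negligible.
    """
    dx = (x - lower) / n
    return sum(X.pdf(lower + (k + 0.5) * dx) for k in range(n)) * dx

x = 1.0
print(area_to_left(x))  # numerical area under the PDF to the left of x
print(X.cdf(x))         # library CDF for comparison
```

The two numbers agree to several decimal places, which is what "the CDF is the area under the PDF to the left of x" means in practice.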
Now implementing some very basic PDF with Python and Scipy
Another Jupyter Notebook to understand how PDF is different from Probability.