Kaggle Haberman's Survival Data Set - Exploratory Data Analysis

December 12th, 2021

The Kaggle Competition Page for downloading the Dataset

I have already downloaded the Haberman dataset from Kaggle to my local machine, so below I will load the data from my local drive.

Description of the Data

The Haberman's Survival dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Label/Attribute Information:

  • Age of the patient at the time of operation (numerical)
  • Year of operation (year minus 1900, numerical)
  • Number of positive axillary nodes detected - see the note below on this (numerical)
  • Survival status (class attribute): 1 means the patient survived 5 years or longer, 2 means the patient died within 5 years

A note on axillary lymph nodes and their relation to breast cancer diagnosis

Source

The lymphatic system is one of the body’s primary tools for fighting infection. This system contains lymph fluid and lymph nodes, which occur in critical areas in the body. Cancer cells sometimes enter and build up in the lymph system.

Lymph nodes are responsible for filtering lymph fluid and detecting chemical changes that signal an infection is present. When these filter points are in the armpit, doctors call them axillary lymph nodes.

As axillary lymph nodes are near the breasts, they are often the first location to which breast cancer spreads if it moves beyond the breast tissue.

The number of axillary lymph nodes can vary from person to person, ranging from five nodes to more than 30.

After a breast cancer diagnosis, a doctor will often check whether cancer cells have spread to the axillary lymph nodes. This can help confirm the diagnosis and staging of the cancer. Breast cancer can spread to any lymph nodes. Most often, it spreads to the axillary lymph nodes first (in the armpit), and then to the nodes in the collarbone (clavicular) or the breast (internal mammary).

# In Kaggle I needed to upgrade seaborn to version 0.11. Otherwise I got the error: module 'seaborn' has no attribute 'histplot'
# First verify the version with print(sns.__version__). If it is 0.10 or lower, upgrade to 0.11
# Then restart the session from the top right corner in Kaggle,
# because if seaborn has already been imported, I will be stuck with that version until I restart the session
# !pip install --upgrade pip
# !pip install seaborn --upgrade

import pandas as pd
import seaborn as sns
print(sns.__version__)
import matplotlib.pyplot as plt
import numpy as np
import pandas_profiling
# https://drive.google.com/file/d/1GzeBrb6NEnFoChpSGveToFeBwLOU8dD2/view

original_df = pd.read_csv('../input/haberman.csv', names=['age', 'year', 'nodes', 'status'])
# original_df = pd.read_csv('https://raw.githubusercontent.com/rohan-paul/Multiple-Dataset/main/Haberman/haberman.csv', names=['age', 'year', 'nodes', 'status'] )
original_df.head()
original_df.shape

(306, 4)
original_df['status'].value_counts()
original_df.head(10)
age year nodes status
0 30 64 1 1
1 30 62 3 1
2 30 65 0 1
3 31 59 2 1
4 31 65 4 1
5 33 58 10 1
6 33 60 0 1
7 34 59 0 2
8 34 66 9 2
9 34 58 30 1
original_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     306 non-null    int64
 1   year    306 non-null    int64
 2   nodes   306 non-null    int64
 3   status  306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB

Pandas Profiling

The pandas-profiling module lets us do a quick exploratory data analysis. It generates an interactive report in web format covering the basics of the analysis.

pandas_profiling.ProfileReport(original_df)
original_df.plot(kind='scatter', x='age', y='year')
plt.show()

The above scatter plot does not reveal much, so let us do it slightly differently with seaborn, plotting 'age' against 'nodes' (axillary nodes) and coloring the points by survival status.

A note on seaborn's FacetGrid

From its documentation

This class maps a dataset onto multiple axes arrayed in a grid of rows and columns that correspond to levels of variables in the dataset. The plots it produces are often called “lattice”, “trellis”, or “small-multiple” graphics.

It can also represent levels of a third variable with the hue parameter, which plots different subsets of data in different colors. This uses color to resolve elements on a third dimension, but only draws subsets on top of each other and will not tailor the hue parameter for the specific visualization the way that axes-level functions that accept hue will.

The basic workflow is to initialize the FacetGrid object with the dataset and the variables that are used to structure the grid. Then one or more plotting functions can be applied to each subset by calling FacetGrid.map() or FacetGrid.map_dataframe(). Finally, the plot can be tweaked with other methods to do things like change the axis labels, use different ticks, or add a legend.

plt.close()
sns.set_style('whitegrid')
sns.FacetGrid(original_df, hue='status', height=6).map(plt.scatter, 'age', 'nodes').add_legend()
plt.show()

Observation from the above scatter plot

In this plot, classes 1 and 2 overlap heavily in the 'age' vs 'nodes' plane, so separating the classes using these two features alone is not feasible.

Pair-Plot for getting spot estimates of variable interdependence

In our dataset the target column 'status' itself is numeric. It would hence be included in the pairgrid as a column/row. This is undesired as only the independent variables should be included in the pair-plot. To select ONLY the desired variables that shall be included in the grid, use the pairplot's vars keyword.

We can do this either by passing 'vars' an explicit list of column names, or by using the line below to drop the last column:

vars=original_df.columns[:-1]

For the number of pair plots

  • Here we will have the "number of combinations of 3 separate things taken 2 at a time", i.e. C(3, 2) = 3!/(2!(3−2)!) = 3. We have 3 features and we select 2 of them for each single plot, so there are 3 distinct plots.

  • And we don't consider the graphs below the diagonal, as they are the same graphs with the axes swapped, which does not affect our observation. So we will only analyze the graphs above the diagonal.

Drawbacks of Pair Plots

Pair plots are only useful when the number of features is not too high. With a large number of features it becomes impractical to use pair plots for classification efficiently. In those situations, other mathematical tools like PCA and t-SNE for dimensionality reduction may be of help.


# plt.close()
# sns.set_style('whitegrid')
# sns.pairplot(original_df, hue='status', height=6, vars=['age', 'year', 'nodes'])
# plt.show()

Now, as we can see, the plot coloring is not good. The reason is that the target variable takes the numerical values 1 and 2, which is not ideal here: it should be a categorical column taking 'yes' and 'no' as its values (for 1 and 2 respectively).

So let us modify the target column values so they are more meaningful, and also convert the column's dtype to categorical.

original_df['status'] = original_df['status'].map({1:"yes", 2:"no"})
original_df['status'] = original_df['status'].astype('category')

plt.close()
sns.set_style('whitegrid')
sns.pairplot(original_df, hue='status', height=6, vars=['age', 'year', 'nodes'])
plt.show()

Observation from the above Pair Plot

Age vs Year of Operation - highly overlapping, and thus not useful for making any classification decision.

Year of Operation vs Number of Axillary Nodes - again highly overlapping, and thus not useful for classification.

Age vs Number of Axillary Nodes - still quite overlapping, but better than the two cases above, so we may select these two features for univariate analysis.

Conclusion - the age of the patient and the number of axillary nodes are the best features for predicting the chances of survival.

Uni-Variate analysis using histograms, PDF and CDF

  • Distribution plots are useful for visually assessing how the data points are distributed.

  • Typically the data points are grouped into buckets (bins), and the height of the bar for each group increases with the number of data points that lie within that group (histogram).

  • The Probability Density Function (PDF) describes the relative likelihood of the variable taking a value x; it can be seen as a smoothed version of the histogram.

# First splitting the original dataframe into two containing 'yes' and 'no' for the survival_status
survival_status_yes = original_df[original_df.status == 'yes']
survival_status_no = original_df[original_df.status == 'no']

survival_status_yes.head()
age year nodes status
0 30 64 1 yes
1 30 62 3 yes
2 30 65 0 yes
3 31 59 2 yes
4 31 65 4 yes
survival_status_no.head()
age year nodes status
7 34 59 0 no
8 34 66 9 no
24 38 69 21 no
34 39 66 0 no
43 41 60 23 no

A Note on Probability Density Function

The probability distribution of a continuous random variable is described by a probability density function, a function defined over a continuous range of values. The probability of observing any single exact value is 0, since the number of values the random variable may take is infinite. For example, a random variable X may take all values over an interval of real numbers. Then the probability that X lies in a set of outcomes A, P(A), is defined to be the area above A and under a curve. The curve, which represents a function p(x), must satisfy the following:

1: The curve has no negative values (p(x) ≥ 0 for all x).

2: The total area under the curve is equal to 1.

A curve meeting these requirements is often known as a density curve. Some examples of continuous probability distributions are normal distribution, exponential distribution, beta distribution, etc.

There is another function that often pops up in the literature and that you should know about: the cumulative distribution function (CDF). All random variables (discrete and continuous) have a cumulative distribution function. It gives the probability that the random variable X is less than or equal to x, for every value x. For a discrete random variable, the cumulative distribution function is found by summing up the probabilities.

Unless you are implementing these functions yourself, they are all available in scipy.stats (scipy.stats.norm for the normal distribution).

e.g. for the CDF, use this code:

from scipy.stats import norm
print(norm.cdf(x, mean, std))

The area under a curve y = f(x) from x = a to x = b is the integral of f(x) dx from x = a to x = b, and SciPy has quick, easy ways to compute such integrals. The probability of any single point cannot stand on its own because the total area under the curve is 1 (unless the density is something like a delta function), so for any particular value of interest you get 0 ≤ probability < 1. There may be different ways of doing it, but a conventional approach is to work with intervals along the x-axis, for example confidence intervals.
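
As a minimal sketch (purely illustrative: a standard normal is assumed, and the interval endpoints a and b are hypothetical), the probability of falling inside an interval is the difference of two CDF values:

from scipy.stats import norm

mean, std = 0, 1       # assumed parameters for this illustration
a, b = -1.96, 1.96     # hypothetical interval of interest

# P(X == x) is 0 for a continuous variable; intervals are what carry probability
p_interval = norm.cdf(b, mean, std) - norm.cdf(a, mean, std)
print(p_interval)      # ~0.95, the area under the density curve between a and b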



Univariate Analysis (Histogram, PDF, CDF)

Histogram

A histogram consists of adjacent rectangles erected on the x-axis, split into discrete intervals called bins, each with an area proportional to the frequency of occurrences in that bin. It counts the data points in each bin and shows the bins on the x-axis and the counts on the y-axis. In our case we will draw one histogram for each of the three features: 'age', 'nodes', and 'year'.

A density plot is a smoothed, continuous version of a histogram estimated from the data. The most common form of estimation is known as Kernel Density Estimate. It is used for visualizing the Probability Density of a continuous variable.

Kernel density estimation (KDE) represents the data using a continuous probability density curve in one or more dimensions. A histogram approximates the underlying probability density function that generated the data by binning and counting observations. KDE presents a different solution to the same problem: rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate:
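
As a minimal sketch of the same idea (this is not what seaborn runs internally, just the principle, using scipy.stats.gaussian_kde on the 'nodes' column of the already-loaded original_df):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

nodes = original_df['nodes'].values
kde = gaussian_kde(nodes)                           # Gaussian kernel, automatic bandwidth
xs = np.linspace(nodes.min(), nodes.max(), 200)

plt.hist(nodes, bins=25, density=True, alpha=0.4)   # normalized histogram
plt.plot(xs, kde(xs))                               # smooth density estimate on top
plt.xlabel('nodes')
plt.show()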

To make density plots in seaborn we can use the histplot function, since it lets us draw each feature's distribution with a single function call, with a density curve drawn on top of the corresponding histogram.

Below, passing kde=True adds a kernel density estimate that smooths the histogram, providing complementary information about the shape of the distribution.

The curve on top of the histogram is the density plot, essentially a smoothed version of the histogram, scaled so that it shares the histogram's y-axis.

First I will draw a histogram (with its KDE curve) for each feature, colored by 'status', starting with 'year'.

sns.FacetGrid(original_df, hue='status', height=6) \
    .map(sns.histplot, 'year', kde=True) \
    .add_legend()
sns.FacetGrid(original_df, hue='status', height=6) \
    .map(sns.histplot, 'age', kde=True) \
    .add_legend()
sns.FacetGrid(original_df, hue='status', height=6) \
    .map(sns.histplot, 'nodes', kde=True) \
    .add_legend()
# And below I am doing the same histogram + KDE plot for all three variables in a loop
# Keeping it commented out for now
# for column in original_df.columns[:-1]:
#     sns.FacetGrid(original_df, hue='status', height=6).map(sns.histplot, column, kde=True).add_legend()
#     plt.show()

Observation looking at the Histogram and PDF above

  • From the above three figures, the PDF of survival status based on age is highly overlapping. Between ages 40 and 65 years, the percentage of patients who survived and the percentage who did not survive are almost the same, so this feature is not suitable for classification.

  • The same goes for survival status based on the year of operation: it is highly overlapping, so using this data point we cannot predict anything.

  • 'axillary nodes' is the most clearly discernible feature. Generally, people tend to survive longer if fewer axillary nodes are detected, and vice versa, but even this data point is not hugely helpful for classification on its own.


CDF (Cumulative Distribution Function) along with PDF

The CDF gives us the cumulative plot of the PDF, so we can read off the exact percentage of patients up to any given value.

(Note: earlier I plotted and calculated the PDFs with seaborn; now I will use numpy's histogram function.)

A note on the numpy.histogram() function

The function returns two values: hist, the array of histogram counts, and bin_edges, a float array containing the bin edges, whose length is one more than hist.

A bin is a range that represents the width of a single bar of the histogram along the x-axis. You could also call it an interval. (Wikipedia defines bins more formally as "disjoint categories".)

The NumPy histogram function does not draw the histogram; it computes the occurrences of the input data that fall within each bin, which in turn determines the area (not necessarily the height, if the bins are not of equal width) of each bar.

In this example:

 np.histogram([1, 2, 1], bins=[0, 1, 2, 3])

There are 3 bins, for values ranging from 0 to 1 (excluding 1), 1 to 2 (excluding 2), and 2 to 3 (including 3), respectively. The way NumPy defines these bins is by taking a list of delimiters ([0, 1, 2, 3] in this example); it also returns the bin edges in the result, since it can choose them automatically from the input if none are specified. If bins=5, for example, it will use 5 bins of equal width spread between the minimum and the maximum input value.

The input values are 1, 2 and 1. Therefore, bin "1 to 2" contains two occurrences (the two 1 values), and bin "2 to 3" contains one occurrence (the 2). These results are in the first item in the returned tuple: array([0, 2, 1]).

Since the bins here are of equal width, you can use the number of occurrences for the height of each bar. When drawn, you would have:

  • a bar of height 0 for range/bin [0,1] on the X-axis,
  • a bar of height 2 for range/bin [1,2],
  • a bar of height 1 for range/bin [2,3].
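
A quick runnable check of the example above (expected outputs shown as comments):

import numpy as np

counts, bin_edges = np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
print(counts)      # [0 2 1]  -> occurrences per bin
print(bin_edges)   # [0 1 2 3] -> the bin delimiters we passed in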

Implementing numpy.histogram while reading an image

histogram, bin_edges = np.histogram(image, bins=256, range=(0, 1))

As stated above, the parameter bins determines the histogram size, or the number of “bins” to use for the histogram. We pass in 256 because we want to see the pixel count for each of the 256 possible values in the grayscale image.
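
As a minimal sketch (with a hypothetical random array standing in for a real grayscale image), the call produces 256 counts and 257 bin edges:

import numpy as np

# hypothetical grayscale "image": random floats in [0, 1)
image = np.random.default_rng(0).random((64, 64))
histogram, bin_edges = np.histogram(image, bins=256, range=(0, 1))
print(histogram.shape, bin_edges.shape)   # (256,) (257,)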

counts, bin_edges = np.histogram(original_df['year'], bins=10, density=True)
pdf = counts/sum(counts)
print('bin_edges ', bin_edges)
# bin_edges  [58.  59.1 60.2 61.3 62.4 63.5 64.6 65.7 66.8 67.9 69. ]
print('pdf ', pdf)
# pdf  [0.20588235 0.09150327 0.08496732 0.0751634  0.09803922 0.10130719 0.09150327 0.09150327 0.08169935 0.07843137]
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['PDF of Survived', 'CDF of Survived'])
plt.xlabel('Year of Operation')
plt.show()

bin_edges  [58.  59.1 60.2 61.3 62.4 63.5 64.6 65.7 66.8 67.9 69. ]
pdf  [0.20588235 0.09150327 0.08496732 0.0751634  0.09803922 0.10130719
 0.09150327 0.09150327 0.08169935 0.07843137]
counts, bin_edges = np.histogram(original_df['age'], bins=10, density=True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['PDF of Age of Patient', 'CDF of Age of Patient'])
plt.xlabel('Age of Patient')
plt.show()
# Note, in below I am using `survival_status_yes` instead of `original_df`
# Because I want to get the CDF plot of only for survived people
counts, bin_edges = np.histogram(survival_status_yes['nodes'], bins=10, density=True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['PDF of Axillary Lymph Nodes', 'CDF of Axillary Lymph Nodes'])
plt.xlabel('Axillary Lymph Nodes')
plt.show()

Observation from CDF curve

From the above CDF (orange line) we can say that there is around an 85% chance of survival if the number of axillary nodes detected is < 5. And we can see from the PDF plot that as the number of axillary nodes increases, the chances of survival decrease.

It is observed that 80%-85% of people have good chances of survival if few axillary nodes are detected, and that survival decreases as the node count increases.

And patients with 40 or more axillary nodes are very unlikely to survive, as the survivors' PDF stays at essentially 0 from 40 nodes onwards.

Box Plot and Whiskers

A box plot depicts groups of numerical data through their quartiles (the 25th, 50th and 75th percentiles) and also shows the outliers in the dataset. These percentiles are also known as the lower quartile, median and upper quartile.

A box plot is built from the following parameters (a small computation sketch follows this list):

  • median (Q2 / 50th percentile): the middle value of the dataset.

  • first quartile (Q1 / 25th percentile): the middle number between the smallest value (not the "minimum") and the median of the dataset.

  • third quartile (Q3 / 75th percentile): the middle value between the median and the highest value (not the "maximum") of the dataset.

  • interquartile range (IQR): the range from the 25th to the 75th percentile.

  • whiskers: the lines extending beyond the box.

  • outliers: points lying beyond the whiskers.

  • "maximum" (upper whisker limit): Q3 + 1.5*IQR

  • "minimum" (lower whisker limit): Q1 - 1.5*IQR
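
As a minimal sketch (assuming original_df is loaded as above, and using the 'nodes' column purely as an example), these parameters can be computed directly:

import numpy as np

q1, median, q3 = np.percentile(original_df['nodes'], [25, 50, 75])
iqr = q3 - q1
lower_whisker_limit = q1 - 1.5 * iqr
upper_whisker_limit = q3 + 1.5 * iqr

print(q1, median, q3, iqr)
outlier_mask = (original_df['nodes'] < lower_whisker_limit) | (original_df['nodes'] > upper_whisker_limit)
print('points beyond the whisker limits (potential outliers):', outlier_mask.sum())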


sns.boxplot(x='status', y='nodes', data=original_df)

Looking at the above box plot:

  • The lower edge of the box is the 25th percentile.

  • The middle line is the 50th percentile (median).

  • The upper edge of the box is the 75th percentile.

  • The height of the box, spanning the 25th to the 75th percentile, is the interquartile range.

  • The vertical lines extending beyond the boxes are the whiskers, whose length is at most 1.5 times the interquartile range.

sns.boxplot(x='status', y='year', data=original_df)
sns.boxplot(x='status', y='age', data=original_df)

Observation from Box Plots

First, looking at the plot of survival_status vs nodes:

  • Patients with fewer than 5 axillary nodes tend to survive.
  • Around 80% of the patients have fewer than 11 axillary nodes.
  • Almost all patients who survived 5 years or longer after surgery had at most about 8 positive axillary nodes, as depicted by the upper whisker of the survival_status vs nodes plot.
  • However, there are a lot of outlier data points for the "nodes" feature, especially in the group of people who survived.

Now, looking at the plot of survival_status vs age

  • The 25th to 75th percentile range for class 'yes' lies between ages 43 and 60, and for class 'no' between ages 47 and 61.

  • Comparing the two classes' distributions (survival status 'yes' and 'no') over the age attribute, they appear thoroughly mixed: they have almost the same median and their 25th to 75th percentile ranges closely match. Hence age is not helpful for distinguishing the classes.

Violin Plot

A violin plot combines the information of a PDF and a box plot: the outer curve denotes the PDF, and the middle area denotes the box plot.

sns.violinplot(x='status', y='nodes', data=original_df, size=8)
plt.title('Violin plot of Axillary Nodes and Survival status')
plt.show()

sns.violinplot(x='status', y='age', data=original_df, size=8)
plt.title('Violin plot of Age and Survival status')
plt.show()
sns.violinplot(x='status', y='year', data=original_df, size=8)
plt.title('Violin plot of Year of Operation and Survival status')
plt.show()

Observation from Violin Plot

  • For the survival-status vs nodes plot, the survivors' density is concentrated around 0-7 nodes; the greater the number of nodes, the lower the chances of survival.

  • From the same plot, we also see that the largest share of patients who survived had 0 nodes. On the other hand, a small percentage of patients who had zero axillary nodes still died within 5 years of the operation, so the absence of positive axillary nodes cannot guarantee survival.

  • The survival-status vs age plot shows that the age attribute is roughly normally distributed, but given the significant overlap, a patient's age alone is not a deciding factor for survival.

  • For both the age and year attributes there is a substantial overlap of data points, making it difficult to set a threshold that separates the two classes of patients.

Contour Plot (Multivariate probability density )

A quite common type of chart in the scientific world is the contour plot or contour map. This visualization is suitable for displaying three-dimensional surfaces through a map composed of closed curves showing the points on the surface that lie at the same level, i.e. that have the same z value. A contour line, or isoline, of a function of two variables is a curve along which the function has a constant value; a contour is effectively a cross-section of the three-dimensional graph.

# This will be a 2D density plot
sns.jointplot(x='year', y='age', data=original_df, kind="kde" )
plt.show()
# This will be a 2D density plot
sns.jointplot(x='nodes', y='age', data=original_df, kind="kde" )
plt.show()


Observation from Contour Plot

Between 1960 and 1964, the largest number of operations were performed on patients in the age group 45 to 55.
