Theoretical Machine Learning Advanced Course: Probabilistic and Statistical Math (Part 1)

Kushajveer Singh
7 min read · Sep 13, 2018

Welcome to the first part of a month-long series. In this series, I will help you build your intuition for machine learning so you will no longer be the person who only knows how to import an API from scikit-learn.

How does this work? I will not cover each topic exhaustively; instead, I will build intuition in an interactive way and, at the end, point you to references where you can read more. So let’s begin.

Copyright: https://www.lordofthecraft.net/forums/topic/152684-denied-rinraguks-et-application/

Why is there a need for uncertainty (probability) in machine learning (ML)? Generally, when we are given a dataset and fit a model to it, what we actually want to capture is the underlying regularity of the data, so that the model generalizes well to unseen data. But the individual observations are corrupted by random noise (due to sources of variability that are themselves unobserved). The classic illustration of this setup in ML is polynomial curve fitting.

In curve fitting, we usually use the maximum likelihood approach, which suffers from the problem of overfitting (since we can always increase the complexity of the model until it memorizes the training data).

To overcome overfitting we use regularization techniques that penalize the parameters. We do not penalize the bias term, as its inclusion in the regularization term would make the result depend on the choice of origin.
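
As a quick illustrative sketch (the data and parameter values here are made up purely for demonstration), scikit-learn’s Ridge regression adds an L2 penalty on the weights while the intercept (bias) is fitted but left unpenalized:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures

    # Toy data: a noisy sine wave, invented purely for illustration
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    # High-degree polynomial features make it easy to overfit 10 points
    X_poly = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x[:, None])

    # alpha controls the strength of the L2 penalty on the weights;
    # the intercept (bias term) is estimated but not included in the penalty
    model = Ridge(alpha=1e-3).fit(X_poly, y)
    print(model.coef_, model.intercept_)

With alpha set to 0 this reduces to ordinary least squares, which is exactly the setting where the degree-9 polynomial starts fitting the noise.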

Quick Probability Overview

Consider the event of tossing a coin. We use events to describe the various possible states of our universe. So we represent the possible states as X = the event that heads comes up and Y = the event that tails comes up. Then P(X) = the probability of that particular event happening.

Takeaway: represent your probabilities as events of something happening.

In the case of ML, we have P(X = x), which means the probability of observing the value x for our variable X. Note that I switched from talking about an event to talking about a variable.

Next, we discuss expectation, one of the most important concepts. When we write E[X] (the expectation of X) we are saying that we want to find the average of the variable X. As a quick test, what is E[x], where x is some single observed value?

Suppose x takes the value 10; the average of that single value is the number itself.

Copyright: https://imgflip.com/i/1xopea

Next, we discuss variance, probably the most frustrating part of an ML project when we have to deal with large variances. Suppose we have found the expectation (which we generally call the mean) of some values. Then the variance tells us the average squared distance of each value from the mean, i.e. it tells us how spread out a distribution is.
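
A quick numerical check of both ideas on a handful of made-up values:

    import numpy as np

    # A few made-up observations of a variable X
    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

    mean = x.mean()                    # E[X]: the average of the observed values
    var = ((x - mean) ** 2).mean()     # Var[X]: average squared distance from the mean

    print(mean, var)                   # 5.0 4.0
    print(np.var(x))                   # same value from NumPy's built-in variance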

Another important concept in probabilistic maths is the concept of prior, posterior and likelihood.

Copyright: https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors

Prior = P(X). It is the probability available before we observe an event.

Posterior = P(X|Y). It is the probability of X after event Y has happened.

Likelihood = P(Y|X). It tells us how probable it is for the event Y to happen given the current settings, i.e. X.

Now, when we use maximum likelihood we are adjusting the values of X so as to maximize the likelihood function P(Y|X).
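
To make the prior/posterior/likelihood distinction concrete, here is a tiny made-up example: X is the hypothesis about whether a coin is fair or biased towards heads, and Y is the observation "8 heads in 10 tosses". All the numbers are assumptions chosen only for illustration.

    from math import comb

    # Prior P(X): our belief about the coin before seeing any tosses
    prior = {"fair": 0.9, "biased": 0.1}
    p_heads = {"fair": 0.5, "biased": 0.8}

    def binom_pmf(k, n, p):
        """Probability of k heads in n independent tosses with P(heads) = p."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Likelihood P(Y|X): probability of the observed data under each hypothesis
    likelihood = {h: binom_pmf(8, 10, p) for h, p in p_heads.items()}

    # Bayes' rule: posterior P(X|Y) is proportional to likelihood * prior
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

    print(posterior)  # observing 8 heads shifts some belief towards "biased"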

Note: Follow the resource list given at the end for further reading of the topic.

Gaussian Distribution

As it is by far the most commonly used distribution in the ML literature, I use it as a base case and discuss the various concepts in terms of it.

Copyright: https://makeameme.org/meme/gaussian-gaussian-is

But before that, let us discuss why we need to know about probability distributions in the first place.

When we are given a dataset and want to make predictions about new data that has not been observed, what we essentially want is a formula into which we pass an input value and out of which we get the predicted output.

Let’s see how we get to that formula. Initially, we are given input data (X). This data also comes from some formula, which I call a probability distribution (although what we observe is the original distribution corrupted by random noise). So we set a prior for the input data, and this prior comes in the form of a probability distribution (Gaussian in most cases).

Note: From a purely theoretical perspective we are simply following Bayesian statistics and filling values into Bayes’ rule.

As a quick test: if we already knew the true probability distribution of the input data, why would we need to build complex ML models?

After assuming an input distribution, we then need to assume some complexity for the model that we want to fit to the input data. Why do we want to do this? Simply because there are infinitely many curves that pass through any given set of points. It is up to us to decide whether the curve is linear, polynomial, and so on (see the sketch below).
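
A minimal sketch of this design choice (again with invented data): the same ten noisy points can be summarized by a straight line or threaded almost exactly by a degree-9 polynomial, and nothing in the data alone tells us which complexity to pick.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    # Two very different assumptions about model complexity,
    # both of which "explain" the same training points
    linear_fit = np.polyfit(x, y, deg=1)   # a straight line
    poly_fit = np.polyfit(x, y, deg=9)     # passes (almost) exactly through every point

    print(linear_fit)
    print(poly_fit)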

Now, in any ML model we have weights that we have to fine-tune. We can either treat these weights as fixed but unknown values, or we can go a bit deeper and assume a prior over these weights as well. More on this later.

Note: In this post, I am approaching ML from a statistical point of view, not the practical route of using backpropagation to fine-tune the values of the weights.

Now I present a general way of approaching the problem of finding values for the parameters of the assumed distribution.

The Gaussian equation is represented as follows:

Gaussian formula for a single variable
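
Written out (the standard density of a univariate Gaussian with mean \mu and variance \sigma^2):

    \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)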

Now, when we assume a Gaussian prior for the input data we have to deal with two parameters, namely the mean and the variance of the distribution; the same kind of hurdle appears when you assume some other distribution.

So how do we get the values of these parameters? This is where maximum likelihood comes into play. Say you observed N values as input. All these N values are i.i.d. (independent and identically distributed), so the combined joint probability (the likelihood function) can be represented as

Joint Probability for a Gaussian for N observed inputs
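
In symbols, for observations x_1, ..., x_N the i.i.d. assumption lets the likelihood factorize into a product of individual Gaussians:

    p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)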

After getting the likelihood function, we want to maximize it with respect to the parameters one by one. To make our life easier we usually take the log of the above value, since log is a monotonically increasing function (maximizing the log-likelihood is the same as maximizing the likelihood).
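
Taking the log turns the product into a sum, giving the standard log-likelihood of the univariate Gaussian:

    \ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)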

Now simply take the derivative with respect to each parameter, set it to zero, and solve for its value.

Maximum Likelihood Value for Mean of a Univariate Gaussian distribution
Maximum Likelihood Value for Variance of a Univariate Gaussian distribution
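
Setting those derivatives to zero gives the familiar closed-form estimates, namely the sample mean and the sample variance:

    \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n,
    \qquad
    \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2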

Note: In a fully Bayesian setting, these two parameters would themselves be given priors, which we then update as we observe new data.

Why was all this important?

All of you may have heard of the MSE (mean squared error) loss function, but you cannot use it in every situation, because it is derived by assuming a Gaussian distribution for the data. Similarly, other loss functions (for example, cross-entropy used with a softmax output) are derived from their own distributional assumptions.

In cases where you have to assume a different distribution, such as a Poisson, MSE would not be a good loss to use.
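
To see why MSE is tied to the Gaussian assumption, model each target as t_n = y(x_n, w) plus zero-mean Gaussian noise. The negative log-likelihood of the data then becomes

    -\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \big( t_n - y(x_n, \mathbf{w}) \big)^2 + \frac{N}{2} \ln \sigma^2 + \frac{N}{2} \ln(2\pi)

and the only term that depends on w is the sum of squared errors, so maximizing the likelihood over w is exactly minimizing the (mean) squared error.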

Set a Prior on the Parameters Too

Copyright: https://imgflip.com/i/1p3ai9

Another design choice you can make is to assume that the parameter values themselves also follow a probability distribution. How do we choose that distribution?

Ideally, you can choose any distribution. But practically you want to choose a conjugate prior for the parameters. Suppose your likelihood for the input data is Gaussian and you now want to select a prior for the mean. You should choose a prior such that, after applying Bayes’ rule, the resulting posterior has the same form as the prior; the same goes for the variance. Such priors are called conjugate priors.

Just for reference, the conjugate prior for the mean (with known variance) is also Gaussian, while the conjugate prior for the variance is the inverse gamma distribution.
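
For reference, here is what conjugacy buys you in the Gaussian case: with known variance \sigma^2, a Gaussian prior \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) on the mean, and N observed points with sample mean \bar{x}, the posterior over the mean is again Gaussian:

    p(\mu \mid \mathbf{x}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2),
    \qquad
    \mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\bar{x},
    \qquad
    \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}

As more data arrives the posterior mean moves from the prior mean \mu_0 towards the sample mean, and the posterior variance shrinks.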

Congratulations if you made it to the end of the post. I barely scratched the surface, but I tried to present the material in a more interactive manner, focused on building intuition.

Here is the list of resources that I would highly recommend for learning ML:

  1. CS229: Machine Learning by Stanford
  2. Pattern Recognition and Machine Learning by Christopher M. Bishop
  3. An Introduction to Statistical Learning: with Applications in R
  4. The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Comment below with suggestions for new topics or corrections, and I would be more than happy to help you if any problem arises.

Connect with me on LinkedIn.

Next time, I will make a short post about all the technical terms used, like regression. After that, I will move on to Linear Regression.
