Gradient Descent is one of the first algorithms you learn in machine learning, and one of the most popular optimization algorithms for training a model.
Watch the video here:
Get the example code here - Github
Full blog post soon…
Transcript:
This iterative, first-order algorithm is used to find the local minimum (or maximum) of a function. In machine learning, we use this algorithm to minimize a cost or loss function, such as in linear regression.
To understand all of these definitions, we need to understand what a machine actually learns. In simple terms, a model learns a function f such that f(x) maps to y. You remember the slope-intercept form of a line: y = mx + b. In machine learning, your algorithm is trying to predict the values of y. To do that, it has to find values of m and b so that the predictions are as close as possible to the actual answers. In the general case, m becomes an n-dimensional parameter vector, x becomes an n-dimensional feature vector, and b is the bias.
You can think of the parameter vector as the “weights” of the model. This parameter vector is what the model updates every iteration. The “features” in the feature vector are the input values that determine how the function behaves. For example, a model created to approve someone for a credit card would have many features: age, income, sex, credit score, etc. Any time you feed those inputs in as x, we want y to be an accurate result: approve or deny.
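The weighted-sum idea above can be sketched in a few lines of Python. The feature names and weight values here are made up for illustration; a real model would learn the weights.

```python
# Sketch of a linear model f(x) = m . x + b, with a parameter (weight)
# vector m and a feature vector x of the same dimension.

def predict(m, x, b):
    """Dot product of weights and features, plus the bias."""
    return sum(mi * xi for mi, xi in zip(m, x)) + b

# Hypothetical credit-card example: features = [age, income, credit_score].
weights = [0.02, 0.0001, 0.005]   # illustrative values, not learned
features = [35, 60000, 700]
bias = -5.0

score = predict(weights, features, bias)  # a raw score we could threshold
```

Training is exactly the search for the `weights` and `bias` that make `predict` match the actual answers as closely as possible.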
To train the model, we have to find the optimal values for m and b. How do we do that? We minimize the loss/cost function. There are multiple loss/cost functions to pick from, but a pretty popular one is called the mean-squared-error loss/cost function, or MSE. If you look at the function, it takes the model’s predicted output, subtracts the actual output, squares the difference, and then averages over all the samples (the 1/N factor, where N is the number of data points). Let’s look at a real-world example of this.
Let’s say I took an exam that had 10 questions. After the answer key was published, I decided to compare my answers to see how well I did. If my answer f(xi) and the real answer yi have a difference of 0, that means I got that question 100% correct - and if this were a machine learning model, it would imply that I have the best model for the task. Notice that we square the difference to accentuate the error, as well as get rid of any negatives. This is because the error can come from either side - too high a prediction or too low a prediction.
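The MSE described above can be written as a minimal sketch, with the exam analogy in mind: predictions are my answers, actuals are the answer key.

```python
# Mean squared error over N samples: average of squared differences.
# Squaring removes the sign and accentuates large errors.

def mse(predictions, actuals):
    n = len(predictions)
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / n

# If every prediction matches the answer key, the loss is 0.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(mse([1.0, 2.0], [2.0, 4.0]))            # (1 + 4) / 2 = 2.5
```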
Before I continue, make sure you subscribe to my channel. This is the first video of a series I’m doing called “from scratch” where I’m going to break down machine learning. Anyways, we have enough context now to start getting into the Gradient Descent algorithm.
This algorithm allows us to minimize our loss/cost function so that we get more accurate predictions. Finding or getting close to the minimum of the function allows us to fine-tune the parameters until we reach the convergence point, where the model stops learning. You can think of the gradient as the slope of the function.
Let’s look at some of the math: First of all, the gradient descent algorithm doesn’t work on all functions. For it to work, a function has to be differentiable - meaning it has a derivative at every point in its domain - and convex. For a univariate function, convexity means that a line segment connecting any two points on the function lies on or above the curve. If the segment crosses the curve, the function may have a local minimum that isn’t the global one.
You can check mathematically whether a univariate function is convex by taking its second derivative. If its value is always greater than 0, you have a convex function. This works because a positive second derivative means the slope f'(x) is always increasing.
Let’s look at a simple quadratic function: f(x) = x^2 + 3x - 1. Taking the first derivative, we get f'(x) = 2x + 3. Taking the second derivative, we get f''(x) = 2. Since 2 > 0, this function is convex.
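As a sanity check on the same quadratic, we can approximate the second derivative numerically (this finite-difference approach is my own illustration, not part of the analytic derivation above) and confirm it stays positive everywhere we sample.

```python
# f(x) = x^2 + 3x - 1, the quadratic from the example above.
def f(x):
    return x ** 2 + 3 * x - 1

def second_derivative(g, x, h=1e-4):
    """Central finite-difference approximation of g''(x)."""
    return (g(x + h) - 2 * g(x) + g(x - h)) / h ** 2

# f''(x) = 2 for every x, so the convexity check passes across the sample.
assert all(second_derivative(f, x) > 0 for x in range(-10, 11))
```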
It is possible to use this algorithm with quasiconvex functions, but the algorithm has a chance to get stuck. Quasiconvex functions are functions that are convex over certain parts of their domain but lose that property over other parts. This introduces things like saddle points, which are points where the slope is zero but that aren’t local extrema. Here is an example of a function with saddle points. Multivariate functions require further calculation to determine saddle points, using things like the Hessian matrix - which I’ll go over in a later video.
Anyways, back to the gradient. A gradient is the slope of a curve at a given point in a specified direction. In univariate functions, this is just the first derivative. In multivariate functions, it is a vector of derivatives. Since we only care about the slope along one axis at a time, we take what is called a partial derivative. A partial derivative is a derivative taken with respect to one variable while holding all the other variables constant.
A gradient for an n-dimensional function f(x) at a given point p looks like this: ∇f(p) = [∂f/∂x₁(p), ∂f/∂x₂(p), …, ∂f/∂xₙ(p)]
The upside-down triangle, ∇, is called del or nabla. It just denotes the gradient of the function f at point p.
Let’s look at this two-dimensional function: f(x, y) = 1.5x² + y². Say I want to calculate the gradient at the point p = (5, 5). Taking the partial derivatives, we get 3x with respect to x and 2y with respect to y. This means our gradient vector looks like this: ∇f(x, y) = [3x, 2y]. Plugging in x = 5 and y = 5, we get the gradient vector [15, 10]. This means the slope is a third less steep along the y-axis than along the x-axis.
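The worked example above translates directly into code - the partials are hard-coded here because we derived them by hand:

```python
# Gradient of f(x, y) = 1.5x^2 + y^2, using the hand-derived partials
# df/dx = 3x and df/dy = 2y.

def gradient(x, y):
    return [3 * x, 2 * y]

print(gradient(5, 5))  # [15, 10]
```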
Now let’s look at the gradient descent algorithm. The algorithm iteratively calculates the next point by taking the gradient at the current position pₙ, scaling it by a learning rate we can call r, and subtracting the result from the current position. This gives us the next point: pₙ₊₁ = pₙ − r∇f(pₙ). The learning rate r is an important value, as it influences the performance of the minimization. The smaller the rate, the longer the algorithm takes to find the minimum of the function. If the learning rate is too high, it might take bigger steps than needed and miss the minimum entirely.
Let’s simplify the process of gradient descent: First, it chooses a starting point. Second, it calculates the gradient at this point. Third, it steps in the opposite direction of the gradient. It repeats steps 2 and 3 until either the maximum number of iterations is reached, or the step size is smaller than your tolerance value.
Your tolerance value is what stops the algorithm once it reaches a close-enough value where another jump wouldn’t make a meaningful difference.
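The loop described above can be sketched as follows, run on the earlier univariate example f(x) = x² + 3x − 1 with f'(x) = 2x + 3. The learning rate, starting point, and tolerance here are illustrative choices, not prescribed values.

```python
# Minimal gradient descent on a univariate function, using its
# derivative. Stops on an iteration cap or when the step is tiny.

def gradient_descent(grad, start, learning_rate=0.1,
                     max_iterations=1000, tolerance=1e-8):
    point = start
    for _ in range(max_iterations):          # stop 1: iteration cap
        step = learning_rate * grad(point)   # scale gradient by the rate
        if abs(step) < tolerance:            # stop 2: step below tolerance
            break
        point = point - step                 # move against the gradient
    return point

# f'(x) = 2x + 3 is zero at x = -1.5, the minimum of f.
minimum = gradient_descent(lambda x: 2 * x + 3, start=10.0)
print(minimum)  # close to -1.5
```

Try shrinking `learning_rate` to see convergence slow down, or raising it past 1.0 to watch the iterates overshoot and diverge.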
Now that I’ve explained mostly everything, let’s code it up.