Logistic Regression Using SGD from Scratch

Satishkumar Moparthi
3 min read · Jan 18, 2021

While Python’s Scikit-learn library provides the easy-to-use and efficient SGDClassifier, the objective of this post is to create our own implementation without using sklearn. Implementing basic models from scratch is a great way to improve your understanding of how they work.

Data set

Create a custom dataset using the built-in make_classification function from sklearn.
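A minimal sketch of the dataset creation is shown below; the exact make_classification arguments from the original notebook are not shown in this post, so everything apart from the 50,000 × 15 shape (informative/redundant feature counts, split ratio, random seeds) is an assumption.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 50,000 rows and 15 features, matching the shapes used in this post
X, y = make_classification(n_samples=50000, n_features=15, n_informative=10,
                           n_redundant=5, random_state=15)

# hold out 25% of the data for evaluation (the split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=15)

print(X.shape, y.shape)  # (50000, 15) (50000,)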

The above code generates a dataset where X has shape (50000, 15) and y has shape (50000,).

Logistic Regression

Input values (x) are combined linearly using weights or coefficient values to predict an output value (y). A key difference from linear regression is that the output being modeled is a binary value (0 or 1) rather than a numeric value. To map this linear combination to a probability between 0 and 1, we use the sigmoid (logistic) function.

Fig 1. Logistic function
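In code, a minimal sigmoid looks like this (the function name is mine, not necessarily the one used in the original notebook):

import numpy as np

def sigmoid(z):
    # squashes the linear combination z = w·x + b into (0, 1),
    # which we read as the probability of class 1
    return 1.0 / (1.0 + np.exp(-z))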

Loss function

Log Loss is the most important classification metric based on probabilities. For any given problem, a lower log-loss value means better predictions. Log Loss is a slight twist on something called the Likelihood Function: in fact, Log Loss is −1 times the log of the likelihood, averaged over the data.

Fig 2. Log loss
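As a sketch, the binary log loss for predicted probabilities y_pred against true labels y_true can be computed as:

import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # clip so we never take log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # -1 * the average log-likelihood of the true labels
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))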

SGD classifier

SGD is an optimization method; SGDClassifier implements regularized linear models trained with Stochastic Gradient Descent. Stochastic gradient descent considers only one random point (batch size = 1) when updating the weights, whereas plain gradient descent uses the entire data set for every update, which makes SGD the better choice on larger data sets such as this one (50,000 rows).

Out of the box, the SGD Classifier does not perform as well as Logistic Regression; it needs some hyperparameter tuning.
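For reference, a scikit-learn baseline might look like the following; the hyperparameter values here are illustrative, not tuned.

from sklearn.linear_model import SGDClassifier

# loss='log' gives logistic regression trained with SGD
# (recent scikit-learn versions call this loss 'log_loss')
clf = SGDClassifier(loss='log', learning_rate='constant', eta0=0.0001,
                    penalty='l2', alpha=0.0001, random_state=15)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))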

Gradient descent

Our goal is to minimize the loss function, and to do that we have to adjust the weights, increasing or decreasing them until they fit the data. This is achieved with the derivative of the loss function with respect to each weight: the derivatives give us a clear picture of how the loss changes with each parameter.

Fig 3. Partial derivative
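In code, the partial derivatives for a single point (x, y) might look like this (relying on the sigmoid function defined above; the L2 term is added later in the full implementation):

def gradients(x, y, w, b):
    # prediction for a single training point
    y_hat = sigmoid(np.dot(w, x) + b)
    # partial derivatives of the log loss w.r.t. the weights and the bias
    dw = (y_hat - y) * x
    db = y_hat - y
    return dw, db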

We update the weights by subtracting the derivative times the learning rate eta0:

w = w - (eta0 * dw)

We should repeat these steps for several passes over the data until we reach the optimal solution (minimal log loss).
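Putting the update rule in a loop over epochs gives a training sketch roughly like the one below; it reuses the gradients helper above, and the epoch count and learning rate are assumptions rather than the original notebook's values.

def fit(X, y, epochs=50, eta0=0.0001):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for epoch in range(epochs):
        # visit the points one at a time in random order (batch size = 1)
        for i in np.random.permutation(n):
            dw, db = gradients(X[i], y[i], w, b)
            w = w - eta0 * dw
            b = b - eta0 * db
    return w, b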

Predictions

From the sigmoid function we get the probability that an input x belongs to class 1; a threshold turns that probability into a label. Let’s take all probabilities ≥ 0.5 as class 1 and all probabilities < 0.5 as class 0. In practice, this threshold should be chosen based on the business problem we are working on.
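A small prediction helper based on that threshold might look like this (again using the sigmoid defined earlier):

def predict(X, w, b, threshold=0.5):
    # probability that each row of X belongs to class 1
    proba = sigmoid(np.dot(X, w) + b)
    # probabilities at or above the threshold become class 1, the rest class 0
    return (proba >= threshold).astype(int)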

Under one umbrella

The full implementation below adds L2 regularization.
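Here is a sketch of everything combined, with the L2 penalty folded into the weight gradient. The hyperparameters (eta0, alpha, epochs) are illustrative defaults, not necessarily the exact values behind Fig 4.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def fit_sgd(X_train, y_train, X_test, y_test, epochs=50, eta0=0.0001, alpha=0.0001):
    n, d = X_train.shape
    w, b = np.zeros(d), 0.0
    train_loss, test_loss = [], []
    for epoch in range(epochs):
        for i in np.random.permutation(n):
            y_hat = sigmoid(np.dot(w, X_train[i]) + b)
            # log-loss gradient plus the L2 penalty term (alpha / n) * w
            dw = (y_hat - y_train[i]) * X_train[i] + (alpha / n) * w
            db = y_hat - y_train[i]
            w = w - eta0 * dw
            b = b - eta0 * db
        # track train and test log loss after every epoch,
        # which is the kind of curve shown in Fig 4
        train_loss.append(log_loss(y_train, sigmoid(np.dot(X_train, w) + b)))
        test_loss.append(log_loss(y_test, sigmoid(np.dot(X_test, w) + b)))
    return w, b, train_loss, test_loss

w, b, train_loss, test_loss = fit_sgd(X_train, y_train, X_test, y_test)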

Fig 4. Train log loss vs Test log loss
0.9522133333333334
0.95


Feel free to connect with me on LinkedIn.
