Logistic Regression Algorithm

Hi, everyone. I am Orhan Yagizer. In this article, I will work with the logistic regression algorithm in Python. Let’s get started.

Firstly, what is a logistic regression algorithm?
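In short, logistic regression is a classification algorithm: it takes a weighted sum of the input features and passes it through the sigmoid function, which squeezes any number into a probability between 0 and 1. The snippet below is only a minimal sketch of that idea. The weights, bias and feature values are made up for illustration and do not come from any fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for two features (illustrative values only)
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # the usual 0.5 decision threshold
print(round(p, 3), label)
```

A probability of 0.5 or more is read as class 1, anything lower as class 0.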


As I mentioned above, logistic regression appears everywhere in our lives, which is why it is important to learn it well.

What are the differences between linear regression and logistic regression?

Sometimes these two algorithms can be confused with each other.
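The main difference is the kind of target each one predicts: linear regression outputs a continuous number, while logistic regression outputs a class label together with a probability. As a small sketch, both scikit-learn models can be fitted side by side on made-up data. The "hours studied versus passing an exam" data is an assumption for illustration and has nothing to do with the Titanic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (hypothetical): hours studied and whether the exam was passed
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(hours, passed)
log = LogisticRegression().fit(hours, passed)

new_hours = np.array([[0], [12]])

# Linear regression returns unbounded numbers, which can fall outside 0..1
lin_out = lin.predict(new_hours)

# Logistic regression returns a class label and a probability between 0 and 1
log_labels = log.predict(new_hours)
log_proba = log.predict_proba(new_hours)[:, 1]

print(lin_out)
print(log_labels, log_proba)
```

On this toy data the linear model predicts a value below 0 for 0 hours and above 1 for 12 hours, which makes no sense as a probability. The logistic model stays inside the 0 to 1 range.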


Logistic Regression Analysis with Python

Now it’s time to try it in Python. I will mostly use scikit-learn, together with the Titanic data set from Kaggle, which is a very famous ML data set. You can download the data set from here.

Firstly, we will import the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let’s start by reading the Titanic data set file into a pandas dataframe. My file’s name is “titanic_train.csv”. Then we check the dataframe’s head.

train = pd.read_csv('titanic_train.csv')
train.head()

Let’s begin some exploratory data analysis. We’ll start by checking out missing data. We can use seaborn to create a simple heatmap to see where we are missing data.
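One way to do this is sketched here. The small dataframe is a hypothetical stand-in for the Titanic table so that the snippet runs on its own; with the real data you would pass train instead of demo.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Small stand-in for the Titanic table (hypothetical values, not the real file)
demo = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0],
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
})

# Share of missing values per column, as a numeric companion to the plot
missing_share = demo.isnull().mean()
print(missing_share)

# Each missing cell is drawn in a contrasting colour
ax = sns.heatmap(demo.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```

The heatmap gives a quick visual impression, while isnull().mean() gives the exact share of missing values in each column.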


About one-fifth of the Age values are missing, which is a small enough share to fill in later. The Cabin column is a different story: it is missing far too many values to be useful, so we’ll have to drop it.

Let’s continue by visualizing some more of the data.


According to the chart above, most passengers did not survive. Now let’s look at survivors by sex.


As you can see, most men didn’t survive the sinking of the Titanic. Approximately 220 women and 110 men survived.

Let’s look at survivors by passenger class.


As you can see, first class had the most survivors, whereas third class had the most deaths.
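The same comparisons can also be read as plain counts. Here is a minimal sketch with pandas, using a handful of hypothetical rows instead of the real file; the column names Survived, Sex and Pclass follow the Titanic data set.

```python
import pandas as pd

# Small stand-in for the Titanic table (hypothetical rows, not the real file)
demo = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1, 0, 0],
    'Sex': ['male', 'female', 'female', 'male',
            'male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 2, 3, 3, 1, 3, 2],
})

# Overall counts of the target (0 = died, 1 = survived)
overall = demo['Survived'].value_counts()

# Rows are groups, columns are the Survived values
by_sex = pd.crosstab(demo['Sex'], demo['Survived'])
by_class = pd.crosstab(demo['Pclass'], demo['Survived'])

print(overall)
print(by_sex)
print(by_class)
```

On the real dataframe, a call such as sns.countplot(x='Survived', hue='Sex', data=train) draws the same counts as bars.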

Let’s look at our age column.


As you can see, most passengers were between 20 and 30 years old.
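To check an age distribution without a plot, the ages can be grouped into bins and counted. This is only a sketch: the ages are hypothetical and the 10-year bin width is my own choice.

```python
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([4, 19, 22, 24, 27, 28, 29, 35, 41, 58, None], name='Age')

# Group ages into 10-year bins: [0, 10), [10, 20), ...
bins = list(range(0, 90, 10))
age_groups = pd.cut(ages, bins=bins, right=False)

# Count passengers per bin; the missing age is left out of the counts
counts = age_groups.value_counts().sort_index()
print(counts)
print(counts.idxmax())
```

On the real dataframe, train['Age'].hist(bins=30) draws the corresponding histogram.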

Now we should clean the data. I want to fill in the missing ages instead of just dropping the rows where the age is missing, so we will create a function for it. It fills each missing value with the average age of the passenger’s class.

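def trans_age(cols):
    # cols is one row holding the Age and Pclass values
    Age = cols['Age']
    Pclass = cols['Pclass']

    # Fill a missing age with the average age of the passenger's class
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        return 24
    return Age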

Now, apply the function.

train['Age'] = train[['Age','Pclass']].apply(trans_age,axis=1)
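As an alternative to hard-coding the averages, the class means can be computed from the data itself. The following is a sketch of that variant, not the method used in this article, and it runs on a hypothetical mini table.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Titanic table (hypothetical values)
demo = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age': [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 28.0, np.nan],
})

# Mean age per class, broadcast back to every row of that class
class_mean_age = demo.groupby('Pclass')['Age'].transform('mean')

# Keep known ages, fill only the missing ones with their class mean
demo['Age'] = demo['Age'].fillna(class_mean_age)
print(demo)
```

This avoids the row-by-row apply call and keeps working if the data, and therefore the averages, change.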

Also, we should drop the cabin column due to missing data.
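A minimal sketch of this step on a hypothetical mini table follows. The extra dropna call is my assumption for removing any rows that still have a gap in another column.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Titanic table (hypothetical values)
demo = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})

# Remove the mostly empty column entirely
demo = demo.drop('Cabin', axis=1)

# Drop the few rows that still contain a missing value (assumed extra step)
demo = demo.dropna()
print(demo.columns.tolist(), len(demo))
```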

New Dataframe

Now check the heatmap again.

Heatmap for missing values

As you can see, we don't have any missing values anymore.

Before the modelling process, we should convert the categorical data to dummy variables. We can use pandas’ get_dummies for this. Our categorical columns are Sex and Embarked. After creating the dummies, we can drop the original categorical columns.

sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)

Now we should concatenate our dataframes. Then we’ll check the train’s head.

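# Drop the original text columns; Name and Ticket are dropped as well
# because the model can only work with numeric columns
train = train.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1)
train = pd.concat([train, sex, embark], axis=1)
train.head()

Modelling Data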

Our data is ready for modelling. Now we can split it into a training set and a test set. Our target column is Survived, so we will try to predict the Survived column.

I’ll use scikit-learn for this, so we import train_test_split from it.

from sklearn.model_selection import train_test_split

X = train.drop('Survived',axis=1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)
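To see what test_size and random_state do, here is a small sketch on ten hypothetical observations. The variable names are only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical observations with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.3 keeps 30% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces exactly the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

Fixing random_state makes the results repeatable, which helps when comparing models.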

Then we should train a logistic regression model and predict with it. For this, we import LogisticRegression from scikit-learn’s linear_model module.

from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
# Fit the model on the training data before predicting
logmodel.fit(X_train, y_train)

We have trained the model on our training data. Now we can make predictions.

predictions = logmodel.predict(X_test)
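Behind predict, the model first computes a probability for each row and then applies a 0.5 threshold. The following sketch shows this on hypothetical one-feature data, using predict_proba from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, binary target
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_demo, y_demo)

X_new = np.array([[0.5], [3.4], [7.0]])

# Probability of class 1 for each new observation
proba = model.predict_proba(X_new)[:, 1]

# predict() returns 1 where that probability is above 0.5
labels = model.predict(X_new)
print(proba)
print(labels)
```

Looking at the probabilities is useful when you want to know how confident the model is, not only which class it picked.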

Let’s move on to evaluating our model. We can check precision, recall and f1-score using a classification report and a confusion matrix. For this, we import them from scikit-learn’s metrics module.

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

Confusion Matrix and Classification Report
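To make the numbers in such a report easier to read, here is a small sketch that derives precision, recall and f1-score from a confusion matrix by hand. The labels are hypothetical and are not the results of the Titanic model.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = survived, 0 = died)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# Layout for binary labels: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted survivors, how many survived
recall = tp / (tp + fn)     # of the real survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
```

The classification report shows the same three metrics, calculated once for each class.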

Here is the result of our model. It’s not bad; the predictions are fine. But with real-world data, the modelling process isn’t that easy, and we would need to do more EDA.

Today, we analyzed the logistic regression algorithm with the Titanic data set in Python.

I hope you enjoyed my article and that it will be useful for you. Thanks for reading!

Orhan Yağızer Çınar


Founder at Codecort | Leader at Young Leaders Over The Horizon | YetGen 21'2 | Advisory Board Member at GelecektekiSen | Blogger | Data Science | orhanyagizercinar.com