Logistic Regression Algorithm

Hi, everyone. I am Orhan Yagizer. In this article, I will work with the logistic regression algorithm in Python. Let’s get started.

Firstly, what is a logistic regression algorithm?

Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.
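At the heart of the model is the logistic (sigmoid) function, which squashes any real-valued input into the (0, 1) range — that is what lets the output be read as a probability. A minimal sketch in Python:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # exactly 0.5 at z = 0
print(sigmoid(4))    # close to 1
print(sigmoid(-4))   # close to 0
```

Thresholding this probability (typically at 0.5) turns the model's output into a hard class label.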

As I mentioned above, logistic regression appears everywhere in our lives, that’s why it’s important to learn it and know it.

What are the differences between linear regression and logistic regression?

Sometimes these two algorithms can be confused with each other.

Linear regression is used to predict a continuous dependent variable using a given set of independent variables. It is used for solving regression problems: in linear regression, we predict the value of continuous variables.

On the other hand, logistic regression is used to predict a categorical dependent variable using a given set of independent variables. Logistic regression is used for solving classification problems: in logistic regression, we predict the values of categorical variables.
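The contrast is easy to see side by side. The sketch below fits both models on a made-up toy data set (hours studied vs. exam score and pass/fail — illustrative numbers, not real data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: hours studied (feature), with two different targets
X = np.array([[1], [2], [3], [4], [5], [6]])
scores = np.array([35, 45, 55, 60, 75, 90])   # continuous target
passed = np.array([0, 0, 0, 1, 1, 1])         # categorical target

# Linear regression predicts a continuous value
lin = LinearRegression().fit(X, scores)
print(lin.predict([[3.5]]))          # a continuous score estimate

# Logistic regression predicts a class (and a class probability)
log = LogisticRegression().fit(X, passed)
print(log.predict([[3.5]]))          # a 0/1 class label
print(log.predict_proba([[3.5]]))    # probability of each class
```

Same input, same API, but one model answers "how much?" and the other answers "which class, and how likely?".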


Logistic Regression Analysis with Python

Now it’s time to analyze this in Python. I will mostly use scikit-learn, and I will use the Titanic data set from Kaggle. It’s a very famous ML data set; you can download it from Kaggle.

Firstly, we will import the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let’s start by reading in the Titanic data set file into a pandas dataframe. My file’s name is “titanic_train.csv”. Then check the dataframe’s head.

train = pd.read_csv('titanic_train.csv')
train.head()

Let’s begin some exploratory data analysis. We’ll start by checking out missing data. We can use seaborn to create a simple heatmap to see where we are missing data.

plt.figure(figsize=(10,6))
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap="Greens")

Almost one-fourth of the age data is missing. Also, look at the Cabin column: we are missing too much of that data. We can impute the Age column, but we’ll have to drop the Cabin column.
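If you prefer exact numbers over a heatmap, `isnull().sum()` gives per-column counts of missing values. The tiny frame below is an illustrative stand-in (not the real Titanic data), mimicking the Age/Cabin gaps:

```python
import pandas as pd
import numpy as np

# Illustrative stand-in frame with missing values
df = pd.DataFrame({
    'Age':   [22, np.nan, 26, np.nan, 35],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
})

print(df.isnull().sum())            # missing count per column
print(df.isnull().mean() * 100)     # missing percentage per column
```

Running the same two lines on the Titanic frame would confirm that Age and Cabin dominate the missing data.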

Let’s continue by visualizing some more of the data.

sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='pastel')
Countplot

According to the chart above, most people did not survive. Now let’s look at survivors by sex.

sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
Countplot

As you can see, most of the men didn’t survive the sinking of the Titanic. Approximately 220 women and 110 men survived.

Let’s look at survivors by class column.

sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='viridis')
Countplot

As you can see, a larger share of class 1 passengers survived, whereas most of those who died were in class 3.

Let’s look at our age column.

plt.figure(figsize=(10,7))
sns.histplot(train["Age"].dropna(), bins=30)  # distplot is deprecated in newer seaborn
Distplot

As you can see, most passengers are between 20 and 30 years old.

Now, we should do some data cleaning. I want to fill in the missing age data instead of just dropping those rows. First, let’s check the average age by passenger class with a boxplot; then we will write a function that imputes the average age for each class.

plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='viridis')
Boxplot

def trans_age(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

Now, apply the function.

train['Age'] = train[['Age','Pclass']].apply(trans_age,axis=1)
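For comparison, a more concise pandas idiom fills each missing age with the mean age of that passenger's class via groupby-transform — using computed class means rather than the fixed 37/29/24 values above. Shown here on a toy frame, not the actual Titanic data:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Titanic data
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Age':    [40, np.nan, 30, np.nan, 20],
})

# Fill each missing Age with the mean Age of that Pclass;
# transform returns a series aligned to the original index
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean'))
print(df)
```

This variant adapts automatically if the class-wise averages change, at the cost of being slightly less explicit than the function above.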

Also, we should drop the Cabin column due to its missing data.

train.drop('Cabin',axis=1,inplace=True)
train.head()
New Dataframe

Now check the heatmap again.

plt.figure(figsize=(10,6))
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap="Greens")
Heatmap for missing values

As you can see, we don't have any missing values anymore.

Before the modelling process, we should convert the categorical data to dummy variables using pandas’ get_dummies function. Our categorical columns are Sex and Embarked. After the dummy process, we can drop the original categorical columns, along with Name and Ticket, which we won’t use as features.

sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
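Why `drop_first=True`? With k categories, only k − 1 dummy columns are needed; the dropped category is implied when all the others are zero, and keeping it would make the columns perfectly collinear. A toy illustration:

```python
import pandas as pd

s = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

print(pd.get_dummies(s))                    # two columns: female, male
print(pd.get_dummies(s, drop_first=True))   # one column: male (female is implied)
```

With `drop_first=True`, a row with `male == 0` unambiguously means female, so no information is lost.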

Now we should concatenate our dataframes. Then we’ll check the train’s head.

train = pd.concat([train,sex,embark],axis=1)
train.head()
Modelling Data

Our data is ready for modelling. Now we can split it into a training set and a test set. Our target column is Survived; that is what we will try to predict.

I’ll use scikit-learn, so we need to import it.

from sklearn.model_selection import train_test_split

X = train.drop('Survived',axis=1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)
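One optional refinement, not used in this article: passing `stratify=y` keeps the survived/not-survived ratio the same in both splits, which matters when the target classes are imbalanced. A sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 8 zeros, 4 ones
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(y_train.mean(), y_test.mean())  # same class ratio in both splits
```

Without stratification, a small test set can end up with a noticeably different class balance than the training set, which skews the evaluation metrics.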

Then, we should train the model and make predictions with the logistic regression algorithm. For this, we import LogisticRegression from scikit-learn’s linear_model module.

from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
Output

We have trained and fit our model. Now we can make predictions.

predictions = logmodel.predict(X_test)
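Under the hood, `predict` returns hard 0/1 labels by thresholding the model’s estimated probability at 0.5; `predict_proba` exposes the probabilities themselves, which is useful if you want a custom threshold. A sketch on toy data (the Titanic frame is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.], [1.], [2.], [3.], [4.], [5.]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)[:, 1]    # P(class == 1) for each row
labels = (proba >= 0.5).astype(int)     # same result as model.predict(X) here
print(proba)
print(labels)
```

Raising the threshold above 0.5 trades recall for precision, which can be worthwhile when false positives are costlier than false negatives.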

Let’s move on to evaluating our model. We can check precision, recall, and F1-score using a classification report and a confusion matrix. For this, we import metrics from scikit-learn.

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test,predictions))
print("\n")
print(classification_report(y_test,predictions))
Confusion Matrix and Classification Report
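For reference, the confusion matrix packs the counts of true negatives, false positives, false negatives, and true positives, and the report’s metrics fall out of it directly. The numbers below are hypothetical, not this model’s actual results:

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, cols = predicted class
cm = np.array([[140, 25],    # [TN, FP]
               [ 35, 68]])   # [FN, TP]

tn, fp, fn, tp = cm.ravel()
accuracy  = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

This is exactly what classification_report computes for each class, so it’s a good sanity check when reading the report.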

Here is the result of our model. It’s not bad; the predictions are fine. But with real-world data, the modelling process isn’t this easy. We would need to do much more EDA on real-world data.

Today, we analyzed the logistic regression algorithm with the Titanic data set in Python.

I hope you enjoyed my article and found it useful. Thanks for reading!

Orhan Yağızer Çınar


Leader at Young Leaders Over The Horizon | YetGen 21'2 | High School Advisory Board Member at GelecektekiSen | Blogger | Data Science | www.orhanyagizercinar.com