Cross Validation in Python using StatsModels and Sklearn with Logistic Regression Example

Ramanpreet Bhatia · Published in Geek Culture · 6 min read · Jul 4, 2021

In this tutorial, we will learn what cross validation is in machine learning and how to implement it in Python using the StatsModels and Sklearn packages.

(Image: Cross validation in machine learning)

What is Cross Validation and why do we need it?

Cross validation is a resampling method in machine learning. To understand cross validation, we first need to review the difference between the train error rate and the test error rate.

The train error rate is the average error (misclassification, in classification problems) that results from applying the model to the same data on which it was trained.

In contrast, the test error rate is the average error that results from using the trained model on an unseen test dataset (also known as a validation dataset).

In the absence of test data, we won't be able to tell whether our model works equally well on unseen data, which is the ultimate goal of any machine learning problem. The process of using test data to estimate the average error when the fitted/trained model is used on unseen data is called cross validation. In simple words, we cross-validate our prediction on unseen data, and hence the name "cross validation".
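To see the gap between the two error rates concretely, here is a minimal sketch on synthetic data (an illustrative addition, not part of the original tutorial):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train error:", 1 - clf.score(X_train, y_train))  # error on data the model has seen
print("test error:", 1 - clf.score(X_test, y_test))     # error on unseen data

Typically the train error comes out lower than the test error, which is exactly why we cannot judge a model by its train error alone.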

Types of Cross Validation

  • Validation set approach

This approach is the simplest of all. Just divide your data into two parts, i.e. train and test datasets. Train your model on the train dataset and run the validation on the test dataset. This approach can be problematic because we are assuming that our test data represents the whole data, an assumption that can be violated in practice. As a result, our test error estimates can be very unstable. In addition, since machine learning methods tend to perform worse when trained on fewer observations, the validation set error rate may overestimate the test error rate.

  • Leave-One-Out Cross Validation:

As the name suggests, we leave one observation out of the training data while training the model. Technically, this approach is the same as above, except that our test dataset has just one row. What is different is that we repeat this experiment in a for loop, taking one row as the test data in each iteration, collecting the test error for as many rows as possible, and averaging the errors at the end. Ideally, we run the loop n times (where n = sample size). By now we can already spot the problem: when your data is big, this method can be very inefficient (see the short sketch after this list).

  • k-Fold Cross Validation:

This is a hybrid of the above two types. We divide the data into k folds and run a for loop k times, taking one of the folds as the test dataset in each iteration.
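Coming back to leave-one-out for a moment, here is the minimal sklearn sketch promised above (illustrative only, on synthetic data; we will not use LOOCV later in this tutorial):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # small synthetic dataset
# LeaveOneOut refits the model n times, once per held-out observation,
# which is exactly why LOOCV gets expensive on large data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV misclassification rate:", 1 - scores.mean())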

We will use the validation set approach and k-fold cross validation in this tutorial.

Application

Let's get our hands dirty with some coding. Are you excited?

ME TOO!

Problem Setup

We will use the heart dataset to predict the probability of a heart attack using all predictors in the dataset. In doing so, we also want to estimate the test error of this logistic regression model using cross validation.

Data Set Information:

I took this dataset from the Center for Machine Learning and Intelligent Systems:

https://archive.ics.uci.edu/ml/datasets/Heart+Disease

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient and is an integer valued from 0 (no presence) to 4. In the processed data used here, this is binarized into the "target" field: a value of 0 means no/less chance of a heart attack, and a value of 1 means more chance of a heart attack.

The names and social security numbers of the patients were removed from the database and replaced with dummy values.

All four unprocessed files also exist in this directory. The one file that has been "processed" is the one containing the Cleveland database.

To see Test Costs (donated by Peter Turney), please see the folder “Costs”.

Attributes used:

  1. #3 (age)
  2. #4 (sex)
  3. #9 (cp)
  4. #10 (trestbps)
  5. #12 (chol)
  6. #16 (fbs)
  7. #19 (restecg)
  8. #32 (thalach)
  9. #38 (exang)
  10. #40 (oldpeak)
  11. #41 (slope)
  12. #44 (ca)
  13. #51 (thal)
  14. #58 (target) (the predicted attribute)

Please note that this dataset has some missing data. For simplicity, we will just attempt a complete-case analysis (see the snippet below).

import pandas as pd
import numpy as np

df = pd.read_csv('heart.csv')
df.head()
(Output: the first five rows of the dataset)
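Since we plan a complete-case analysis, a quick check-and-drop looks like this (a sketch assuming any missing values in this CSV are encoded as NaN; in some versions of the file they appear as special codes instead):

print(df.isna().sum())  # count missing values per column
df = df.dropna()        # keep complete cases only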

Logistic Regression Model using StatsModels

The simplest and most elegant (as compared to sklearn) way to look at the initial model fit is to use statsmodels. I admire the summary report it generates in just one line of code.

from statsmodels.formula.api import logit
fit_logit = logit("target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal", df).fit()
print(fit_logit.summary())
Optimization terminated successfully.
Current function value: 0.348904
Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:                 target   No. Observations:                  303
Model:                          Logit   Df Residuals:                      289
Method:                           MLE   Df Model:                           13
Date:                Sat, 03 Jul 2021   Pseudo R-squ.:                  0.4937
Time:                        14:47:19   Log-Likelihood:                -105.72
converged:                       True   LL-Null:                       -208.82
Covariance Type:            nonrobust   LLR p-value:                 7.262e-37
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.4505      2.571      1.342      0.180      -1.590       8.490
age           -0.0049      0.023     -0.212      0.832      -0.050       0.041
sex           -1.7582      0.469     -3.751      0.000      -2.677      -0.839
cp             0.8599      0.185      4.638      0.000       0.496       1.223
trestbps      -0.0195      0.010     -1.884      0.060      -0.040       0.001
chol          -0.0046      0.004     -1.224      0.221      -0.012       0.003
fbs            0.0349      0.529      0.066      0.947      -1.003       1.073
restecg        0.4663      0.348      1.339      0.181      -0.216       1.149
thalach        0.0232      0.010      2.219      0.026       0.003       0.044
exang         -0.9800      0.410     -2.391      0.017      -1.783      -0.177
oldpeak       -0.5403      0.214     -2.526      0.012      -0.959      -0.121
slope          0.5793      0.350      1.656      0.098      -0.106       1.265
ca            -0.7733      0.191     -4.051      0.000      -1.147      -0.399
thal          -0.9004      0.290     -3.104      0.002      -1.469      -0.332
==============================================================================

Brief Model Interpretation

We see that the sex, cp, thalach, exang, oldpeak, ca, and thal variables are significantly associated with heart attack (we are not inferring causality in this problem).
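To make the logit coefficients easier to read, you can exponentiate them into odds ratios (a standard transformation, added here for illustration; pandas and numpy are already imported above):

odds_ratios = np.exp(fit_logit.params)   # e^beta: multiplicative change in the odds per unit increase
conf_int = np.exp(fit_logit.conf_int())  # 95% confidence intervals on the odds-ratio scale
print(pd.concat([odds_ratios.rename('OR'), conf_int], axis=1))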

Cross Validation using the Validation Set Approach

Let's split our data into two sets, i.e. train and test.

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.3, random_state = 1)

Let's fit the model on the train data and look at the test error rate for comparison.

fit_logit_train = logit("target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal", train).fit()
Optimization terminated successfully.
Current function value: 0.317208
Iterations 8

# predict probabilities on the held-out test set
test_pred = fit_logit_train.predict(test)

# converting a probability to a class label
def convert_prob_to_label(prob, cutoff=0.5):
    if prob > cutoff:
        label = 1
    else:
        label = 0
    return label

pred_labels = list(map(convert_prob_to_label, test_pred))
pred_labels = np.asarray(pred_labels)

from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(test.target, pred_labels)
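As an aside, the same probability-to-label conversion can be done in one vectorized line instead of the loop above (an equivalent alternative, not from the original post):

pred_labels = np.asarray(test_pred > 0.5, dtype=int)  # True/False becomes 1/0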

From the above confusion matrix, we can calculate the misclassification rate (the off-diagonal counts divided by the number of test observations) as:

mis_rate = (conf_matrix[1, 0] + conf_matrix[0, 1]) / len(test)
print(f"Misclassification rate = {mis_rate :.3f}")
Misclassification rate = 0.220

We got a good model to start with. Please note that we have not run any model selection yet. However, this misclassification rate depends on the particular train/test split and could partly be due to chance, so a different split might give a noticeably different estimate. To get a more stable estimate of the test error / misclassification rate, we can use k-fold cross validation.

k-Fold Cross Validation with Sklearn

from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = df.loc[:, ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']]
y = df['target']  # 1-D target vector, as sklearn expects

# 10 folds repeated 10 times = 100 train/test splits
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=1)
model = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on this data
# cross_val_score reports accuracy, so 1 - accuracy is the misclassification rate
scores = 1 - cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

import seaborn as sns
ax = sns.histplot(x=scores, kde=True)
(Output: histogram with KDE of the misclassification rates across the 100 cross-validation splits)

The histogram above clearly shows the variability in the test error. Instead of providing a single-number estimate of the test error, it is always better to report the mean together with the standard deviation (or standard error) of the test error for decision making.

Conclusion

print(f"Mean of misclassification error rate in test date is, {np.mean(scores) : .3f} with standard deviation = {np.std(scores) : .4f} ")Mean of misclassification error rate in test date is,  0.165 with standard deviation =  0.0693

We saw that cross validation helps us get stable and more robust estimates of the test error. In the next blog, we will do the same thing using the bootstrap method.
