Visualizing Data for Classification

In this lab, we’ll explore the German bank credit dataset to understand relationships for a classification problem. Unlike regression problems where the label is a continuous variable, classification problems involve categorical labels. We aim to visually explore the data to identify features useful in predicting customers with bad credit.

Load and Prepare the Dataset

Let’s start by loading the necessary packages and the dataset. The dataset contains information about bank customers, including both numeric and categorical features. The goal is to predict whether a customer has bad credit or not.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import numpy.random as nr
import math

%matplotlib inline

credit = pd.read_csv('German_Credit.csv', header=None)
credit.columns = ['customer_id', 'checking_account_status', 'loan_duration_mo', 'credit_history', 
                   'purpose', 'loan_amount', 'savings_account_balance', 'time_employed_yrs', 
                   'payment_pcnt_income','gender_status', 'other_signators', 'time_in_residence', 
                   'property', 'age_yrs', 'other_credit_outstanding', 'home_ownership', 
                   'number_loans', 'job_category', 'dependents', 'telephone', 'foreign_worker', 
                   'bad_credit']

credit.drop(['customer_id'], axis=1, inplace=True)
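
As a quick optional check (not part of the original lab code), you can confirm the result of the load and the drop of 'customer_id':

# Optional check: confirm that 20 features plus the label column remain
print(credit.shape)
print(credit.dtypes)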

Now we have 21 columns: 20 features and the label column ('bad_credit'). Next, we recode the categorical features so their values are human-readable.

# Recoding categorical features
code_list = [['checking_account_status', {'A11': '< 0 DM', 'A12': '0 - 200 DM', ... }],
             ['credit_history', {'A30': 'no credit - paid', 'A31': 'all loans at bank paid', ... }],
             ...]

# Map each coded value to its human-readable equivalent
for col, dic in code_list:
    credit[col] = [dic[x] for x in credit[col]]

Now, the categorical features have more human-readable codes.
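
To verify the recoding, you can inspect one of the transformed columns; the snippet below is a small optional check, not part of the original lab code.

# Optional check: recoded columns should now contain readable category names
print(credit['checking_account_status'].unique())
print(credit.head())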

Examine Classes and Class Imbalance

Before visualizing, let’s check for class imbalance in the label (‘bad_credit’).

credit_counts = credit['bad_credit'].value_counts()
print(credit_counts)

There are 710 cases with good credit and only 302 cases with bad credit, so roughly 70% of customers fall in the good-credit class. This class imbalance is worth keeping in mind when training and evaluating a classifier.
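
A normalized count or a simple bar chart makes the imbalance easier to see. The snippet below is one way to do this; it is an addition to the original lab code.

# View class frequencies as proportions and as a simple bar chart
print(credit['bad_credit'].value_counts(normalize=True))
credit['bad_credit'].value_counts().plot(kind='bar')
plt.xlabel('bad_credit')
plt.ylabel('count')
plt.title('Class counts for the bad_credit label')
plt.show()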

Visualize Class Separation by Numeric Features

We’ll visualize the separation quality of numeric features using box plots.

num_cols = ['loan_duration_mo', 'loan_amount', 'payment_pcnt_income', 'age_yrs', 'number_loans', 'dependents']
plot_box(credit, num_cols)
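
The plot_box helper is not shown in this excerpt. A minimal sketch, assuming it simply draws a seaborn box plot of each numeric column against the label (the col_y default and figure layout are assumptions), might look like this:

def plot_box(credit, cols, col_y='bad_credit'):
    # Draw one box plot per numeric column, split by the label on the x-axis
    for col in cols:
        fig = plt.figure(figsize=(6, 6))
        ax = fig.gca()
        sns.boxplot(x=col_y, y=col, data=credit, ax=ax)
        ax.set_title('Box plot of ' + col + ' vs. ' + col_y)
        ax.set_xlabel(col_y)
        ax.set_ylabel(col)
        plt.show()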

Interpretation:

  • Features like loan_duration_mo, loan_amount, and payment_pcnt_income show useful separation between good and bad credit customers.
  • On the other hand, age_yrs, number_loans, and dependents seem less useful for separation.

We can also use violin plots for a different perspective.

plot_violin(credit, num_cols)
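
Likewise, plot_violin is not shown here; a comparable sketch using seaborn's violinplot (same assumptions as for plot_box) could be:

def plot_violin(credit, cols, col_y='bad_credit'):
    # Draw one violin plot per numeric column, split by the label on the x-axis
    for col in cols:
        fig = plt.figure(figsize=(6, 6))
        ax = fig.gca()
        sns.violinplot(x=col_y, y=col, data=credit, ax=ax)
        ax.set_title('Violin plot of ' + col + ' vs. ' + col_y)
        ax.set_xlabel(col_y)
        ax.set_ylabel(col)
        plt.show()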

Visualize Class Separation by Categorical Features

Now, we’ll visualize the ability of categorical features to separate classes using bar plots.

cat_cols = ['checking_account_status', 'credit_history', 'purpose', 'savings_account_balance', 
            'time_employed_yrs', 'gender_status', 'other_signators', 'property', 
            'other_credit_outstanding', 'home_ownership', 'job_category', 'telephone', 
            'foreign_worker']

# Add a column of ones so counts can be computed by group
credit['dummy'] = np.ones(shape=credit.shape[0])
for col in cat_cols:
    plot_categorical_feature(credit, col)
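
The plot_categorical_feature helper is also not shown in this excerpt. One possible sketch, assuming it uses the 'dummy' column added above to count cases per category separately for good and bad credit (the col_y default and side-by-side layout are assumptions), is:

def plot_categorical_feature(credit, col, col_y='bad_credit'):
    # Count cases per category, separately for good (0) and bad (1) credit,
    # using the 'dummy' column of ones added above
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    for ax, label in zip(axes, [0, 1]):
        counts = credit[credit[col_y] == label].groupby(col)['dummy'].sum()
        counts.plot(kind='bar', ax=ax)
        ax.set_title(col + ' for bad_credit = ' + str(label))
        ax.set_ylabel('count')
    plt.show()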

Interpretation:

  • Some features like checking_account_status and credit_history have significantly different distributions between good and bad credit customers.
  • Others like gender_status and telephone show small differences that might not be significant.
  • Features with a dominant category, such as other_signators, foreign_worker, home_ownership, and job_category, may have limited power for separation.

Summary

In this lab, we explored and visualized a classification dataset, examining class imbalance and identifying the numeric and categorical features most useful for class separation. Understanding these relationships is crucial for building effective classification models.
