# SOFTWARE DEFECT PREDICTION USING CLASSIFICATION MODELS

Software bugs remain one of the most actively studied topics in software engineering research. Software companies spend a great deal of money on testing, yet bugs still reach production. As a proactive measure, they also invest heavily in predicting bugs and mitigating the risk they pose. Once software reaches the end user, fixing a bug costs the product owner considerable effort, time, money, and resources.

Software defect prediction is predicting the defects, not defecting the predicts 🙂

Many techniques exist for predicting bugs, and organizations can incorporate them into their development process.

This walkthrough applies one of them, supervised classification, to the JM1 dataset.

```#Import the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as mpl
%matplotlib inline
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
import plotly as ply
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier  # needed by the random forest cell
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score, precision_score, jaccard_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from genetic_selection import GeneticSelectionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import DBSCAN
from matplotlib import cm
from matplotlib.colors import ListedColormap
from sklearn.model_selection import KFold
from keras.models import Sequential
from keras.layers import Dense
import tensorflow as tf
from ypstruct import structure
import random
from sklearn.metrics import classification_report
from random import randint
import warnings
warnings.filterwarnings("ignore")
import os
# List the files available in the data directory
for path, _, files in os.walk('Y:/Data science/archive/'):
    for f in files:
        print(os.path.join(path, f))
```
```Y:/Data science/archive/about JM1 Dataset.txt
Y:/Data science/archive/jm1.arff
Y:/Data science/archive/jm1.csv
```

In [2]:

```colors = [ '#d62728','#2ca02c']

#Colour palette
#colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
#          '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
```

In [3]:

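```#read the csv
# Path taken from the file listing printed above
sd = pd.read_csv('Y:/Data science/archive/jm1.csv')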
```

In [4]:

```sd.describe().T
```

In [5]:

```sd.info()
```
```<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10885 entries, 0 to 10884
Data columns (total 22 columns):
#   Column             Non-Null Count  Dtype
---  ------             --------------  -----
0   loc                10885 non-null  float64
1   v(g)               10885 non-null  float64
2   ev(g)              10885 non-null  float64
3   iv(g)              10885 non-null  float64
4   n                  10885 non-null  float64
5   v                  10885 non-null  float64
6   l                  10885 non-null  float64
7   d                  10885 non-null  float64
8   i                  10885 non-null  float64
9   e                  10885 non-null  float64
10  b                  10885 non-null  float64
11  t                  10885 non-null  float64
12  lOCode             10885 non-null  int64
13  lOComment          10885 non-null  int64
14  lOBlank            10885 non-null  int64
15  locCodeAndComment  10885 non-null  int64
16  uniq_Op            10885 non-null  object
17  uniq_Opnd          10885 non-null  object
18  total_Op           10885 non-null  object
19  total_Opnd         10885 non-null  object
20  branchCount        10885 non-null  object
21  defects            10885 non-null  bool
dtypes: bool(1), float64(12), int64(4), object(5)
memory usage: 1.8+ MB
```

In [6]:

```#Convert the object type to numeric
for x in sd.iloc[:, 16:21]:
    # errors='coerce' turns entries that cannot be parsed into NaN
    sd[x] = pd.to_numeric(sd[x], errors='coerce')
```
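To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up frame. The column values and the `'?'` placeholder are assumptions for illustration and are not read from the JM1 file.

```python
import pandas as pd

# Hypothetical toy frame: numbers stored as strings, with a '?' placeholder
toy = pd.DataFrame({"uniq_Op": ["17", "11", "?"],
                    "branchCount": ["13", "?", "7"]})

# Anything that cannot be parsed becomes NaN instead of raising an error
for col in toy.columns:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)        # both columns become float64
print(toy.isna().sum())  # one NaN per column in this toy example
```

The NaN values created this way are what the later `dropna` step removes.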

In [7]:

```#Determine the class imbalance in the target variable
print(sd.groupby('defects').size())
```
```defects
False    8779
True     2106
dtype: int64
```

In [8]:

```#Check Unique values on the dataset to take care of high cardinality features
sd.nunique()
```

Out[8]:

```loc                   365
v(g)                  108
ev(g)                  74
iv(g)                  82
n                     806
v                    3991
l                      55
d                    2695
i                    4268
e                    6978
b                     310
t                    6761
lOCode                291
lOComment              88
lOBlank                95
locCodeAndComment      30
uniq_Op                68
uniq_Opnd             171
total_Op              581
total_Opnd            468
branchCount           146
defects                 2
dtype: int64```

In [9]:

```#Plot the correlation using heatmap

cor = sd.corr()
sns.set(rc={'figure.figsize':(15,12)})

sns.heatmap(cor,annot=True, linewidths = .5, fmt = '.2f',cmap='cubehelix')
```

Out[9]:

`<AxesSubplot:>`

In [10]:

```#Check for relations and insights between volume of code and the number of defective code
sns.set(rc={'figure.figsize':(20,12)})
fig=sns.scatterplot(data=sd,x=sd['v'],y=sd['b'], hue=sd['defects'],s=100,marker='o',palette=['#2ca02c','#d62728'])
fig.set(xlabel='Volume', ylabel='Bug')
mpl.show()

#Inference - the higher the volume of code, the higher the bug estimate
```

In [11]:

```sns.set(rc={'figure.figsize':(20,12)})
fig=sns.scatterplot(data=sd,x=sd['loc'],y=sd['v'], hue=sd['defects'],s=100,marker='o',palette=['#2ca02c','#d62728'])
fig.set(xlabel='Lines of code', ylabel='Volume')
mpl.show()

# Defects appear even where code volume is low, so low volume does not guarantee bug-free code
```

In [12]:

```sns.set(rc={'figure.figsize':(20,12)})
fig=sns.scatterplot(data=sd,x=sd['v'],y=sd['d'], hue=sd['defects'],s=100,marker='o',palette=['#2ca02c','#d62728'])
fig.set(xlabel='Volume', ylabel='Difficulty')
mpl.show()

#Difficulty rises with code volume, and defects are more frequent at higher difficulty
```

In [13]:

```sns.set(rc={'figure.figsize':(20,12)})
fig=sns.scatterplot(data=sd,x=sd['v'],y=sd['e'], hue=sd['defects'],s=100,marker='o',palette=['#2ca02c','#d62728'])
fig.set(xlabel='Volume', ylabel='Effort')
mpl.show()

#Effort also rises with code volume, and defects are more frequent at higher effort
```

In [14]:

```# sns.set(rc={'figure.figsize':(15,5)})

#line of code vs time,difficulty,intelligence and effort

fig, axes = mpl.subplots(2, 2, sharex=True, figsize=(20,12))

fig = sns.lineplot(x=sd['loc'],y=sd['t'],ax=axes[0,0])
# fig.set(xlabel='Line of Code', ylabel='Time')

sns.lineplot(x=sd['loc'],y=sd['d'],color='r',ax=axes[0,1])  # palette has no effect without hue, so set color directly

sns.lineplot(x=sd['loc'],y=sd['i'],ax=axes[1,0])

sns.lineplot(x=sd['loc'],y=sd['e'],ax=axes[1,1])
```

Out[14]:

`<AxesSubplot:xlabel='loc', ylabel='e'>`

In [15]:

```#3D plot to visualize the volume, time and defects

sns.set(style = "darkgrid")
f = mpl.figure(figsize=(20,12))
ax = f.add_subplot(111, projection = '3d')

x = sd['v']
y = sd['t']
z = sd['defects']

ax.set_xlabel("Volume")
ax.set_ylabel("Time")
ax.set_zlabel("Defects")
cmap = ListedColormap(sns.color_palette("husl", 256).as_hex())
ax.scatter(x, y, z, s=100, c=x, cmap=cmap, alpha=1)

mpl.show()
```

In [16]:

```#Outlier analysis
sns.set(rc={'figure.figsize':(20,12)})
sns.boxplot(x=sd['uniq_Op'],color='r')
```

Out[16]:

`<AxesSubplot:xlabel='uniq_Op'>`
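The box plot marks points beyond 1.5 times the interquartile range as outliers. A minimal sketch of that rule follows; the helper name `iqr_bounds` and the sample values are made up for illustration and do not come from the dataset.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the 1.5 * IQR box-plot rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one obviously extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
low, high = iqr_bounds(sample)
outliers = sample[(sample < low) | (sample > high)]
print(low, high, outliers)
```

The same function could be applied to a column such as `sd['uniq_Op']` to count how many rows fall outside the fences before deciding whether to keep them.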

In [17]:

```# % of defect and non defect
100 * (sd['defects'].value_counts()/len(sd))
```

Out[17]:

```False    80.652274
True     19.347726
Name: defects, dtype: float64```

In [18]:

```#Class imbalance of defects

mpl.figure(figsize=(7, 5))
ax = sns.countplot(x=sd['defects'],palette=['#2ca02c', '#d62728'])
mpl.xticks(size=12)
mpl.xlabel('Defects', size=14)
mpl.yticks(size=12)
mpl.ylabel('Count', size=12)
mpl.title("Class Distribution of Defects", size=16)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")

# Annotate each bar with its share of all rows
t = len(sd)
for p in ax.patches:
    percentage = f'{100 * p.get_height() / t:.1f}%\n'
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center', va='center')
mpl.tight_layout()
mpl.show()

#The classes are imbalanced:
#minority ---> defective=19.3% < majority ---> non-defective=80.7%
```
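With roughly 80% of modules non-defective, a model that always answers "not defective" already scores about 0.8 accuracy while finding no defects at all. The sketch below shows this with `DummyClassifier` on synthetic labels; the arrays `X_toy` and `y_toy` are stand-ins, not the JM1 data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with an 80/20 imbalance (not the JM1 data)
y_toy = np.array([False] * 80 + [True] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

print("accuracy:", accuracy_score(y_toy, y_hat))  # 0.8
print("recall  :", recall_score(y_toy, y_hat))    # 0.0
```

This is why accuracy alone can be misleading on this dataset, and why recall and F1 on the defective class are worth reading alongside it.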

In [19]:

```#Drop na values
sd.dropna(inplace=True)
sd.isna().any().sum()
```

Out[19]:

`0`

In [20]:

```#drop duplicates if any
sd.drop_duplicates(subset=None, inplace=True)
sd.shape
```

Out[20]:

`(8907, 22)`

In [21]:

```#Verify nulls
sd.isnull().sum()
```

Out[21]:

```loc                  0
v(g)                 0
ev(g)                0
iv(g)                0
n                    0
v                    0
l                    0
d                    0
i                    0
e                    0
b                    0
t                    0
lOCode               0
lOComment            0
lOBlank              0
locCodeAndComment    0
uniq_Op              0
uniq_Opnd            0
total_Op             0
total_Opnd           0
branchCount          0
defects              0
dtype: int64```

In [23]:

```# Plot upper triangle to remove highly correlated features
uptr = cor.where(np.triu(np.ones(cor.shape), k=1).astype(bool))  # np.bool was removed in NumPy 1.24
sns.heatmap(uptr,annot = True, linewidths = .5, fmt = '.2f')
```

Out[23]:

`<AxesSubplot:>`

In [24]:

```# Identify features whose correlation with another feature exceeds 0.95
drop_feature = [column for column in uptr.columns if any(uptr[column] > 0.95)]

drop_feature
```

Out[24]:

`['v', 'b', 't', 'lOCode', 'total_Op', 'total_Opnd', 'branchCount']`
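The upper-triangle filter can be sketched as a small reusable function. The name `highly_correlated` and the toy frame are illustrative assumptions; the logic mirrors the two cells above.

```python
import numpy as np
import pandas as pd

def highly_correlated(df, threshold=0.95):
    """Return columns whose correlation with an earlier column exceeds threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Made-up data: 'b' is an exact multiple of 'a', 'c' is unrelated
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],
                    "c": [5, 1, 4, 2, 3]})
print(highly_correlated(toy))  # ['b']
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is flagged. Like the original cell, this checks only positive correlations; comparing `upper[col].abs()` instead would also catch strong negative ones.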

In [25]:

```sd = sd.drop(columns=drop_feature)
sd
```

Out[25]:

8907 rows × 15 columns

In [26]:

```#Split independent and dependent variable
X=sd.drop(["defects"],axis=1).copy()
print(X.head())
print('*'*40)
y=sd[["defects"]]
```
```     loc  v(g)  ev(g)  iv(g)      n     l      d       i         e  lOComment  \
0    1.1   1.4    1.4    1.4    1.3  1.30   1.30    1.30      1.30          2
1    1.0   1.0    1.0    1.0    1.0  1.00   1.00    1.00      1.00          1
2   72.0   7.0    1.0    6.0  198.0  0.05  20.31   55.85  23029.10         10
3  190.0   3.0    1.0    3.0  600.0  0.06  17.06  254.87  74202.67         29
4   37.0   4.0    1.0    4.0  126.0  0.06  17.19   34.86  10297.30          1

   lOBlank  locCodeAndComment  uniq_Op  uniq_Opnd
0        2                  2      1.2        1.2
1        1                  1      1.0        1.0
2        8                  1     17.0       36.0
3       28                  2     17.0      135.0
4        6                  0     11.0       16.0
****************************************
```

In [27]:

```#Train test split
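train_size=0.8   # 80% of the rows for training
test_size=0.5    # the remaining 20% is split evenly into validation and test sets
X_train, X_bal, y_train, y_bal = train_test_split(X, y, train_size=train_size)
X_val, X_test, y_val, y_test = train_test_split(X_bal, y_bal, test_size=test_size)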
```
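The split above is random on every run and does not preserve the class ratio. One possible variant, shown here as a sketch on synthetic data, passes `stratify` and `random_state` to `train_test_split` so that each subset keeps the same share of defective rows and the split is reproducible. The arrays and the seed value are assumptions, and the results printed in this post were not produced with this variant.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y with a 20% positive class (not the JM1 data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([True] * 200 + [False] * 800)

# 80% train, then split the remaining 20% evenly into validation and test
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(X_tr.shape, X_va.shape, X_te.shape)
print(y_tr.mean(), y_va.mean(), y_te.mean())  # positive rate in each subset
```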

In [28]:

```print("X_train ===> ",X_train.shape)
print("y_train ===> ",y_train.shape)
print("X_test ===> ",X_test.shape)
print("y_test ===> ",y_test.shape)
```
```X_train ===>  (7125, 14)
y_train ===>  (7125, 1)
X_test ===>  (891, 14)
y_test ===>  (891, 1)
```

In [29]:

```#Logistic regression

lr = LogisticRegression()
lr.fit(X_train,y_train)
y_pred_lr = lr.predict(X_test)
print('Acc:{}'.format(accuracy_score(y_test,y_pred_lr)))
print('F1 score:', f1_score(y_test, y_pred_lr))

print('Recall:', recall_score(y_test, y_pred_lr))

print('Precision:', precision_score(y_test, y_pred_lr))
print('Jaccard:', jaccard_score(y_test, y_pred_lr))
print('Confusion Matrix:{}'.format(confusion_matrix(y_test,y_pred_lr)))

scoring = 'accuracy'
kfold = KFold(n_splits = 10)
cv_results = cross_val_score(lr, X_train, y_train, cv = kfold, scoring = scoring)
cv_results

#Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_lr))
print(confusion_matrix(y_test, y_pred_lr))
#Accuracy score
from sklearn.metrics import accuracy_score
print("ACC: ",accuracy_score(y_pred_lr,y_test))
```
```Acc:0.7643097643097643
F1 score: 0.2857142857142857
Recall: 0.21428571428571427
Precision: 0.42857142857142855
Jaccard: 0.16666666666666666
Confusion Matrix:[[639  56]
 [154  42]]
              precision    recall  f1-score   support

       False       0.81      0.92      0.86       695
        True       0.43      0.21      0.29       196

    accuracy                           0.76       891
   macro avg       0.62      0.57      0.57       891
weighted avg       0.72      0.76      0.73       891

[[639  56]
 [154  42]]
ACC:  0.7643097643097643
```
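Each score printed for logistic regression can be recomputed by hand from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` with the defective class as positive. The short worked example below uses only the four counts reported above.

```python
# Confusion matrix reported above, laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 639, 56, 154, 42

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of flagged modules that are defective
recall = tp / (tp + fn)                      # share of defective modules that were flagged
f1 = 2 * precision * recall / (precision + recall)
jaccard = tp / (tp + fp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} jaccard={jaccard:.4f}")
```

Recall is the number to watch here: 154 of the 196 defective modules in the test set were missed.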

In [30]:

```#Naive bayes
nb = GaussianNB()
nb.fit(X_train,y_train)
y_pred_nb = nb.predict(X_test)
print('Acc:{}'.format(accuracy_score(y_test,y_pred_nb)))
print('F1 score:', f1_score(y_test, y_pred_nb))

print('Recall:', recall_score(y_test, y_pred_nb))

print('Precision:', precision_score(y_test, y_pred_nb))
print('Jaccard:', jaccard_score(y_test, y_pred_nb))
print('Confusion Matrix:{}'.format(confusion_matrix(y_test,y_pred_nb)))

scoring = 'accuracy'
kfold = KFold(n_splits = 10)
cv_results = cross_val_score(nb, X_train, y_train, cv = kfold, scoring = scoring)
cv_results

#Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_nb))
print(confusion_matrix(y_test, y_pred_nb))
#Accuracy score
from sklearn.metrics import accuracy_score
print("ACC: ",accuracy_score(y_pred_nb,y_test))
```
```Acc:0.77665544332211
F1 score: 0.19433198380566802
Recall: 0.12244897959183673
Precision: 0.47058823529411764
Jaccard: 0.10762331838565023
Confusion Matrix:[[668  27]
 [172  24]]
              precision    recall  f1-score   support

       False       0.80      0.96      0.87       695
        True       0.47      0.12      0.19       196

    accuracy                           0.78       891
   macro avg       0.63      0.54      0.53       891
weighted avg       0.72      0.78      0.72       891

[[668  27]
 [172  24]]
ACC:  0.77665544332211
```

In [31]:

```#Stochastic gradient descent

sgd= SGDClassifier()
sgd.fit(X_train,y_train)
y_pred_sgd = sgd.predict(X_test)
print('Acc:{}'.format(accuracy_score(y_test,y_pred_sgd)))
print('F1 score:', f1_score(y_test, y_pred_sgd))

print('Recall:', recall_score(y_test, y_pred_sgd))

print('Precision:', precision_score(y_test, y_pred_sgd))
print('Jaccard:', jaccard_score(y_test, y_pred_sgd))
print('Confusion Matrix:{}'.format(confusion_matrix(y_test,y_pred_sgd)))

scoring = 'accuracy'
kfold = KFold(n_splits = 10)
cv_results = cross_val_score(sgd, X_train, y_train, cv = kfold, scoring = scoring)
cv_results

#Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_sgd))
print(confusion_matrix(y_test, y_pred_sgd))
#Accuracy score
from sklearn.metrics import accuracy_score
print("ACC: ",accuracy_score(y_pred_sgd,y_test))
```
```Acc:0.7800224466891134
F1 score: 0.0
Recall: 0.0
Precision: 0.0
Jaccard: 0.0
Confusion Matrix:[[695   0]
 [196   0]]
              precision    recall  f1-score   support

       False       0.78      1.00      0.88       695
        True       0.00      0.00      0.00       196

    accuracy                           0.78       891
   macro avg       0.39      0.50      0.44       891
weighted avg       0.61      0.78      0.68       891

[[695   0]
 [196   0]]
ACC:  0.7800224466891134
```
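In the run above the SGD classifier labelled every test module as non-defective. Linear models trained by gradient descent are sensitive to feature scale, and the features here range from fractions (`l`) to tens of thousands (`e`), so standardizing them first is a common thing to try. The sketch below wraps `StandardScaler` and `SGDClassifier` in a pipeline on synthetic data. It is an illustration under assumed data, and whether it changes the JM1 result has not been checked here.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with very different feature scales (not the JM1 data)
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(0, 0.05, 500),
                         rng.normal(0, 20000, 500)])
y_toy = X_toy[:, 0] > 0  # the label depends only on the small-scale feature

# The scaler is fitted inside the pipeline, so it only ever sees the data passed to fit
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
scaled_sgd.fit(X_toy, y_toy)
print("training accuracy:", scaled_sgd.score(X_toy, y_toy))
```

On the real data the pipeline would be fitted on `X_train` and scored on `X_test`, so that the scaling statistics never leak from the test set.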

In [32]:

```#K nearest neighbour

knn= KNeighborsClassifier()
knn.fit(X_train,y_train)
y_pred_knn = knn.predict(X_test)
print('Acc:{}'.format(accuracy_score(y_test,y_pred_knn)))
print('F1 score:', f1_score(y_test, y_pred_knn))

print('Recall:', recall_score(y_test, y_pred_knn))

print('Precision:', precision_score(y_test, y_pred_knn))
print('Jaccard:', jaccard_score(y_test, y_pred_knn))
print('Confusion Matrix:{}'.format(confusion_matrix(y_test,y_pred_knn)))

scoring = 'accuracy'
kfold = KFold(n_splits = 10)
cv_results = cross_val_score(knn, X_train, y_train, cv = kfold, scoring = scoring)
cv_results

#Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_knn))
print(confusion_matrix(y_test, y_pred_knn))
#Accuracy score
from sklearn.metrics import accuracy_score
print("ACC: ",accuracy_score(y_pred_knn,y_test))
```
```Acc:0.7384960718294051
F1 score: 0.22591362126245845
Recall: 0.17346938775510204
Precision: 0.3238095238095238
Jaccard: 0.12734082397003746
Confusion Matrix:[[624  71]
 [162  34]]
              precision    recall  f1-score   support

       False       0.79      0.90      0.84       695
        True       0.32      0.17      0.23       196

    accuracy                           0.74       891
   macro avg       0.56      0.54      0.53       891
weighted avg       0.69      0.74      0.71       891

[[624  71]
 [162  34]]
ACC:  0.7384960718294051
```

In [33]:

```#Decision tree

dt= DecisionTreeClassifier()
dt.fit(X_train,y_train)
y_pred_dt = dt.predict(X_test)
print('Acc:{}'.format(accuracy_score(y_test,y_pred_dt)))
print('F1 score:', f1_score(y_test, y_pred_dt))

print('Recall:', recall_score(y_test, y_pred_dt))

print('Precision:', precision_score(y_test, y_pred_dt))
print('Jaccard:', jaccard_score(y_test, y_pred_dt))
print('Confusion Matrix:{}'.format(confusion_matrix(y_test,y_pred_dt)))

scoring = 'accuracy'
kfold = KFold(n_splits = 10)
cv_results = cross_val_score(dt, X_train, y_train, cv = kfold, scoring = scoring)
cv_results

#Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_dt))
print(confusion_matrix(y_test, y_pred_dt))
#Accuracy score
from sklearn.metrics import accuracy_score
print("ACC: ",accuracy_score(y_pred_dt,y_test))
```
```Acc:0.6734006734006734
F1 score: 0.34606741573033706
Recall: 0.39285714285714285
Precision: 0.3092369477911647
Jaccard: 0.20923913043478262
Confusion Matrix:[[523 172]
 [119  77]]
              precision    recall  f1-score   support

       False       0.81      0.75      0.78       695
        True       0.31      0.39      0.35       196

    accuracy                           0.67       891
   macro avg       0.56      0.57      0.56       891
weighted avg       0.70      0.67      0.69       891

[[523 172]
 [119  77]]
ACC:  0.6734006734006734
```

In [34]:

```#Random forest

rf= RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred_rf = rf.predict(X_test)
print('Acc:{}'.format(accuracy_score(y_test,y_pred_rf)))
print('F1 score:', f1_score(y_test, y_pred_rf))

print('Recall:', recall_score(y_test, y_pred_rf))

print('Precision:', precision_score(y_test, y_pred_rf))
print('Jaccard:', jaccard_score(y_test, y_pred_rf))
print('Confusion Matrix:{}'.format(confusion_matrix(y_test,y_pred_rf)))

scoring = 'accuracy'
kfold = KFold(n_splits = 10)
cv_results = cross_val_score(rf, X_train, y_train, cv = kfold, scoring = scoring)
cv_results

#Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))
#Accuracy score
from sklearn.metrics import accuracy_score
print("ACC: ",accuracy_score(y_pred_rf,y_test))
```
```Acc:0.7833894500561167
F1 score: 0.3275261324041812
Recall: 0.23979591836734693
Precision: 0.5164835164835165
Jaccard: 0.19583333333333333
Confusion Matrix:[[651  44]
 [149  47]]
              precision    recall  f1-score   support

       False       0.81      0.94      0.87       695
        True       0.52      0.24      0.33       196

    accuracy                           0.78       891
   macro avg       0.67      0.59      0.60       891
weighted avg       0.75      0.78      0.75       891

[[651  44]
 [149  47]]
ACC:  0.7833894500561167
```

In [36]:

```#Support vector

svm= SVC()
svm.fit(X_train,y_train)
y_pred_svm = svm.predict(X_test)
print('Acc:{}'.format(accuracy_score(y_test,y_pred_svm)))
print('F1 score:', f1_score(y_test, y_pred_svm))

print('Recall:', recall_score(y_test, y_pred_svm))

print('Precision:', precision_score(y_test, y_pred_svm))
print('Jaccard:', jaccard_score(y_test, y_pred_svm))
print('Confusion Matrix:{}'.format(confusion_matrix(y_test,y_pred_svm)))

scoring = 'accuracy'
kfold = KFold(n_splits = 10)
cv_results = cross_val_score(svm, X_train, y_train, cv = kfold, scoring = scoring)
cv_results

#Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_svm))
print(confusion_matrix(y_test, y_pred_svm))
#Accuracy score
from sklearn.metrics import accuracy_score
print("ACC: ",accuracy_score(y_test,y_pred_svm))
```
```Acc:0.7800224466891134
F1 score: 0.019999999999999997
Recall: 0.01020408163265306
Precision: 0.5
Jaccard: 0.010101010101010102
Confusion Matrix:[[693   2]
 [194   2]]
              precision    recall  f1-score   support

       False       0.78      1.00      0.88       695
        True       0.50      0.01      0.02       196

    accuracy                           0.78       891
   macro avg       0.64      0.50      0.45       891
weighted avg       0.72      0.78      0.69       891

[[693   2]
 [194   2]]
ACC:  0.7800224466891134
```
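Every model cell above computes `cv_results` but never prints it, and each repeats the same evaluation code. A compact alternative is to loop over the models and report a cross-validated score for each. The sketch below does this on synthetic data with stratified folds and F1 scoring. The dataset, the fold count, and the choice of three models are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for X_train / y_train (not the JM1 data)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_f1 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="f1")
    cv_f1[name] = scores.mean()
    print(f"{name:20s} mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Scoring on F1 instead of accuracy makes the comparison reflect how well each model finds the minority class.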

In [ ]:

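`#Random forest has the highest test accuracy (0.783) and precision (0.52), but only narrowly: SGD and SVM reach 0.780 while flagging almost no defects, and the decision tree has the best recall (0.39) and F1 (0.35) on the defective class`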
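To compare the seven classifiers side by side, the test-set metrics printed above can be collected into one table. The numbers below are copied from those outputs and rounded to three decimals; nothing is recomputed.

```python
import pandas as pd

# Test-set metrics copied from the outputs above (defective class = positive)
results = pd.DataFrame(
    {
        "accuracy":  [0.764, 0.777, 0.780, 0.738, 0.673, 0.783, 0.780],
        "precision": [0.429, 0.471, 0.000, 0.324, 0.309, 0.516, 0.500],
        "recall":    [0.214, 0.122, 0.000, 0.173, 0.393, 0.240, 0.010],
        "f1":        [0.286, 0.194, 0.000, 0.226, 0.346, 0.328, 0.020],
    },
    index=["Logistic regression", "Naive Bayes", "SGD", "KNN",
           "Decision tree", "Random forest", "SVM"],
)

print(results.sort_values("f1", ascending=False))
print("best accuracy:", results["accuracy"].idxmax())
print("best F1      :", results["f1"].idxmax())
```

Which model counts as "best" depends on the metric: ranking by accuracy and ranking by F1 put different classifiers on top.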