
Introduction

The Story

According to the dermatology.names file, "The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very little differences. The diseases in this group are psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, cronic dermatitis, and pityriasis rubra pilaris.

Usually a biopsy is necessary for the diagnosis but unfortunately these diseases share many histopathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of another disease at the beginning stage and may have the characteristic features at the following stages. Patients were first evaluated clinically with 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the sample under a microscope."

Objective - Predicting The Class of Dermatology

This project aims to determine the type of Erythemato-Squamous Disease, i.e. the class of dermatology, based on the values of both the clinical and histopathological features of a patient. The data sets were sourced from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/dermatology. The project has two phases. Phase I focuses on data preprocessing and exploration. Phase II emphasizes feature selection, model building, evaluation, tuning and selection of the best model. This is the Phase II report, organized as follows. Section 1 describes the data sets and their attributes. Section 2 covers data pre-processing. In Section 3, I perform feature selection and build, tune, evaluate and select the best models. The last section presents limitations and a brief summary. Compiled from a Jupyter Notebook, this report contains both narrative and the Python code used for data pre-processing and the Phase II activities.

Remarks

Comprising 32 integer-encoded ordinal features, 1 linear feature and 1 categorical descriptive feature, this data set does not require feature scaling. Since the single categorical descriptive feature (family history) takes values of either 0 or 1, it will be one-hot encoded into 2 columns: with family history and without family history.

Methodology

I will utilize the following classifiers to predict the class of dermatology.

  • K-Nearest Neighbors (KNN),
  • Decision Tree (DT), and,
  • Random Forest (RF)

The dermatology data set is mostly clean, with only a few missing and invalid values in the Age feature; these will be replaced by the mean of the remaining Age values. Moreover, the single categorical descriptive feature, family history, will be converted with one-hot encoding to avoid misinterpretation by some algorithms. The whole data set will first be split into descriptive and target features. These will then be further split into two portions, 70% for model training and 30% for testing. The same portions will be used by all three classifiers for a fair performance comparison.

I will consider 10, 20 and the full set of features for feature selection. Together with a hyperparameter search, this allows me to identify the best set of parameters for building the best-performing K-nearest Neighbors, Decision Tree and Random Forest models. Classification reports, confusion matrices and paired t-tests are used to reveal and compare model performance. Since cross-validation is a random process, the paired t-test, a statistical test, is adopted to determine whether a difference in model performance is statistically significant. Finally, pipelines will be used to chain processes such as feature selection, hyperparameter tuning via cross-validation and model fitting, following the pattern sketched below.
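
All three models follow the same chained pattern; a minimal sketch (shown here with KNN and illustrative grid values) is given below, while the concrete pipelines and grids appear in the model-building sections.

# preview of the pipeline pattern used for all three classifiers (concrete versions appear later)
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([('fselector', SelectKBest(score_func=f_classif)),  # feature selection step
                 ('knn', KNeighborsClassifier())])                  # classifier step
params = {'fselector__k': [10, 20, 35],                             # 10, 20 or all descriptive features
          'knn__n_neighbors': range(1, 5)}                          # classifier hyperparameters
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)
gs = GridSearchCV(pipe, params, cv=cv_method, scoring='accuracy')
# gs.fit(X_train, y_train); gs.best_params_; gs.best_score_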

Data Sets

The UCI Machine Learning Repository provides two files, dermatology.data and dermatology.names. dermatology.data contains 366 rows of data, with altogether thirty-four attributes (descriptive features) in this database.

Thirty-two of them are ordinal, taking integer values from 0 to 3, where 0 indicates the feature is not present, 1 and 2 are the relative intermediate values, and 3 indicates the largest amount possible.

One of them, family history, is nominal with a value of either 0 or 1, where 1 indicates that at least one of these diseases has been observed in the patient's family and 0 otherwise.

There is also one target feature with values from 1 to 6, each representing a different class of dermatology.

The remaining descriptive feature, Age, is linear; the average of its 358 known values will replace the eight missing values for this attribute.

This data set will be used for both training and testing in Phase II, when models will be built and the best one selected for classifying the dermatology class of future patients.

Target Feature

Dermatology class (column class, coded 1 to 6) is the target feature of this data set, summarized as follows.

  • 1 - psoriasis (112 instances)
  • 2 - seboreic dermatitis (61)
  • 3 - lichen planus (72)
  • 4 - pityriasis rosea (49)
  • 5 - cronic dermatitis (52)
  • 6 - pityriasis rubra pilaris (20)

Descriptive Features

There are 12 clinical and 22 histopathological attributes, altogether thirty-four descriptive features, described below based on the dermatology.names file.

Clinical Attributes: (take values 0, 1, 2, 3, unless otherwise indicated)

  • c_erythema: erythema.
  • c_scaling: scaling.
  • c_dBorders: definite borders.
  • c_itching: itching.
  • c_kPhenomenon: koebner phenomenon.
  • c_pPapules: polygonal papules.
  • c_fPapules: follicular papules.
  • c_omInvolvement: oral mucosal involvement.
  • c_kneInvolvement: knee and elbow involvement.
  • c_sInvolvement: scalp involvement.
  • c_fHistory: family history, (0 or 1).
  • c_age: Age (linear).

Histopathological Attributes: (take values 0, 1, 2, 3)

  • h_mIncontinence: melanin incontinence.
  • h_eitInfiltrate: eosinophils in the infiltrate.
  • h_pInfiltrate: PNL infiltrate.
  • h_fotpDermis: fibrosis of the papillary dermis.
  • h_exocytosis: exocytosis.
  • h_acanthosis: acanthosis.
  • h_hyperkeratosis: hyperkeratosis.
  • h_parakeratosis: parakeratosis.
  • h_cotrRidges: clubbing of the rete ridges.
  • h_eotrRidges: elongation of the rete ridges.
  • h_totsEpidermis: thinning of the suprapapillary epidermis.
  • h_sPustule: spongiform pustule.
  • h_mMicroabcess: munro microabcess.
  • h_fHypergranulosis: focal hypergranulosis.
  • h_dotgLayer: disappearance of the granular layer.
  • h_vndobLayer: vacuolisation and damage of basal layer.
  • h_spongiosis: spongiosis.
  • h_saoRetes: saw-tooth appearance of retes.
  • h_fhPlug: follicular horn plug.
  • h_pParakeratosis: perifollicular parakeratosis.
  • h_imInflitrate: inflammatory mononuclear infiltrate.
  • h_bInfiltrate: band-like infiltrate.

Data Pre-processing

Preliminaries

The data sets are downloaded from the URL to the local machine for review. Since dermatology.data comes without a header row, it is loaded from the URL with header=None and the columns are renamed afterwards. The dermatology.names file is referenced for the meaning of the attributes whenever necessary.

In [1]:
import pandas as pd
import numpy as np

# Read Dermatology CSV data from url
url="http://archive.ics.uci.edu/ml/machine-learning-databases/\
dermatology/dermatology.data"
df = pd.read_csv(url,header=None)

# Rename columns
df.columns = ['c_erythema','c_scaling','c_dBorders','c_itching',
              'c_kPhenomenon','c_pPapules','c_fPapules','c_omInvolvement',
              'c_kneInvolvement','c_sInvolvement','c_fHistory',
              'h_mIncontinence','h_eitInfiltrate','h_pInfiltrate',
              'h_fotpDermis','h_exocytosis','h_acanthosis',
              'h_hyperkeratosis','h_parakeratosis','h_cotrRidges',
              'h_eotrRidges','h_totsEpidermis','h_sPustule','h_mMicroabcess',
              'h_fHypergranulosis','h_dotgLayer','h_vndobLayer','h_spongiosis',
              'h_saoRetes','h_fhPlug','h_pParakeratosis','h_imInflitrate',
              'h_bInfiltrate','c_age','class']

Data Cleansing and Transformation

First, confirm that the feature types match the description outlined in the documentation.

In [2]:
# Check and change data type for each column if necessary
print "Dimensions of data set:"
print df.shape
print "\nData types of data set:"
df.dtypes
Dimensions of data set:
(366, 35)

Data types of data set:
Out[2]:
c_erythema             int64
c_scaling              int64
c_dBorders             int64
c_itching              int64
c_kPhenomenon          int64
c_pPapules             int64
c_fPapules             int64
c_omInvolvement        int64
c_kneInvolvement       int64
c_sInvolvement         int64
c_fHistory             int64
h_mIncontinence        int64
h_eitInfiltrate        int64
h_pInfiltrate          int64
h_fotpDermis           int64
h_exocytosis           int64
h_acanthosis           int64
h_hyperkeratosis       int64
h_parakeratosis        int64
h_cotrRidges           int64
h_eotrRidges           int64
h_totsEpidermis        int64
h_sPustule             int64
h_mMicroabcess         int64
h_fHypergranulosis     int64
h_dotgLayer            int64
h_vndobLayer           int64
h_spongiosis           int64
h_saoRetes             int64
h_fhPlug               int64
h_pParakeratosis       int64
h_imInflitrate         int64
h_bInfiltrate          int64
c_age                 object
class                  int64
dtype: object

Display column statistics for

  • a sanity check, and
  • a better understanding of the data and its distribution characteristics.
In [3]:
# load column names
colnm = list(df)

# Display column statistics for sanity check
for i in range(0, len(colnm)):
    print df[colnm[i]].value_counts().sort_index(), '\n'
0      4
1     57
2    215
3     90
Name: c_erythema, dtype: int64 

0      8
1    111
2    195
3     52
Name: c_scaling, dtype: int64 

0     59
1     93
2    168
3     46
Name: c_dBorders, dtype: int64 

0    118
1     72
2    100
3     76
Name: c_itching, dtype: int64 

0    224
1     70
2     54
3     18
Name: c_kPhenomenon, dtype: int64 

0    297
1      1
2     41
3     27
Name: c_pPapules, dtype: int64 

0    333
1     11
2     16
3      6
Name: c_fPapules, dtype: int64 

0    299
1      9
2     45
3     13
Name: c_omInvolvement, dtype: int64 

0    251
1     28
2     64
3     23
Name: c_kneInvolvement, dtype: int64 

0    264
1     30
2     56
3     16
Name: c_sInvolvement, dtype: int64 

0    320
1     46
Name: c_fHistory, dtype: int64 

0    296
1      8
2     46
3     16
Name: h_mIncontinence, dtype: int64 

0    324
1     33
2      9
Name: h_eitInfiltrate, dtype: int64 

0    235
1     69
2     55
3      7
Name: h_pInfiltrate, dtype: int64 

0    312
1      8
2     23
3     23
Name: h_fotpDermis, dtype: int64 

0    118
1     57
2    129
3     62
Name: h_exocytosis, dtype: int64 

0     10
1     71
2    210
3     75
Name: h_acanthosis, dtype: int64 

0    227
1     90
2     44
3      5
Name: h_hyperkeratosis, dtype: int64 

0     86
1    118
2    132
3     30
Name: h_parakeratosis, dtype: int64 

0    252
1     19
2     61
3     34
Name: h_cotrRidges, dtype: int64 

0    198
1     23
2     95
3     50
Name: h_eotrRidges, dtype: int64 

0    256
1     19
2     60
3     31
Name: h_totsEpidermis, dtype: int64 

0    296
1     38
2     26
3      6
Name: h_sPustule, dtype: int64 

0    286
1     37
2     33
3     10
Name: h_mMicroabcess, dtype: int64 

0    295
1     13
2     43
3     15
Name: h_fHypergranulosis, dtype: int64 

0    273
1     30
2     49
3     14
Name: h_dotgLayer, dtype: int64 

0    294
1      3
2     43
3     26
Name: h_vndobLayer, dtype: int64 

0    199
1     28
2     96
3     43
Name: h_spongiosis, dtype: int64 

0    294
1      5
2     40
3     27
Name: h_saoRetes, dtype: int64 

0    344
1     10
2      8
3      4
Name: h_fhPlug, dtype: int64 

0    345
1      4
2     13
3      4
Name: h_pParakeratosis, dtype: int64 

0     13
1     85
2    206
3     62
Name: h_imInflitrate, dtype: int64 

0    289
1      3
2     22
3     52
Name: h_bInfiltrate, dtype: int64 

0      1
10     7
12     3
13     2
15     2
16     5
17     5
18     9
19     6
20     8
21     3
22    15
23     3
24     2
25    14
26     3
27    16
28     5
29     3
30    13
31     2
32     6
33    12
34     8
35    14
36    16
37     2
38     3
39     2
40    17
      ..
42    10
43     4
44     5
45     7
46     6
47     6
48     5
49     1
50    17
51     7
52    11
53     2
55    14
56     5
57     2
58     1
60    11
61     2
62     7
63     1
64     1
65     2
67     1
68     1
7      4
70     4
75     1
8      7
9      2
?      8
Name: c_age, Length: 61, dtype: int64 

1    112
2     61
3     72
4     49
5     52
6     20
Name: class, dtype: int64 

Summary of Statistics

There are no invalid values for any attribute except Age (c_age), which has

  • eight rows with the value "?", and
  • one row with the value 0.

Missing and Invalid Values Handling

Therefore, these values will be cleaned and replaced with the average of the known ages as follows.

  • replace rows having "?" in the Age feature with 0 and change data type of the Age feature to integer
  • calculate average age for rows with age > 0.
  • replace rows having 0 age with average age.
In [4]:
# 1. replace rows having "?" in the Age feature with 0, and,
#    change data type of the Age feature to integers
df['c_age'] = df['c_age'].str.replace("?","0").astype(int)

# 2. calculate average age for rows with age > 0
total_age = 0
cnt = 0
for index, row in df.iterrows():
    if (row['c_age'] > 0):
        total_age = total_age + row['c_age']
        cnt = cnt + 1
average_age = int(round(total_age / cnt))

# 3. replace rows having 0 age with average age
df['c_age'] = df['c_age'].replace(0, average_age)
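
For reference, the same rule (treat both "?" and 0 as missing, then fill with the rounded mean of the valid ages) can also be expressed with pandas built-ins alone; a minimal, illustrative sketch on a toy series rather than the actual column:

# illustrative sketch of the same imputation rule on a toy series
toy = pd.Series(['22', '?', '0', '30'])
toy = pd.to_numeric(toy, errors='coerce').replace(0, np.nan)   # "?" and 0 become NaN
toy.fillna(int(round(toy.mean())))                             # -> [22, 26, 26, 30]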

Confirm that there are no more missing or invalid values for the Age feature

In [5]:
print 'More missing and invalid values for the Age feature:'
print '---------------------------------------------------'
any(df.c_age < 1)
More missing and invalid values for the Age feature:
---------------------------------------------------
Out[5]:
False

Data after pre-processing

In [6]:
# display properties
print "Dimensions of data set:"
print df.shape
print "\nData types of data set:"
print df.dtypes
Dimensions of data set:
(366, 35)

Data types of data set:
c_erythema            int64
c_scaling             int64
c_dBorders            int64
c_itching             int64
c_kPhenomenon         int64
c_pPapules            int64
c_fPapules            int64
c_omInvolvement       int64
c_kneInvolvement      int64
c_sInvolvement        int64
c_fHistory            int64
h_mIncontinence       int64
h_eitInfiltrate       int64
h_pInfiltrate         int64
h_fotpDermis          int64
h_exocytosis          int64
h_acanthosis          int64
h_hyperkeratosis      int64
h_parakeratosis       int64
h_cotrRidges          int64
h_eotrRidges          int64
h_totsEpidermis       int64
h_sPustule            int64
h_mMicroabcess        int64
h_fHypergranulosis    int64
h_dotgLayer           int64
h_vndobLayer          int64
h_spongiosis          int64
h_saoRetes            int64
h_fhPlug              int64
h_pParakeratosis      int64
h_imInflitrate        int64
h_bInfiltrate         int64
c_age                 int32
class                 int64
dtype: object

Randomly sample 5 rows of data for inspection after pre-processing

In [7]:
df.sample(5, random_state=999)
Out[7]:
c_erythema c_scaling c_dBorders c_itching c_kPhenomenon c_pPapules c_fPapules c_omInvolvement c_kneInvolvement c_sInvolvement ... h_dotgLayer h_vndobLayer h_spongiosis h_saoRetes h_fhPlug h_pParakeratosis h_imInflitrate h_bInfiltrate c_age class
125 2 2 1 1 0 0 0 0 0 0 ... 0 0 2 0 0 0 1 0 23 2
40 1 1 1 0 0 0 1 0 0 0 ... 0 0 3 0 0 0 1 0 51 2
90 3 2 1 3 0 0 0 0 0 0 ... 0 0 3 0 0 0 1 0 50 2
159 3 2 2 1 0 0 0 0 0 0 ... 0 0 3 0 0 0 2 0 47 2
4 2 3 2 2 2 2 0 2 0 0 ... 2 3 2 3 0 0 2 3 45 3

5 rows Ă— 35 columns

Encoding Categorical Feature

One-hot encoding will be applied to family history, the only categorical descriptive feature, which takes values of either 0 or 1. As a result, 2 new descriptive features, with family history and without family history, will replace family history.

One-hot Encoding

One-hot encoding converts a column whose values are categories of equal importance into several columns of binary values, so that algorithms do not misinterpret the category codes as ordered quantities; e.g. the values of observed family history, 1 (yes) and 0 (no), are of equal importance.

In [8]:
# apply one-hot encoding for column family history

# create 2 new columns for family history
df['c_wfHistory'] = df['c_fHistory']        # with family history
df['c_wofHistory'] = 1 - df['c_wfHistory']  # without family history

df[['c_fHistory','c_wfHistory','c_wofHistory']].sample(5, random_state=990)
Out[8]:
c_fHistory c_wfHistory c_wofHistory
337 0 0 1
274 1 1 0
271 0 0 1
133 0 0 1
129 0 0 1
In [9]:
# drop Family History since it is now split into 2 columns
df = df.drop(['c_fHistory'], axis=1)
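
As an aside, pandas.get_dummies would produce equivalent indicator columns in a single call; a small illustrative sketch on a toy series (its generated column names differ from the c_wfHistory / c_wofHistory names used above):

# illustrative only: get_dummies yields one 0/1 indicator column per observed value
fh = pd.Series([0, 1, 0, 0, 1], name='c_fHistory')
pd.get_dummies(fh, prefix='c_fHistory')   # columns: c_fHistory_0, c_fHistory_1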

Scaling of Descriptive Features

Perform min-max scaling of the descriptive features. Min-max scaling maps each feature x to (x - min(x)) / (max(x) - min(x)), so the 0-3 ordinal values become {0, 0.33, 0.67, 1} and Age is rescaled to the [0, 1] range.

In [10]:
from sklearn import preprocessing

# perform min-max scaling on the descriptive features only

# extract the descriptive features
Data = df.drop(['class'], axis=1)

# keep a copy of the column names
Data_df = Data.copy()

Data_scaler = preprocessing.MinMaxScaler()
Data = Data_scaler.fit_transform(Data)   # fit the scaler and transform in one step
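
One practical note: the scaler fitted above stores the per-feature minima and ranges of this data set, so future records should be transformed with the same fitted scaler rather than refitted. A minimal illustration using the first unscaled row (the same call would apply to genuinely new patient records):

# reuse the fitted scaler on new data; do not refit it
example_row = Data_df.iloc[[0]]      # stands in for a future, unscaled record
Data_scaler.transform(example_row)   # same [0, 1] scaling as above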

Display the scaled features with their column names

In [11]:
pd.DataFrame(Data, columns=Data_df.columns).sample(5, random_state=999)
Out[11]:
c_erythema c_scaling c_dBorders c_itching c_kPhenomenon c_pPapules c_fPapules c_omInvolvement c_kneInvolvement c_sInvolvement ... h_vndobLayer h_spongiosis h_saoRetes h_fhPlug h_pParakeratosis h_imInflitrate h_bInfiltrate c_age c_wfHistory c_wofHistory
125 0.666667 0.666667 0.333333 0.333333 0.000000 0.000000 0.000000 0.000000 0.0 0.0 ... 0.0 0.666667 0.0 0.0 0.0 0.333333 0.0 0.235294 0.0 1.0
40 0.333333 0.333333 0.333333 0.000000 0.000000 0.000000 0.333333 0.000000 0.0 0.0 ... 0.0 1.000000 0.0 0.0 0.0 0.333333 0.0 0.647059 0.0 1.0
90 1.000000 0.666667 0.333333 1.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 ... 0.0 1.000000 0.0 0.0 0.0 0.333333 0.0 0.632353 1.0 0.0
159 1.000000 0.666667 0.666667 0.333333 0.000000 0.000000 0.000000 0.000000 0.0 0.0 ... 0.0 1.000000 0.0 0.0 0.0 0.666667 0.0 0.588235 0.0 1.0
4 0.666667 1.000000 0.666667 0.666667 0.666667 0.666667 0.000000 0.666667 0.0 0.0 ... 1.0 0.666667 1.0 0.0 0.0 0.666667 1.0 0.558824 0.0 1.0

5 rows Ă— 35 columns

Build Models by Selecting the Best Features & Parameters

Split data into Train & Test sets

In [12]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split

# Extract the target feature
target = df['class']

testsize = 0.3
y = np.array(target)
X = np.array(Data)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, \
                                                    test_size=testsize)

Prepare Pipelines for Chaining Processes

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

K-nearest Neighbors

In [14]:
from sklearn.neighbors import KNeighborsClassifier

cv_method = RepeatedStratifiedKFold(n_splits=5, 
                                    n_repeats=3, 
                                    random_state=999)

pipe_KNN = Pipeline([('fselector', SelectKBest(score_func=f_classif)), 
                     ('knn', KNeighborsClassifier())])

params_pipe_KNN = {'fselector__k': [10, 20, Data.shape[1]],
                   'knn__n_neighbors': range(1,5),
                   'knn__p': range(1,2)}
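# note: range(1, 2) yields only [1], so only the Manhattan distance (p=1) is searched here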
 
gs_pipe_KNN = GridSearchCV(pipe_KNN, 
                           params_pipe_KNN, 
                           cv=cv_method,
                           scoring='accuracy', 
                           refit='accuracy',
                           verbose=1)

Fit the pipeline to the training data, then show the best parameters and score

In [15]:
gs_pipe_KNN.fit(X_train, y_train)
Fitting 15 folds for each of 12 candidates, totalling 180 fits
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:    2.2s finished
Out[15]:
GridSearchCV(cv=<sklearn.model_selection._split.RepeatedStratifiedKFold object at 0x00000000098C5748>,
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('fselector', SelectKBest(k=10, score_func=<function f_classif at 0x0000000009825A58>)), ('knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'knn__p': [1], 'fselector__k': [10, 20, 35L], 'knn__n_neighbors': [1, 2, 3, 4]},
       pre_dispatch='2*n_jobs', refit='accuracy',
       return_train_score='warn', scoring='accuracy', verbose=1)
In [16]:
gs_pipe_KNN.best_params_
Out[16]:
{'fselector__k': 35L, 'knn__n_neighbors': 2, 'knn__p': 1}
In [17]:
gs_pipe_KNN.best_score_
Out[17]:
0.9622395833333334

Classification Report and Confusion Matrix

In [18]:
print "Detailed classification report:","\n"
y_true, y_pred = y_test, gs_pipe_KNN.predict(X_test)
cr_KNN = classification_report(y_true, y_pred)
print cr_KNN

cm_Knn = confusion_matrix(y_test, y_pred)
print "\n", cm_Knn

cer_KNN = "classification error rate: {:.5f}"\
.format(1 - gs_pipe_KNN.score(X_test, y_test))
print "\n", cer_KNN
Detailed classification report: 

             precision    recall  f1-score   support

          1       0.96      1.00      0.98        27
          2       0.83      0.95      0.88        20
          3       1.00      1.00      1.00        21
          4       1.00      0.78      0.88        18
          5       1.00      1.00      1.00        16
          6       1.00      1.00      1.00         8

avg / total       0.96      0.95      0.95       110


[[27  0  0  0  0  0]
 [ 1 19  0  0  0  0]
 [ 0  0 21  0  0  0]
 [ 0  4  0 14  0  0]
 [ 0  0  0  0 16  0]
 [ 0  0  0  0  0  8]]

classification error rate: 0.04545

Decision Tree

In [19]:
from sklearn.tree import DecisionTreeClassifier

cv_method = RepeatedStratifiedKFold(n_splits=5, 
                                    n_repeats=3, 
                                    random_state=999)

# Separate descriptive features and target feature
Data = df.drop(['class'], axis=1)
target = df['class']

pipe_DT = Pipeline([('fselector', SelectKBest(score_func=f_classif)), 
                     ('dt', DecisionTreeClassifier())])

params_pipe_DT = {'fselector__k': [10, 20, Data.shape[1]],
                   'dt__max_depth': range(1,5),
                   'dt__criterion': ['gini', 'entropy']}
 
gs_pipe_DT = GridSearchCV(pipe_DT, 
                           params_pipe_DT, 
                           cv=cv_method,
                           scoring='accuracy', 
                           refit='accuracy',
                           verbose=1)

Fit the pipeline to the training data, then show the best parameters and score

In [20]:
gs_pipe_DT.fit(X_train, y_train)
Fitting 15 folds for each of 24 candidates, totalling 360 fits
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:    2.8s finished
Out[20]:
GridSearchCV(cv=<sklearn.model_selection._split.RepeatedStratifiedKFold object at 0x00000000099547B8>,
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('fselector', SelectKBest(k=10, score_func=<function f_classif at 0x0000000009825A58>)), ('dt', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'fselector__k': [10, 20, 35], 'dt__max_depth': [1, 2, 3, 4], 'dt__criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit='accuracy',
       return_train_score='warn', scoring='accuracy', verbose=1)
In [21]:
gs_pipe_DT.best_params_
Out[21]:
{'dt__criterion': 'entropy', 'dt__max_depth': 4, 'fselector__k': 35}
In [22]:
gs_pipe_DT.best_score_
Out[22]:
0.9075520833333334

Classification Report and Confusion Matrix

In [23]:
print "Detailed classification report:","\n"
y_true, y_pred = y_test, gs_pipe_DT.predict(X_test)
cr_DT = classification_report(y_true, y_pred)
print cr_DT

cm_DT = confusion_matrix(y_test, y_pred)
print "\n", cm_DT

cer_DT = "classification error rate: {:.5f}"\
.format(1 - gs_pipe_DT.score(X_test, y_test))
print "\n", cer_DT
Detailed classification report: 

             precision    recall  f1-score   support

          1       0.96      1.00      0.98        27
          2       0.79      0.95      0.86        20
          3       1.00      1.00      1.00        21
          4       0.94      0.89      0.91        18
          5       1.00      0.94      0.97        16
          6       1.00      0.62      0.77         8

avg / total       0.94      0.94      0.94       110


[[27  0  0  0  0  0]
 [ 0 19  0  1  0  0]
 [ 0  0 21  0  0  0]
 [ 0  2  0 16  0  0]
 [ 0  1  0  0 15  0]
 [ 1  2  0  0  0  5]]

classification error rate: 0.06364

Random Forest

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import RepeatedStratifiedKFold

cv_method = RepeatedStratifiedKFold(n_splits=5, 
                                    n_repeats=3, 
                                    random_state=999)

# Separate descriptive features and target feature
Data = df.drop(['class'], axis=1)
target = df['class']

pipe_RF = Pipeline([('fselector', SelectKBest(score_func=f_classif)), 
                     ('rf', RandomForestClassifier())])

params_pipe_RF = {'fselector__k': [10, 20, Data.shape[1]],
                   'rf__max_depth': range(1,5),
                   'rf__criterion': ['gini', 'entropy']}
 
gs_pipe_RF = GridSearchCV(pipe_RF, 
                           params_pipe_RF, 
                           cv=cv_method,
                           scoring='accuracy', 
                           refit='accuracy',
                           verbose=1)

Fit the pipeline to the training data, then show the best parameters and score

In [25]:
gs_pipe_RF.fit(X_train, y_train)
Fitting 15 folds for each of 24 candidates, totalling 360 fits
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:   15.2s finished
Out[25]:
GridSearchCV(cv=<sklearn.model_selection._split.RepeatedStratifiedKFold object at 0x000000000994E208>,
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('fselector', SelectKBest(k=10, score_func=<function f_classif at 0x0000000009825A58>)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'fselector__k': [10, 20, 35], 'rf__max_depth': [1, 2, 3, 4], 'rf__criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit='accuracy',
       return_train_score='warn', scoring='accuracy', verbose=1)
In [26]:
gs_pipe_RF.best_params_
Out[26]:
{'fselector__k': 35, 'rf__criterion': 'entropy', 'rf__max_depth': 4}
In [27]:
gs_pipe_RF.best_score_
Out[27]:
0.93359375

Classification Report and Confusion Matrix

In [28]:
print "Detailed classification report:","\n"
y_true, y_pred = y_test, gs_pipe_RF.predict(X_test)
cr_RF = classification_report(y_true, y_pred)
print cr_RF

cm_RF = confusion_matrix(y_test, y_pred)
print "\n", cm_RF

cer_RF = "classification error rate: {:.5f}"\
.format(1 - gs_pipe_RF.score(X_test, y_test))
print "\n", cer_RF
Detailed classification report: 

             precision    recall  f1-score   support

          1       0.93      1.00      0.96        27
          2       0.82      0.90      0.86        20
          3       1.00      1.00      1.00        21
          4       0.93      0.78      0.85        18
          5       1.00      1.00      1.00        16
          6       1.00      0.88      0.93         8

avg / total       0.94      0.94      0.94       110


[[27  0  0  0  0  0]
 [ 1 18  0  1  0  0]
 [ 0  0 21  0  0  0]
 [ 0  4  0 14  0  0]
 [ 0  0  0  0 16  0]
 [ 1  0  0  0  0  7]]

classification error rate: 0.06364

Classification Report Analysis

K-nearest Neighbors is the best model for the dermatology data set, producing the highest precision (0.96), recall (0.95) and f1-score (0.95).

In [29]:
import matplotlib.pyplot as plt
%matplotlib inline

crData = [[0.96, 0.93, 0.94], [0.95, 0.93, 0.94], [0.95, 0.94, 0.93]]
q1a = pd.DataFrame(crData, columns=['KNN','DTree','Random Forest'])

q1a.plot(kind='bar', fontsize=10)
plt.xlabel('\nType of Score', fontsize=12)
plt.title('Classification Report Analysis\n', fontsize=14)
plt.ylabel('Score', fontsize=12)

a = np.asarray([0,1,2])
labels = ['Precision', 'Recall', 'f1-score']
plt.xticks(a,labels,rotation='horizontal')
Out[29]:
([<matplotlib.axis.XTick at 0xc87aac8>,
  <matplotlib.axis.XTick at 0xc20a6d8>,
  <matplotlib.axis.XTick at 0xc2464a8>],
 <a list of 3 Text xticklabel objects>)

Classification Error Rate Analysis

K-nearest Neighbors is the best model for the dermatology data set, having the lowest classification error rate of 4.545%.

In [30]:
import matplotlib.pyplot as plt
%matplotlib inline

cerData = [[4.545, 6.364, 6.364]]
q1a = pd.DataFrame(cerData, columns=['KNN','DTree','Random Forest'])

q1a.unstack().plot(kind='bar', fontsize=10)
plt.xlabel('\nModels', fontsize=12)
plt.title('Classification Error Rate Analysis\n', fontsize=14)
plt.ylabel('Error %', fontsize=12)

a = np.asarray([0,1,2])
labels = ['KNN', 'Decision Tree', 'Random Forest']
plt.xticks(a,labels,rotation='horizontal')
Out[30]:
([<matplotlib.axis.XTick at 0xc7ab4e0>,
  <matplotlib.axis.XTick at 0xc7abda0>,
  <matplotlib.axis.XTick at 0xc7c7e48>],
 <a list of 3 Text xticklabel objects>)

Comparing Performance of Classifiers Using Paired T-Tests

In [31]:
from sklearn.model_selection import cross_val_score

cv_method_ttest = RepeatedStratifiedKFold(n_splits=5, 
                                          n_repeats=5, 
                                          random_state=999)

cv_results_KNN = cross_val_score(gs_pipe_KNN.best_estimator_,
                                 X_test,
                                 y_test, 
                                 cv=cv_method_ttest, 
                                 scoring='accuracy')
cv_results_KNN.mean().round(3)
Out[31]:
0.937
In [32]:
cv_results_DT = cross_val_score(gs_pipe_DT.best_estimator_,
                                X_test,
                                y_test, 
                                cv=cv_method_ttest, 
                                scoring='accuracy')
cv_results_DT.mean().round(3)
Out[32]:
0.879
In [33]:
cv_results_RF = cross_val_score(gs_pipe_RF.best_estimator_,
                                X_test,
                                y_test, 
                                cv=cv_method_ttest, 
                                scoring='accuracy')
cv_results_RF.mean().round(3)
Out[33]:
0.916

Perform Paired T-Tests

In [34]:
from scipy import stats

print(stats.ttest_rel(cv_results_DT, cv_results_KNN).pvalue.round(3))
0.0

The p-value of the paired t-test rounds to 0.0, i.e. it is < 0.05. Hence, at the 95% confidence level, the difference between the Decision Tree model and the K-nearest Neighbors model is statistically significant. We can conclude that the K-nearest Neighbors model performs better than the Decision Tree model on the dermatology data set.

In [35]:
print(stats.ttest_rel(cv_results_DT, cv_results_RF).pvalue.round(3))
0.023

The p-value of the paired t-test is 0.023, i.e. < 0.05. Hence, at the 95% confidence level, the difference between the Decision Tree model and the Random Forest model is statistically significant. We can conclude that the Random Forest model performs better than the Decision Tree model on the dermatology data set.

In [36]:
print(stats.ttest_rel(cv_results_RF, cv_results_KNN).pvalue.round(3))
0.107

The p-value of the paired t-test is 0.107 > 0.05. Hence, we conclude that, at the 95% confidence level, the difference between the Random Forest model and the K-nearest Neighbors model is not statistically significant and the performance of these two classifiers is comparable on the dermatology data set.

Paired T-Tests Result

The paired t-test results show that the Decision Tree model is the worst-performing model, while the Random Forest model is comparable to the K-nearest Neighbors model. Consequently, the Random Forest and K-nearest Neighbors models are jointly the best for this dermatology data set.

Limitations

Since there are only 366 rows of observations in the dermatology data set, it might not adequately represent the population. With such a small amount of data, there is also a risk of overfitting, i.e. the final model may predict poorly on unseen data. Moreover, it was observed in Phase I that more data is required for ages 65 and upwards to confirm the abnormality of increasing dermatology classes observed from age 65 to 70.

Summary

This project aims to determine the type of Erythemato-Squamous Disease, i.e. the class of dermatology, based on the values of both the clinical and histopathological features of a patient.

First, the dermatology data set is cleaned, encoded and scaled.

  • Although the data set is quite clean, the Age descriptive feature contains one invalid value (0) and eight missing values. They are replaced by the average of the remaining valid Age values during the cleansing process.
  • The only categorical feature, family history, is converted with one-hot encoding to remove misinterpretation by some algorithms.
  • As all the ordinal features are already integer-encoded, no further encoding is required.
  • Min-max scaling is then performed on all the descriptive features.

Now that the data set is ready, it is split into two portions for training (70%) and testing (30%). Predictive models are then built with the three classifiers, K-nearest Neighbors, Decision Tree and Random Forest, fine-tuned and compared.

According to the classification reports and classification error rates, K-nearest Neighbors is the best model. However, the paired t-test results indicate that Random Forest and K-nearest Neighbors are jointly the best models for the dermatology data set, as the difference in their performance is statistically insignificant.
