According to the dermatology.names file, "The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very little differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris.
Usually a biopsy is necessary for the diagnosis but unfortunately these diseases share many histopathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of another disease at the beginning stage and may have the characteristic features at the following stages. Patients were first evaluated clinically with 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the samples under a microscope."
This project is to determine the type of Erythemato-Squamous Disease, i.e. the class of dermatology, based on the values of both the clinical and histopathological features of the patient. The data sets were sourced from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/dermatology. This project has two phases. Phase I focuses on data preprocessing and exploration. Phase II emphasizes feature selection, model building, evaluation, tuning and selection of the best model. This is the Phase II report, organized as follows. Section 1 describes the data sets and their attributes. Section 2 covers data pre-processing. In Section 3, I will perform feature selection, then build, tune, evaluate and select the best models. The last section presents limitations and a brief summary. Compiled from a Jupyter Notebook, this report contains both the narrative and the Python code used for data pre-processing and the Phase II activities.
Comprising 32 integer-encoded ordinal features, 1 linear feature (Age) and 1 categorical descriptive feature, this data set needs little feature scaling; min-max scaling will nevertheless be applied so that the wide-ranging Age feature does not dominate the 0-to-3 ordinal features in distance-based models such as K-nearest Neighbors. Since the single categorical descriptive feature (family history) has values of either 0 or 1, it will be one-hot encoded into 2 columns, with family history and without family history.
I will utilize the following classifiers to predict the class of dermatology: K-nearest Neighbors, Decision Tree and Random Forest.
The dermatology data set is mostly clean, with only some missing and invalid values in the Age feature. These values will therefore be replaced by the mean of the remaining Age values. Moreover, the single categorical descriptive feature, family history, will be converted with one-hot encoding to avoid misinterpretation by some algorithms. The whole data set will first be split into descriptive and target features. These will then be further split into two portions, 70% for model training and 30% for testing. The same data portions will be used by all three classifiers for a fair performance comparison.
I will consider 10, 20 and the full set of features for best-feature selection. Together with the hyperparameter search, this will identify the best set of parameters for building the best-performing K-nearest Neighbors, Decision Tree and Random Forest models. Classification reports, confusion matrices and paired t-tests are used for revealing and comparing model performance. Since cross-validation is a random process, the paired t-test, a statistical test, is adopted to determine whether a difference in model performance is statistically significant. Finally, pipelines will be used for stacking processes such as feature selection, hyperparameter tuning via cross-validation and model fitting.
The UCI Machine Learning Repository provides two data sets, dermatology.data and dermatology.names. dermatology.data contains 366 rows of data. There are altogether thirty-four attributes (descriptive features) in this database.
Thirty-two of them are ordinal, valued from 0 to 3, where 0 indicates the feature is not present, 3 indicates the largest amount possible, and 1 and 2 are the relative intermediate values. One of them (Age) is linear.
One of them is nominal with value either 0 or 1, where 1 indicates that at least one of these diseases has been observed in the patient's family history and 0 otherwise.
There is also one target feature with values from 1 to 6, each representing a different class of dermatology (1: psoriasis, 2: seborrheic dermatitis, 3: lichen planus, 4: pityriasis rosea, 5: chronic dermatitis, 6: pityriasis rubra pilaris).
The average value of the 358 known Age values will replace the eight missing values for this attribute.
This data set will be used for both training and testing in Phase II, when models are built and the best one is selected for classifying the dermatology class of future patients.
Dermatology class is the target feature of this data set, summarized as follows.
class: dermatology class code, 1 to 6.
There are 12 clinical and 22 histopathological attributes, altogether thirty-four descriptive features, with descriptions from the dermatology.names file as follows.
Clinical Attributes: (take values 0, 1, 2, 3, unless otherwise indicated)
c_erythema: erythema.
c_scaling: scaling.
c_dBorders: definite borders.
c_itching: itching.
c_kPhenomenon: koebner phenomenon.
c_pPapules: polygonal papules.
c_fPapules: follicular papules.
c_omInvolvement: oral mucosal involvement.
c_kneInvolvement: knee and elbow involvement.
c_sInvolvement: scalp involvement.
c_fHistory: family history (0 or 1).
c_age: age (linear).
Histopathological Attributes: (take values 0, 1, 2, 3)
h_mIncontinence: melanin incontinence.
h_eitInfiltrate: eosinophils in the infiltrate.
h_pInfiltrate: PNL infiltrate.
h_fotpDermis: fibrosis of the papillary dermis.
h_exocytosis: exocytosis.
h_acanthosis: acanthosis.
h_hyperkeratosis: hyperkeratosis.
h_parakeratosis: parakeratosis.
h_cotrRidges: clubbing of the rete ridges.
h_eotrRidges: elongation of the rete ridges.
h_totsEpidermis: thinning of the suprapapillary epidermis.
h_sPustule: spongiform pustule.
h_mMicroabcess: Munro microabscess.
h_fHypergranulosis: focal hypergranulosis.
h_dotgLayer: disappearance of the granular layer.
h_vndobLayer: vacuolisation and damage of basal layer.
h_spongiosis: spongiosis.
h_saoRetes: saw-tooth appearance of retes.
h_fhPlug: follicular horn plug.
h_pParakeratosis: perifollicular parakeratosis.
h_imInflitrate: inflammatory mononuclear infiltrate.
h_bInfiltrate: band-like infiltrate.
The data sets are downloaded from the URL onto the local machine for review. Since the data set dermatology.data comes without headers, it will be loaded via the URL with header=None and the columns renamed as below. The data set dermatology.names will be referenced for the meaning of the attributes whenever necessary.
import pandas as pd
import numpy as np
# Read Dermatology CSV data from url
url="http://archive.ics.uci.edu/ml/machine-learning-databases/\
dermatology/dermatology.data"
df = pd.read_csv(url,header=None)
# Rename columns
df.columns = ['c_erythema','c_scaling','c_dBorders','c_itching',
'c_kPhenomenon','c_pPapules','c_fPapules','c_omInvolvement',
'c_kneInvolvement','c_sInvolvement','c_fHistory',
'h_mIncontinence','h_eitInfiltrate','h_pInfiltrate',
'h_fotpDermis','h_exocytosis','h_acanthosis',
'h_hyperkeratosis','h_parakeratosis','h_cotrRidges',
'h_eotrRidges','h_totsEpidermis','h_sPustule','h_mMicroabcess',
'h_fHypergranulosis','h_dotgLayer','h_vndobLayer','h_spongiosis',
'h_saoRetes','h_fhPlug','h_pParakeratosis','h_imInflitrate',
'h_bInfiltrate','c_age','class']
First, we confirmed that the feature types matched the description as outlined in the documentation.
# Check and change data type for each column if necessary
print "Dimensions of data set:"
print df.shape
print "\nData types of data set:"
df.dtypes
# load column names
colnm = list(df)
# Display column statistics for sanity check
for i in range(0, len(colnm)):
    print(df[colnm[i]].value_counts().sort_index(), '\n')
There are no invalid values in any attribute except Age (c_age), which has eight values recorded as "?".
Therefore, these values will be cleaned and replaced with the average of the known ages, as follows.
# 1. replace rows having "?" in the Age feature with 0, and
#    change the data type of the Age feature to integers
df['c_age'] = df['c_age'].str.replace("?", "0", regex=False).astype(int)
# 2. calculate average age for rows with age > 0
total_age = 0
cnt = 0
for index, row in df.iterrows():
    if (row['c_age'] > 0):
        total_age = total_age + row['c_age']
        cnt = cnt + 1
average_age = int(round(total_age / cnt))
# 3. replace rows having 0 age with average age
df['c_age'] = df['c_age'].replace(0, average_age)
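For reference, the same imputation can be written without an explicit loop. The following is a minimal sketch of an alternative to steps 1 to 3 above, assuming, as the loop above effectively does, that an age of 0 never occurs as a genuine value.
# Alternative sketch: coerce non-numeric entries ("?") to NaN, then fill
# them with the rounded mean of the known ages. Intended for the raw column
# as loaded; running it on the already-cleaned column is a harmless no-op.
age = pd.to_numeric(df['c_age'], errors='coerce')
df['c_age'] = age.fillna(int(round(age.mean()))).astype(int)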
Confirm that there are no missing or invalid values left in the Age feature.
print('More missing and invalid values for the Age feature:')
print('---------------------------------------------------')
any(df.c_age < 1)
# display properties
print "Dimensions of data set:"
print df.shape
print "\nData types of data set:"
print df.dtypes
Randomly show 5 rows of data for inspection after pre-processing.
df.sample(5, random_state=999)
One-hot encoding will be applied to family history, the only categorical descriptive feature, whose values are either 0 or 1. As a result, 2 new descriptive features, with family history and without family history, will replace family history.
One-hot encoding converts a column whose values are categories of equal importance into several columns of binary values, to avoid misinterpretation by some algorithms; e.g. the values of observed family history, 1 (yes) and 0 (no), are categories of equal importance rather than ordered values.
# apply one-hot encoding for column family history
# create 2 new columns for family history
df['c_wfHistory'] = df['c_fHistory'] # with family history
df['c_wofHistory'] = 1 - df['c_wfHistory'] # without family history
df[['c_fHistory','c_wfHistory','c_wofHistory']].sample(5, random_state=990)
# drop Family History since it is now split into 2 columns
df = df.drop(['c_fHistory'], axis=1)
from sklearn import preprocessing
# perform min-max scaling on the descriptive features only
# extract the descriptive features
Data = df.drop(['class'], axis=1)
# keep a copy of the column names
Data_df = Data.copy()
Data_scaler = preprocessing.MinMaxScaler()
# fit the scaler and transform the descriptive features in one step
Data = Data_scaler.fit_transform(Data)
pd.DataFrame(Data, columns=Data_df.columns).sample(5, random_state=999)
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
# Extract the target feature
target = df['class']
testsize = 0.3
y = np.array(target)
X = np.array(Data)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, \
test_size=testsize)
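A caveat worth noting, offered as a hedged alternative rather than the approach taken here: the min-max scaler above was fitted on the full data set before splitting, so the test rows influence the scaling parameters. The sketch below fits the scaler on the training portion only, and also stratifies the split so the six class proportions stay similar in both portions (names with a 2 suffix are hypothetical, to avoid clobbering the variables used above).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# split the raw (unscaled) features first, keeping class ratios via stratify
X_raw = np.array(Data_df)
y2 = np.array(target)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X_raw, y2, test_size=testsize, random_state=0, stratify=y2)
# learn min/max from the training rows only, then apply to both portions
scaler2 = MinMaxScaler().fit(X_train2)
X_train2 = scaler2.transform(X_train2)
X_test2 = scaler2.transform(X_test2)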
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=3,
random_state=999)
pipe_KNN = Pipeline([('fselector', SelectKBest(score_func=f_classif)),
('knn', KNeighborsClassifier())])
params_pipe_KNN = {'fselector__k': [10, 20, Data.shape[1]],
'knn__n_neighbors': range(1, 5),
'knn__p': range(1, 3)}  # p=1 (Manhattan) and p=2 (Euclidean)
gs_pipe_KNN = GridSearchCV(pipe_KNN,
params_pipe_KNN,
cv=cv_method,
scoring='accuracy',
refit='accuracy',
verbose=1)
Fit data with pipeline, show best parameters and score
gs_pipe_KNN.fit(X_train, y_train)
gs_pipe_KNN.best_params_
gs_pipe_KNN.best_score_
Classification Report and Confusion Matrix
print "Detailed classification report:","\n"
y_true, y_pred = y_test, gs_pipe_KNN.predict(X_test)
cr_KNN = classification_report(y_true, y_pred)
print cr_KNN
cm_Knn = confusion_matrix(y_test, y_pred)
print "\n", cm_Knn
cer_KNN = "classification error rate: {:.5f}"\
.format(1 - gs_pipe_KNN.score(X_test, y_test))
print "\n", cer_KNN
from sklearn.tree import DecisionTreeClassifier
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=3,
random_state=999)
# Separate descriptive features and target feature
Data = df.drop(['class'], axis=1)
target = df['class']
pipe_DT = Pipeline([('fselector', SelectKBest(score_func=f_classif)),
('dt', DecisionTreeClassifier())])
params_pipe_DT = {'fselector__k': [10, 20, Data.shape[1]],
'dt__max_depth': range(1,5),
'dt__criterion': ['gini', 'entropy']}
gs_pipe_DT = GridSearchCV(pipe_DT,
params_pipe_DT,
cv=cv_method,
scoring='accuracy',
refit='accuracy',
verbose=1)
Fit data with pipeline, show best parameters and score
gs_pipe_DT.fit(X_train, y_train)
gs_pipe_DT.best_params_
gs_pipe_DT.best_score_
Classification Report and Confusion Matrix
print "Detailed classification report:","\n"
y_true, y_pred = y_test, gs_pipe_DT.predict(X_test)
cr_DT = classification_report(y_true, y_pred)
print cr_DT
cm_DT = confusion_matrix(y_test, y_pred)
print "\n", cm_DT
cer_DT = "classification error rate: {:.5f}"\
.format(1 - gs_pipe_DT.score(X_test, y_test))
print "\n", cer_DT
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import RepeatedStratifiedKFold
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=3,
random_state=999)
# Separate descriptive features and target feature
Data = df.drop(['class'], axis=1)
target = df['class']
pipe_RF = Pipeline([('fselector', SelectKBest(score_func=f_classif)),
('rf', RandomForestClassifier())])
params_pipe_RF = {'fselector__k': [10, 20, Data.shape[1]],
'rf__max_depth': range(1,5),
'rf__criterion': ['gini', 'entropy']}
gs_pipe_RF = GridSearchCV(pipe_RF,
params_pipe_RF,
cv=cv_method,
scoring='accuracy',
refit='accuracy',
verbose=1)
Fit data with pipeline, show best parameters and score
gs_pipe_RF.fit(X_train, y_train)
gs_pipe_RF.best_params_
gs_pipe_RF.best_score_
Classification Report and Confusion Matrix
print "Detailed classification report:","\n"
y_true, y_pred = y_test, gs_pipe_RF.predict(X_test)
cr_RF = classification_report(y_true, y_pred)
print cr_RF
cm_RF = confusion_matrix(y_test, y_pred)
print "\n", cm_RF
cer_RF = "classification error rate: {:.5f}"\
.format(1 - gs_pipe_RF.score(X_test, y_test))
print "\n", cer_RF
K-nearest Neighbors is the best model for the dermatology data set, producing the highest precision (0.96), recall (0.95) and f1-score (0.95).
import matplotlib.pyplot as plt
%matplotlib inline
crData = [[0.96, 0.93, 0.94], [0.95, 0.93, 0.94], [0.95, 0.94, 0.93]]
q1a = pd.DataFrame(crData, columns=['KNN','DTree','Random Forest'])
q1a.plot(kind='bar', fontsize=10)
plt.xlabel('\nType of Score', fontsize=12)
plt.title('Classification Report Analysis\n', fontsize=14)
plt.ylabel('Score', fontsize=12)
a = np.asarray([0,1,2])
labels = ['Precision', 'Recall', 'f1-score']
plt.xticks(a,labels,rotation='horizontal')
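The scores in crData above were transcribed by hand from the three classification reports. As a hedged alternative, they could be pulled programmatically to avoid transcription slips; the sketch below assumes the figures are the weighted averages from the reports, and scikit-learn >= 0.20 for output_dict.
# build the same precision/recall/f1 table directly from the fitted models
from sklearn.metrics import classification_report
scores = {}
for name, model in [('KNN', gs_pipe_KNN), ('DTree', gs_pipe_DT),
                    ('Random Forest', gs_pipe_RF)]:
    rep = classification_report(y_test, model.predict(X_test),
                                output_dict=True)['weighted avg']
    scores[name] = [rep['precision'], rep['recall'], rep['f1-score']]
crData_auto = pd.DataFrame(scores, index=['Precision', 'Recall', 'f1-score'])
print(crData_auto.round(2))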
K-nearest Neighbors is the best model for the dermatology data set, with the lowest classification error rate, 4.545%.
import matplotlib.pyplot as plt
%matplotlib inline
cerData = [[4.545, 6.364, 6.364]]
q1a = pd.DataFrame(cerData, columns=['KNN','DTree','Random Forest'])
q1a.unstack().plot(kind='bar', fontsize=10)
plt.xlabel('\nModels', fontsize=12)
plt.title('Classification Error Rate Analysis\n', fontsize=14)
plt.ylabel('Error %', fontsize=12)
a = np.asarray([0,1,2])
labels = ['KNN', 'Decision Tree', 'Random Forest']
plt.xticks(a,labels,rotation='horizontal')
from sklearn.model_selection import cross_val_score
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=999)
cv_results_KNN = cross_val_score(gs_pipe_KNN.best_estimator_,
X_test,
y_test,
cv=cv_method_ttest,
scoring='accuracy')
cv_results_KNN.mean().round(3)
cv_results_DT = cross_val_score(gs_pipe_DT.best_estimator_,
X_test,
y_test,
cv=cv_method_ttest,
scoring='accuracy')
cv_results_DT.mean().round(3)
cv_results_RF = cross_val_score(gs_pipe_RF.best_estimator_,
X_test,
y_test,
cv=cv_method_ttest,
scoring='accuracy')
cv_results_RF.mean().round(3)
from scipy import stats
print(stats.ttest_rel(cv_results_DT, cv_results_KNN).pvalue.round(3))
The p-value of the paired t-test rounds to 0.000, i.e. < 0.05. Hence, at the 95% confidence level, the difference between the Decision Tree model and the K-nearest Neighbors model is statistically significant. We can conclude that the K-nearest Neighbors model performs better than the Decision Tree model on the dermatology data set.
print(stats.ttest_rel(cv_results_DT, cv_results_RF).pvalue.round(3))
The p-value of the paired t-test is 0.023, i.e. < 0.05. Hence, at the 95% confidence level, the difference between the Decision Tree model and the Random Forest model is statistically significant. We can conclude that the Random Forest model performs better than the Decision Tree model on the dermatology data set.
print(stats.ttest_rel(cv_results_RF, cv_results_KNN).pvalue.round(3))
The p-value of the paired t-test is 0.107 > 0.05. Hence, at the 95% confidence level, the difference between the Random Forest model and the K-nearest Neighbors model is not statistically significant, and the performance of these two classifiers is comparable on the dermatology data set.
The paired t-test results show that the Decision Tree model is the worst-performing model, while the Random Forest model is comparable to the K-nearest Neighbors model. Consequently, the Random Forest and K-nearest Neighbors models are jointly the best for this dermatology data set.
Since there are only 366 rows of observations in the dermatology data set, it might not be representative of the population. With such a small amount of data, I am also running the risk of overfitting, i.e. of a final model with poor predictive power on unseen data. Moreover, it was observed in Phase I that more data is required for ages 65 and upwards to confirm the apparent anomaly of increasing dermatology classes between ages 65 and 70.
This project is to determine the type of Erythemato-Squamous Disease, i.e. the class of dermatology, based on the values of both the clinical and histopathological features of the patient.
First, the dermatology data set was cleaned, encoded and scaled.
The prepared data set was then split into two portions, for training (70%) and testing (30%). Predictive models were built upon the three classifiers, K-nearest Neighbors, Decision Tree and Random Forest, then fine-tuned and compared.
According to the classification reports and classification error rates, K-nearest Neighbors is the best model. However, the paired t-test results indicate that Random Forest and K-nearest Neighbors are jointly the best models for the dermatology data set, as their difference in performance is statistically insignificant.