According to the dermatology.names file, "The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very little differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris.
Usually a biopsy is necessary for the diagnosis but unfortunately these diseases share many histopathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of another disease at the beginning stage and may have the characteristic features at the following stages. Patients were first evaluated clinically with 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the sample under a microscope."
This project aims to determine the type of erythemato-squamous disease (the dermatology class) from the values of both the clinical and histopathological features of the patient. The data sets were sourced from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/dermatology. This project has two phases. Phase I, covered in this report, focuses on data pre-processing and exploration. Phase II will present model building. The rest of this report is organised as follows. Section 2 describes the data sets and their attributes. Section 3 covers data pre-processing. Section 4 explores each attribute and the relationships between attributes. The last section presents a brief summary. Compiled from a Jupyter Notebook, this report contains both the narrative and the Python code used for data pre-processing and exploration.
The UCI Machine Learning Repository provides two files, dermatology.data and dermatology.names. dermatology.data contains 366 rows of data. There are altogether thirty-four attributes (descriptive features) in this database.
Thirty-two of them are ordinal, valued from 0 to 3, where 0 indicates that the feature is not present, 3 indicates the largest amount possible, and 1 and 2 are the relative intermediate values.
One of them (family history) is nominal, taking 0 or 1, where 1 indicates that at least one of these diseases has been observed in the patient's family and 0 indicates otherwise.
The remaining one (Age) is linear.
There is also one target feature with values from 1 to 6, each representing a different class of dermatology.
Age is the only attribute with missing values. The average of the 358 known Age values will replace the eight missing values.
This data set will be used for both training and testing in Phase II, when models will be built and the best one selected for classifying the dermatology class of future patients.
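The value ranges described above can be checked mechanically. The following is a minimal sketch, not part of the original notebook: the helper name find_out_of_range and the toy frame are hypothetical, and only three of the thirty-five columns are shown.

```python
import pandas as pd

def find_out_of_range(df, allowed):
    """Return {column: sorted unexpected values} for columns that violate `allowed`."""
    problems = {}
    for col, valid in allowed.items():
        # tolist() converts NumPy scalars to plain Python values
        unexpected = set(df[col].unique().tolist()) - set(valid)
        if unexpected:
            problems[col] = sorted(unexpected)
    return problems

# Toy frame standing in for dermatology.data (values are made up).
toy = pd.DataFrame({
    "c_erythema": [0, 1, 2, 3],
    "c_fHistory": [0, 1, 0, 2],   # 2 is deliberately invalid
    "class": [1, 6, 3, 4],
})
allowed = {
    "c_erythema": range(0, 4),    # ordinal 0..3
    "c_fHistory": (0, 1),         # nominal 0/1
    "class": range(1, 7),         # target 1..6
}
print(find_out_of_range(toy, allowed))
```

An empty dictionary would mean that every listed column stays within its documented range.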
Dermatology class is the target feature of this data set, summarized as follows.
class: dermatology class code 1 to 6.
There are 12 clinical and 22 histopathological attributes, altogether thirty-four descriptive features. Their descriptions, taken from the dermatology.names file, are as follows.
Clinical Attributes: (take values 0, 1, 2, 3, unless otherwise indicated)
c_erythema: erythema.
c_scaling: scaling.
c_dBorders: definite borders.
c_itching: itching.
c_kPhenomenon: koebner phenomenon.
c_pPapules: polygonal papules.
c_fPapules: follicular papules.
c_omInvolvement: oral mucosal involvement.
c_kneInvolvement: knee and elbow involvement.
c_sInvolvement: scalp involvement.
c_fHistory: family history, (0 or 1).
c_age: Age (linear).
Histopathological Attributes: (take values 0, 1, 2, 3)
h_mIncontinence: melanin incontinence.
h_eitInfiltrate: eosinophils in the infiltrate.
h_pInfiltrate: PNL infiltrate.
h_fotpDermis: fibrosis of the papillary dermis.
h_exocytosis: exocytosis.
h_acanthosis: acanthosis.
h_hyperkeratosis: hyperkeratosis.
h_parakeratosis: parakeratosis.
h_cotrRidges: clubbing of the rete ridges.
h_eotrRidges: elongation of the rete ridges.
h_totsEpidermis: thinning of the suprapapillary epidermis.
h_sPustule: spongiform pustule.
h_mMicroabcess: munro microabcess.
h_fHypergranulosis: focal hypergranulosis.
h_dotgLayer: disappearance of the granular layer.
h_vndobLayer: vacuolisation and damage of basal layer.
h_spongiosis: spongiosis.
h_saoRetes: saw-tooth appearance of retes.
h_fhPlug: follicular horn plug.
h_pParakeratosis: perifollicular parakeratosis.
h_imInflitrate: inflammatory monoluclear inflitrate.
h_bInfiltrate: band-like infiltrate.
The data set is read directly from the repository URL. Since dermatology.data does not come with a header row, it is loaded with header=None and the columns are then renamed. The dermatology.names file is referred to for the meaning of the attributes whenever necessary.
import pandas as pd
import numpy as np
# Read Dermatology CSV data from url
url="http://archive.ics.uci.edu/ml/machine-learning-databases/\
dermatology/dermatology.data"
df = pd.read_csv(url,header=None)
# Rename columns
df.columns = ['c_erythema','c_scaling','c_dBorders','c_itching',
'c_kPhenomenon','c_pPapules','c_fPapules','c_omInvolvement',
'c_kneInvolvement','c_sInvolvement','c_fHistory',
'h_mIncontinence','h_eitInfiltrate','h_pInfiltrate',
'h_fotpDermis','h_exocytosis','h_acanthosis',
'h_hyperkeratosis','h_parakeratosis','h_cotrRidges',
'h_eotrRidges','h_totsEpidermis','h_sPustule','h_mMicroabcess',
'h_fHypergranulosis','h_dotgLayer','h_vndobLayer','h_spongiosis',
'h_saoRetes','h_fhPlug','h_pParakeratosis','h_imInflitrate',
'h_bInfiltrate','c_age','class']
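As an alternative to renaming after loading, pandas.read_csv can assign the column names and mark "?" as missing in a single call. The sketch below is an illustration only: it parses three made-up rows from an in-memory string rather than the real file, and the four-column names list is a shortened, hypothetical stand-in for the full list above.

```python
import io
import pandas as pd

# Three made-up rows in the same comma-separated, header-less layout,
# shortened to four columns; "?" marks a missing age.
sample = io.StringIO("2,0,55,2\n3,1,?,1\n1,0,26,3\n")

# Hypothetical short column list for the sample only.
names = ["c_erythema", "c_fHistory", "c_age", "class"]

# names= labels the columns; na_values="?" turns the marker into NaN
toy = pd.read_csv(sample, header=None, names=names, na_values="?")
print(toy.dtypes)
print(toy["c_age"].isna().sum())
```

With this approach the Age column arrives as a floating-point column with NaN for the gaps, so no string replacement is needed before imputing.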
First, we confirmed that the feature types matched the description as outlined in the documentation.
# Check and change data type for each column if necessary
print("Dimensions of data set:")
print(df.shape)
print("\nData types of data set:")
print(df.dtypes)
# load column names
colnm = list(df)
# Display column statistics for sanity check
for col in colnm:
    print(df[col].value_counts().sort_index(), '\n')
There are no invalid values for any attribute except Age (c_age), which has eight missing values recorded as "?".
Therefore, these values will be cleaned and replaced with the average of the known ages as follows.
# 1. replace "?" in the Age feature with 0 (regex=False treats "?" literally),
#    and change the data type of the Age feature to integers
df['c_age'] = df['c_age'].str.replace("?", "0", regex=False).astype(int)
# 2. calculate the average age over rows with age > 0, rounded to an integer
average_age = int(round(df.loc[df['c_age'] > 0, 'c_age'].mean()))
# 3. replace rows having 0 age with the average age
df['c_age'] = df['c_age'].replace(0, average_age)
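The same mean imputation can also be written without a 0 placeholder, which would matter if a genuine age of 0 were ever recorded. This is a minimal sketch on made-up values; the helper name impute_mean_age is hypothetical and not part of the notebook.

```python
import pandas as pd

def impute_mean_age(ages):
    """Convert an age column containing "?" to integers, filling gaps with the rounded mean."""
    numeric = pd.to_numeric(ages, errors="coerce")   # "?" becomes NaN
    mean_age = int(round(numeric.mean()))            # mean() skips NaN by default
    return numeric.fillna(mean_age).astype(int)

# Made-up ages with two missing entries
toy_ages = pd.Series(["55", "?", "26", "40", "?"])
print(impute_mean_age(toy_ages).tolist())
```

Here the three known ages average to about 40.3, so both gaps are filled with 40.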
Confirm that there are no missing or invalid values left for the Age feature
print('Any missing or invalid values left for the Age feature:')
print('--------------------------------------------------------')
print((df['c_age'] < 1).any())
# display properties
print("Dimensions of data set:")
print(df.shape)
print("\nData types of data set:")
print(df.dtypes)
Show first 3 rows of data after pre-processing
df.head(3)
Show last 3 rows of data after pre-processing
df.tail(3)
There are no outliers in this data set.
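The report does not state how outliers were checked. One common convention, assumed here purely for illustration, is Tukey's rule, which flags values more than 1.5 interquartile ranges outside the quartiles. The helper name iqr_outliers and the ages below are made up.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up ages; 120 is deliberately extreme
toy_ages = pd.Series([22, 25, 27, 30, 33, 35, 36, 40, 44, 120])
print(iqr_outliers(toy_ages).tolist())
```

Applied to the Age column, an empty result would be consistent with the statement above. The rule is less meaningful for the ordinal features, whose only possible values are 0 to 3.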
There are altogether 34 descriptive features and 1 target feature: 32 ordinal, 1 linear and 2 categorical. A histogram is used for the single linear feature because it helps to reveal the shape of the underlying distribution. Bar charts are used for the ordinal and categorical features because they depict the proportion of each category.
The matplotlib library is used to create these plots.
Firstly, define a table for column descriptions
# column name
col_desc = { 'colName': ['c_erythema','c_scaling','c_dBorders','c_itching',
'c_kPhenomenon','c_pPapules','c_fPapules',
'c_omInvolvement','c_kneInvolvement','c_sInvolvement',
'c_fHistory','h_mIncontinence','h_eitInfiltrate',
'h_pInfiltrate','h_fotpDermis','h_exocytosis',
'h_acanthosis','h_hyperkeratosis','h_parakeratosis',
'h_cotrRidges','h_eotrRidges','h_totsEpidermis',
'h_sPustule','h_mMicroabcess','h_fHypergranulosis',
'h_dotgLayer','h_vndobLayer','h_spongiosis','h_saoRetes',
'h_fhPlug','h_pParakeratosis','h_imInflitrate',
'h_bInfiltrate','c_age','class'],
'colDesc': ['erythema','scaling','definite borders','itching',
'koebner phenomenon','polygonal papules',
'follicular papules','oral mucosal involvement',
'knee and elbow involvement','scalp involvement',
'family history, (0 or 1)','melanin incontinence',
'eosinophils in the infiltrate','PNL infiltrate',
'fibrosis of the papillary dermis','exocytosis',
'acanthosis','hyperkeratosis','parakeratosis',
'clubbing of the rete ridges',
'elongation of the rete ridges',
'thinning of the suprapapillary epidermis',
'spongiform pustule','munro microabcess',
'focal hypergranulosis',
'disappearance of the granular layer',
'vacuolisation and damage of basal layer','spongiosis',
'saw-tooth appearance of retes','follicular horn plug',
'perifollicular parakeratosis',
'inflammatory monoluclear inflitrate',
'band-like infiltrate','Age (linear)',
'class of dermatology']}
dfColDesc = pd.DataFrame(col_desc)
import matplotlib.pyplot as plt
# load column names
colnm = list(df)
# Plot a bar chart for each ordinal/categorical column and a histogram for Age
for i in range(0, len(colnm)):
    # look up the description by column label rather than by position
    title = dfColDesc.loc[i, 'colDesc'] + '\n'
    if colnm[i] != 'c_age':
        df[colnm[i]].value_counts().sort_index().plot(kind='bar', fontsize=10)
        if colnm[i] != 'class':
            plt.xlabel('Relative Amount of Presence', fontsize=10)
        else:
            plt.xlabel('', fontsize=10)
    else:
        df[colnm[i]].plot(kind='hist', bins=16)
        plt.xlabel('years old', fontsize=10)
    plt.title(title, fontsize=14)
    plt.ylabel('Frequency', fontsize=10)
    plt.show()
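The loop above produces one figure per column. Where a more compact view is wanted, the bar charts could be arranged on a single grid. The sketch below is one possible layout, not the notebook's own: plot_bar_grid is a hypothetical helper and the data are made up.

```python
import math
import matplotlib.pyplot as plt
import pandas as pd

def plot_bar_grid(df, columns, ncols=3):
    """Draw one value-count bar chart per column on a shared grid and return the figure."""
    nrows = math.ceil(len(columns) / ncols)
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        df[col].value_counts().sort_index().plot(kind="bar", ax=ax, title=col)
        ax.set_ylabel("Frequency")
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)   # hide unused cells in the last row
    fig.tight_layout()
    return fig

# Made-up scores for four columns
toy = pd.DataFrame({
    "c_erythema": [2, 3, 1, 2, 2],
    "c_scaling": [1, 2, 2, 0, 3],
    "c_itching": [0, 0, 3, 1, 0],
    "c_fHistory": [0, 0, 1, 0, 0],
})
fig = plot_bar_grid(toy, list(toy.columns))
```

Calling plt.show() afterwards displays the grid; the Age histogram would still be drawn separately.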
These plots reveal that although every patient has been diagnosed with an erythemato-squamous disease, the patient's family usually has no history of any of these diseases. Moreover, for most of the 11 clinical features, the majority of patients show no presence of the feature; the exceptions are erythema, scaling and definite borders.
Hence, further diagnosis is performed by examining skin samples under a microscope for the presence of histopathological features. Nevertheless, the results are similar: for most histopathological features, the majority of patients show no presence of the feature; the exceptions are acanthosis, parakeratosis and inflammatory mononuclear infiltrate.
In addition, most patients in this data set are between 20 and 50 years old (see the histogram for Age), and psoriasis is the most common class of dermatology.
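Observations of this kind can be backed with numbers by computing, for each feature, the share of patients scored 0. The sketch below uses made-up data; share_absent is a hypothetical helper, and the CLASS_NAMES mapping assumes that class codes 1 to 6 follow the order in which the diseases are listed in dermatology.names, which should be verified against that file.

```python
import pandas as pd

# Assumed mapping: class codes follow the order in which dermatology.names
# lists the diseases; verify against the file before relying on it.
CLASS_NAMES = {
    1: "psoriasis", 2: "seborrheic dermatitis", 3: "lichen planus",
    4: "pityriasis rosea", 5: "chronic dermatitis", 6: "pityriasis rubra pilaris",
}

def share_absent(df, columns):
    """Fraction of rows in which each listed feature is scored 0 (not present)."""
    return (df[columns] == 0).mean().sort_values(ascending=False)

# Made-up rows for illustration
toy = pd.DataFrame({
    "c_erythema": [2, 3, 1, 2],
    "c_fHistory": [0, 0, 0, 1],
    "h_acanthosis": [2, 0, 1, 0],
    "class": [1, 1, 3, 2],
})
share = share_absent(toy, ["c_erythema", "c_fHistory", "h_acanthosis"])
class_counts = toy["class"].map(CLASS_NAMES).value_counts()
print(share)
print(class_counts)
```

Features with a share above 0.5 are those that are absent in the majority of patients.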
from pandas.plotting import scatter_matrix
scatter_matrix(df,alpha=0.2,figsize=(10,10),diagonal='hist')
plt.suptitle('Scatter matrix for all numerical attributes')
plt.show()
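A scatter matrix over 35 mostly ordinal columns is hard to read, so a numeric summary can complement it. The sketch below ranks attribute pairs by Spearman correlation, computed as the Pearson correlation of the ranks. It is an illustrative addition on made-up data, and strongest_rank_correlations is a hypothetical helper.

```python
import pandas as pd

def strongest_rank_correlations(df, top=3):
    """List the attribute pairs with the largest absolute Spearman correlation."""
    corr = df.rank().corr()          # Pearson on ranks equals Spearman
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:       # each unordered pair once
            pairs.append((a, b, corr.loc[a, b]))
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:top]

# Made-up ordinal scores
toy = pd.DataFrame({
    "c_erythema": [0, 1, 2, 3, 3, 2],
    "c_scaling":  [0, 1, 2, 3, 2, 2],
    "c_itching":  [3, 0, 1, 0, 2, 1],
})
top_pairs = strongest_rank_correlations(toy, top=2)
for a, b, rho in top_pairs:
    print(a, b, round(rho, 2))
```

Rank correlation suits ordinal scores because it depends only on their ordering, not on the spacing between the codes.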
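There will be 3 pair-wise scatter plots, one each for the clinical attributes, the histopathological attributes (together with the class of dermatology), and Age vs class of dermatology.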
import seaborn as sns
# load column names
cols = list(df)
# Prepare Clinical Attributes for pair-wise Scatter Plots:
# the first 11 columns plus Age (column 33)
cols_clinical = cols[0:11] + [cols[33]]
# Pair-wise Scatter Plots
# (height= and fill= are the current seaborn names for the older size= and shade=)
pp = sns.pairplot(df[cols_clinical], height=1.5, aspect=1.5,
                  plot_kws=dict(edgecolor="k", linewidth=0.4),
                  diag_kind="kde", diag_kws=dict(fill=True))
fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.2)
t = fig.suptitle('Clinical Attributes Pairwise Plots', fontsize=40)
plt.show()
# load column names
cols = list(df)
# Prepare Histopathological Attributes for pair-wise Scatter Plots:
# columns 11 to 32 plus the class (column 34)
cols_histo = cols[11:33] + [cols[34]]
# Pair-wise Scatter Plots
# (height= and fill= are the current seaborn names for the older size= and shade=)
pp = sns.pairplot(df[cols_histo], height=1.5, aspect=1.5,
                  plot_kws=dict(edgecolor="k", linewidth=0.4),
                  diag_kind="kde", diag_kws=dict(fill=True))
fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.2)
t = fig.suptitle('Histopathological Attributes Pairwise Plots', fontsize=60)
plt.show()
# Scatter plot for "Age vs class of dermatology"
plt.scatter(df['c_age'], df['class'])
plt.title('Age vs class of dermatology', fontsize=20)
# look up the axis labels by column label rather than by position
plt.ylabel(dfColDesc.loc[34, 'colDesc'], fontsize=16)
plt.xlabel(dfColDesc.loc[33, 'colDesc'], fontsize=16)
plt.show()
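Because the class takes only six values, points in this scatter plot overlap heavily. A per-class summary of Age gives the same information in tabular form. This is a minimal sketch on made-up rows; age_summary_by_class is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def age_summary_by_class(df):
    """Count, minimum, median and maximum age for each dermatology class."""
    return df.groupby("class")["c_age"].agg(["count", "min", "median", "max"])

# Made-up ages and class codes
toy = pd.DataFrame({
    "c_age": [34, 52, 41, 30, 45, 27],
    "class": [1, 1, 1, 6, 6, 3],
})
summary = age_summary_by_class(toy)
print(summary)
```

Each row of the result describes the age range of one class, which makes differences between classes easier to compare than in the plot.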
Interesting observations for the plot - Age vs class of dermatology.
In Phase I, the data set was cleaned of missing values.
From the data exploration, the univariate plots reveal that although every patient has been diagnosed with an erythemato-squamous disease, the patient's family usually has no history of any of these diseases. Moreover, for most of the 11 clinical features, the majority of patients show no presence of the feature; the exceptions are erythema, scaling and definite borders.
Hence, further diagnosis is performed by examining skin samples under a microscope for the presence of histopathological features. Nevertheless, the results are similar: for most histopathological features, the majority of patients show no presence of the feature; the exceptions are acanthosis, parakeratosis and inflammatory mononuclear infiltrate.
In addition, most patients in this data set are between 20 and 50 years old (see the histogram for Age), and psoriasis is the most common class of dermatology.
Due to the large number of ordinal attributes, the scatter matrix and pair-wise scatter plots are quite crowded. Nevertheless, some interesting facts emerge from the multivariate visualisation of "Age vs class of dermatology". They are as follows.