Introduction

The Story

According to the dermatology.names file, "The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very little differences. The diseases in this group are psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, cronic dermatitis, and pityriasis rubra pilaris.

Usually a biopsy is necessary for the diagnosis but unfortunately these diseases share many histopathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of another disease at the beginning stage and may have the characteristic features at the following stages. Patients were first evaluated clinically with 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the sample under a microscope."

Objective

The objective of this project is to determine the type of erythemato-squamous disease (i.e. the dermatology class) from the values of a patient's clinical and histopathological features. The data sets were sourced from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/dermatology. The project has two phases. Phase I, covered in this report, focuses on data pre-processing and exploration; model building will be presented in Phase II. The rest of this report is organised as follows. Section 2 describes the data sets and their attributes. Section 3 covers data pre-processing. Section 4 explores each attribute and the relationships between attributes. The last section presents a brief summary. Compiled from a Jupyter Notebook, this report contains both the narrative and the Python code used for data pre-processing and exploration.

Data Sets

The UCI Machine Learning Repository provides two files, dermatology.data and dermatology.names. dermatology.data contains 366 rows of data with thirty-four attributes (descriptive features) and one target feature; dermatology.names documents them.

Thirty-two of the attributes are ordinal, valued from 0 to 3, where 0 indicates that the feature is not present, 3 indicates the largest amount possible, and 1 and 2 are relative intermediate values.

One attribute (family history) is nominal with a value of either 0 or 1, where 1 indicates that at least one of these diseases has been observed in the patient's family and 0 indicates otherwise. The remaining attribute, Age, is linear.

There is also one target feature with values from 1 to 6, each representing a different dermatology class.

Age has eight missing values. During pre-processing they are replaced with the average of the known Age values.

This data set will be used for both training and testing in Phase II, when several models will be built and the best one selected for classifying the dermatology class of future patients.

Target Feature

The target feature of this data set is the dermatology class, summarized as follows.

  • 1 - psoriasis (112 instances)
  • 2 - seboreic dermatitis (61)
  • 3 - lichen planus (72)
  • 4 - pityriasis rosea (49)
  • 5 - cronic dermatitis (52)
  • 6 - pityriasis rubra pilaris (20)
  • class: dermatology class code 1 to 6.
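
The class counts listed above are uneven, which may matter when the data is split for training and testing in Phase II. The following minimal sketch turns the listed counts into percentage shares and a largest-to-smallest ratio. The variable names are illustrative only, and the counts are copied from the list rather than read from the file.

```python
# Instance counts per class, copied from the list above
# (class names spelled as in dermatology.names).
class_counts = {
    "psoriasis": 112,
    "seboreic dermatitis": 61,
    "lichen planus": 72,
    "pityriasis rosea": 49,
    "cronic dermatitis": 52,
    "pityriasis rubra pilaris": 20,
}

total = sum(class_counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {name: round(100.0 * n / total, 1)
          for name, n in class_counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = (max(class_counts.values())
                   / float(min(class_counts.values())))

for name, n in class_counts.items():
    print("%-26s %3d  %5.1f%%" % (name, n, shares[name]))
print("total: %d, largest/smallest class: %.1f" % (total, imbalance_ratio))
```

The counts add up to the 366 rows of the data set, and the largest class (psoriasis) is 5.6 times the size of the smallest (pityriasis rubra pilaris).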

Descriptive Features

There are 12 clinical and 22 histopathological attributes, thirty-four descriptive features altogether. Their descriptions, taken from the dermatology.names file, are as follows.

Clinical Attributes: (take values 0, 1, 2, 3, unless otherwise indicated)

  • c_erythema: erythema.
  • c_scaling: scaling.
  • c_dBorders: definite borders.
  • c_itching: itching.
  • c_kPhenomenon: koebner phenomenon.
  • c_pPapules: polygonal papules.
  • c_fPapules: follicular papules.
  • c_omInvolvement: oral mucosal involvement.
  • c_kneInvolvement: knee and elbow involvement.
  • c_sInvolvement: scalp involvement.
  • c_fHistory: family history, (0 or 1).
  • c_age: Age (linear).

Histopathological Attributes: (take values 0, 1, 2, 3)

  • h_mIncontinence: melanin incontinence.
  • h_eitInfiltrate: eosinophils in the infiltrate.
  • h_pInfiltrate: PNL infiltrate.
  • h_fotpDermis: fibrosis of the papillary dermis.
  • h_exocytosis: exocytosis.
  • h_acanthosis: acanthosis.
  • h_hyperkeratosis: hyperkeratosis.
  • h_parakeratosis: parakeratosis.
  • h_cotrRidges: clubbing of the rete ridges.
  • h_eotrRidges: elongation of the rete ridges.
  • h_totsEpidermis: thinning of the suprapapillary epidermis.
  • h_sPustule: spongiform pustule.
  • h_mMicroabcess: munro microabcess.
  • h_fHypergranulosis: focal hypergranulosis.
  • h_dotgLayer: disappearance of the granular layer.
  • h_vndobLayer: vacuolisation and damage of basal layer.
  • h_spongiosis: spongiosis.
  • h_saoRetes: saw-tooth appearance of retes.
  • h_fhPlug: follicular horn plug.
  • h_pParakeratosis: perifollicular parakeratosis.
  • h_imInflitrate: inflammatory monoluclear inflitrate.
  • h_bInfiltrate: band-like infiltrate.

Data Pre-processing

Preliminaries

The dermatology.data file is read directly from the repository URL into a pandas DataFrame. Since dermatology.data does not come with a header row, it is loaded with header=None and the columns are then renamed. The dermatology.names file is referred to for the meaning of the attributes whenever necessary.

In [1]:
import pandas as pd
import numpy as np

# Read Dermatology CSV data from url
url="http://archive.ics.uci.edu/ml/machine-learning-databases/\
dermatology/dermatology.data"
df = pd.read_csv(url,header=None)

# Rename columns
df.columns = ['c_erythema','c_scaling','c_dBorders','c_itching',
              'c_kPhenomenon','c_pPapules','c_fPapules','c_omInvolvement',
              'c_kneInvolvement','c_sInvolvement','c_fHistory',
              'h_mIncontinence','h_eitInfiltrate','h_pInfiltrate',
              'h_fotpDermis','h_exocytosis','h_acanthosis',
              'h_hyperkeratosis','h_parakeratosis','h_cotrRidges',
              'h_eotrRidges','h_totsEpidermis','h_sPustule','h_mMicroabcess',
              'h_fHypergranulosis','h_dotgLayer','h_vndobLayer','h_spongiosis',
              'h_saoRetes','h_fhPlug','h_pParakeratosis','h_imInflitrate',
              'h_bInfiltrate','c_age','class']

Data Cleaning and Transformation

First, we confirmed that the feature types matched the description as outlined in the documentation.

In [2]:
# Check the dimensions of the data set and the data type of each column
print("Dimensions of data set:")
print(df.shape)
print("\nData types of data set:")
df.dtypes
Dimensions of data set:
(366, 35)

Data types of data set:
Out[2]:
c_erythema             int64
c_scaling              int64
c_dBorders             int64
c_itching              int64
c_kPhenomenon          int64
c_pPapules             int64
c_fPapules             int64
c_omInvolvement        int64
c_kneInvolvement       int64
c_sInvolvement         int64
c_fHistory             int64
h_mIncontinence        int64
h_eitInfiltrate        int64
h_pInfiltrate          int64
h_fotpDermis           int64
h_exocytosis           int64
h_acanthosis           int64
h_hyperkeratosis       int64
h_parakeratosis        int64
h_cotrRidges           int64
h_eotrRidges           int64
h_totsEpidermis        int64
h_sPustule             int64
h_mMicroabcess         int64
h_fHypergranulosis     int64
h_dotgLayer            int64
h_vndobLayer           int64
h_spongiosis           int64
h_saoRetes             int64
h_fhPlug               int64
h_pParakeratosis       int64
h_imInflitrate         int64
h_bInfiltrate          int64
c_age                 object
class                  int64
dtype: object

Display column statistics for

  • sanity check, and,
  • better understanding of the data and its distribution characteristics.
In [3]:
# Display the value counts of every column for a sanity check
for col in df.columns:
    print(df[col].value_counts().sort_index())
    print('')
0      4
1     57
2    215
3     90
Name: c_erythema, dtype: int64 

0      8
1    111
2    195
3     52
Name: c_scaling, dtype: int64 

0     59
1     93
2    168
3     46
Name: c_dBorders, dtype: int64 

0    118
1     72
2    100
3     76
Name: c_itching, dtype: int64 

0    224
1     70
2     54
3     18
Name: c_kPhenomenon, dtype: int64 

0    297
1      1
2     41
3     27
Name: c_pPapules, dtype: int64 

0    333
1     11
2     16
3      6
Name: c_fPapules, dtype: int64 

0    299
1      9
2     45
3     13
Name: c_omInvolvement, dtype: int64 

0    251
1     28
2     64
3     23
Name: c_kneInvolvement, dtype: int64 

0    264
1     30
2     56
3     16
Name: c_sInvolvement, dtype: int64 

0    320
1     46
Name: c_fHistory, dtype: int64 

0    296
1      8
2     46
3     16
Name: h_mIncontinence, dtype: int64 

0    324
1     33
2      9
Name: h_eitInfiltrate, dtype: int64 

0    235
1     69
2     55
3      7
Name: h_pInfiltrate, dtype: int64 

0    312
1      8
2     23
3     23
Name: h_fotpDermis, dtype: int64 

0    118
1     57
2    129
3     62
Name: h_exocytosis, dtype: int64 

0     10
1     71
2    210
3     75
Name: h_acanthosis, dtype: int64 

0    227
1     90
2     44
3      5
Name: h_hyperkeratosis, dtype: int64 

0     86
1    118
2    132
3     30
Name: h_parakeratosis, dtype: int64 

0    252
1     19
2     61
3     34
Name: h_cotrRidges, dtype: int64 

0    198
1     23
2     95
3     50
Name: h_eotrRidges, dtype: int64 

0    256
1     19
2     60
3     31
Name: h_totsEpidermis, dtype: int64 

0    296
1     38
2     26
3      6
Name: h_sPustule, dtype: int64 

0    286
1     37
2     33
3     10
Name: h_mMicroabcess, dtype: int64 

0    295
1     13
2     43
3     15
Name: h_fHypergranulosis, dtype: int64 

0    273
1     30
2     49
3     14
Name: h_dotgLayer, dtype: int64 

0    294
1      3
2     43
3     26
Name: h_vndobLayer, dtype: int64 

0    199
1     28
2     96
3     43
Name: h_spongiosis, dtype: int64 

0    294
1      5
2     40
3     27
Name: h_saoRetes, dtype: int64 

0    344
1     10
2      8
3      4
Name: h_fhPlug, dtype: int64 

0    345
1      4
2     13
3      4
Name: h_pParakeratosis, dtype: int64 

0     13
1     85
2    206
3     62
Name: h_imInflitrate, dtype: int64 

0    289
1      3
2     22
3     52
Name: h_bInfiltrate, dtype: int64 

0      1
10     7
12     3
13     2
15     2
16     5
17     5
18     9
19     6
20     8
21     3
22    15
23     3
24     2
25    14
26     3
27    16
28     5
29     3
30    13
31     2
32     6
33    12
34     8
35    14
36    16
37     2
38     3
39     2
40    17
      ..
42    10
43     4
44     5
45     7
46     6
47     6
48     5
49     1
50    17
51     7
52    11
53     2
55    14
56     5
57     2
58     1
60    11
61     2
62     7
63     1
64     1
65     2
67     1
68     1
7      4
70     4
75     1
8      7
9      2
?      8
Name: c_age, Length: 61, dtype: int64 

1    112
2     61
3     72
4     49
5     52
6     20
Name: class, dtype: int64 

Summary of Statistics

No attribute contains invalid values except Age (c_age), which has

  • eight rows with value "?", and,
  • one row with value 0.
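
The same check can also be expressed programmatically instead of reading the value counts by eye. The sketch below is one possible way to do it, shown on a made-up three-column frame rather than the real data; the helper names invalid_counts and invalid_ages are hypothetical and not part of the original notebook.

```python
import pandas as pd


def invalid_counts(df, allowed):
    """Count, per column, the values that fall outside the allowed set."""
    return {col: int((~df[col].isin(values)).sum())
            for col, values in allowed.items()}


def invalid_ages(age):
    """Count ages that are non-numeric (e.g. "?") or not strictly positive."""
    numeric = pd.to_numeric(age, errors="coerce")
    return int((numeric.isna() | (numeric <= 0)).sum())


# Toy rows, NOT the real data: one "?" and one 0 in the age column.
toy = pd.DataFrame({
    "c_erythema": [2, 3, 0, 1],
    "c_fHistory": [0, 1, 0, 0],
    "c_age": ["55", "?", "0", "8"],
})

# Documented ranges: 0 to 3 for ordinal features, 0 or 1 for family history.
allowed = {"c_erythema": [0, 1, 2, 3], "c_fHistory": [0, 1]}

bad_values = invalid_counts(toy, allowed)
bad_ages = invalid_ages(toy["c_age"])
print(bad_values)
print("invalid ages: %d" % bad_ages)
```

On the real data frame, the allowed dictionary would list all 32 ordinal columns with the values 0 to 3 and c_fHistory with the values 0 and 1.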

Missing and Invalid Values Handling

These values are therefore replaced with the average of the known ages, as follows.

  • replace "?" in the Age feature with 0 and change the data type of the Age feature to integer,
  • calculate the average age over the rows with age > 0, and,
  • replace every 0 age with the average age.
In [4]:
# 1. replace rows having "?" in the Age feature with 0, and,
#    change data type of the Age feature to integers
#    (Series.replace matches the whole value, so "?" is never
#    interpreted as a regular expression)
df['c_age'] = df['c_age'].replace("?", "0").astype(int)

# 2. calculate average age for rows with age > 0
#    (mean() returns a float, so the rounding is not lost to
#    integer division)
average_age = int(round(df.loc[df['c_age'] > 0, 'c_age'].mean()))

# 3. replace rows having 0 age with average age
df['c_age'] = df['c_age'].replace(0, average_age)
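
As an alternative sketch of the same three steps, the "?" marker can be turned into a missing value while the file is read, using the na_values argument of pd.read_csv. The example runs on a made-up two-column extract (age and class) held in a string, not on the real file, and toy_csv, toy and mean_age are illustrative names.

```python
import io

import pandas as pd

# Toy two-column extract (age, class), NOT the real file.
toy_csv = u"55,2\n?,1\n0,3\n26,3\n40,1\n"

# na_values="?" makes pandas read the "?" marker as NaN.
toy = pd.read_csv(io.StringIO(toy_csv), header=None,
                  names=["c_age", "class"], na_values="?")

# Treat an age of 0 as missing as well.
toy["c_age"] = toy["c_age"].mask(toy["c_age"] == 0)

# mean() skips NaN, so only the known ages contribute.
mean_age = int(round(toy["c_age"].mean()))

# Fill the gaps and restore an integer column.
toy["c_age"] = toy["c_age"].fillna(mean_age).astype(int)
print(toy)
```

With this approach the age column is numeric from the start, so no string replacement is needed.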

Confirm that there are no missing or invalid values left for the Age feature

In [5]:
print('More missing and invalid values for the Age feature:')
print('---------------------------------------------------')
any(df.c_age < 1)
More missing and invalid values for the Age feature:
---------------------------------------------------
Out[5]:
False

Data after pre-processing

In [6]:
# display properties
print("Dimensions of data set:")
print(df.shape)
print("\nData types of data set:")
print(df.dtypes)
Dimensions of data set:
(366, 35)

Data types of data set:
c_erythema            int64
c_scaling             int64
c_dBorders            int64
c_itching             int64
c_kPhenomenon         int64
c_pPapules            int64
c_fPapules            int64
c_omInvolvement       int64
c_kneInvolvement      int64
c_sInvolvement        int64
c_fHistory            int64
h_mIncontinence       int64
h_eitInfiltrate       int64
h_pInfiltrate         int64
h_fotpDermis          int64
h_exocytosis          int64
h_acanthosis          int64
h_hyperkeratosis      int64
h_parakeratosis       int64
h_cotrRidges          int64
h_eotrRidges          int64
h_totsEpidermis       int64
h_sPustule            int64
h_mMicroabcess        int64
h_fHypergranulosis    int64
h_dotgLayer           int64
h_vndobLayer          int64
h_spongiosis          int64
h_saoRetes            int64
h_fhPlug              int64
h_pParakeratosis      int64
h_imInflitrate        int64
h_bInfiltrate         int64
c_age                 int32
class                 int64
dtype: object

Show first 3 rows of data after pre-processing

In [7]:
df.head(3)
Out[7]:
c_erythema c_scaling c_dBorders c_itching c_kPhenomenon c_pPapules c_fPapules c_omInvolvement c_kneInvolvement c_sInvolvement ... h_dotgLayer h_vndobLayer h_spongiosis h_saoRetes h_fhPlug h_pParakeratosis h_imInflitrate h_bInfiltrate c_age class
0 2 2 0 3 0 0 0 0 1 0 ... 0 0 3 0 0 0 1 0 55 2
1 3 3 3 2 1 0 0 0 1 1 ... 0 0 0 0 0 0 1 0 8 1
2 2 1 2 3 1 3 0 3 0 0 ... 0 2 3 2 0 0 2 3 26 3

3 rows × 35 columns

Show last 3 rows of data after pre-processing

In [8]:
df.tail(3)
Out[8]:
c_erythema c_scaling c_dBorders c_itching c_kPhenomenon c_pPapules c_fPapules c_omInvolvement c_kneInvolvement c_sInvolvement ... h_dotgLayer h_vndobLayer h_spongiosis h_saoRetes h_fhPlug h_pParakeratosis h_imInflitrate h_bInfiltrate c_age class
363 3 2 2 2 3 2 0 2 0 0 ... 0 3 0 3 0 0 2 3 28 3
364 2 1 3 1 2 3 0 2 0 0 ... 0 2 0 1 0 0 2 3 50 3
365 3 2 2 0 0 0 0 0 3 3 ... 2 0 0 0 0 0 3 0 35 1

3 rows × 35 columns

Data Exploration

Outliers

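No outliers were found in this data set: the value counts above show that every ordinal feature stays within its documented range of 0 to 3 and that, after cleaning, Age ranges from 7 to 75.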
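
One common way to back up a statement about outliers for a linear feature such as Age is the 1.5 x IQR rule of thumb. The sketch below applies it to a handful of made-up ages, not to the real column; the function name iqr_outliers and the factor k are assumptions chosen for illustration, and the original analysis does not state which criterion was used.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    s = pd.Series(values, dtype=float)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]


# Made-up ages; 150 is deliberately implausible.
toy_ages = [7, 22, 30, 36, 40, 50, 75, 150]
outliers = iqr_outliers(toy_ages)
print(outliers.tolist())
```

On these toy values only 150 is flagged; 7 and 75 stay inside the fences.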

Univariate Visualization

There are altogether 34 descriptive features and 1 target feature: 32 ordinal, 1 linear and 2 categorical. A histogram is used for the single linear feature because it helps to reveal the shape of the underlying distribution. Bar charts are used for the ordinal and categorical features because they depict the proportion of each category.

The matplotlib library is used to create these plots.

First, define a table of column descriptions

In [9]:
# column name
col_desc = { 'colName': ['c_erythema','c_scaling','c_dBorders','c_itching',
                         'c_kPhenomenon','c_pPapules','c_fPapules',
                         'c_omInvolvement','c_kneInvolvement','c_sInvolvement',
                         'c_fHistory','h_mIncontinence','h_eitInfiltrate',
                         'h_pInfiltrate','h_fotpDermis','h_exocytosis',
                         'h_acanthosis','h_hyperkeratosis','h_parakeratosis',
                         'h_cotrRidges','h_eotrRidges','h_totsEpidermis',
                         'h_sPustule','h_mMicroabcess','h_fHypergranulosis',
                         'h_dotgLayer','h_vndobLayer','h_spongiosis','h_saoRetes',
                         'h_fhPlug','h_pParakeratosis','h_imInflitrate',
                         'h_bInfiltrate','c_age','class'],
             'colDesc': ['erythema','scaling','definite borders','itching',
                         'koebner phenomenon','polygonal papules',
                         'follicular papules','oral mucosal involvement',
                         'knee and elbow involvement','scalp involvement',
                         'family history, (0 or 1)','melanin incontinence',
                         'eosinophils in the infiltrate','PNL infiltrate',
                         'fibrosis of the papillary dermis','exocytosis',
                         'acanthosis','hyperkeratosis','parakeratosis',
                         'clubbing of the rete ridges',
                         'elongation of the rete ridges',
                         'thinning of the suprapapillary epidermis',
                         'spongiform pustule','munro microabcess',
                         'focal hypergranulosis',
                         'disappearance of the granular layer',
                         'vacuolisation and damage of basal layer','spongiosis',
                         'saw-tooth appearance of retes','follicular horn plug',
                         'perifollicular parakeratosis',
                         'inflammatory monoluclear inflitrate',
                         'band-like infiltrate','Age (linear)',
                         'class of dermatology']}
dfColDesc = pd.DataFrame(col_desc)
In [10]:
import matplotlib.pyplot as plt
In [11]:
# Plot every column: a bar chart for the ordinal and categorical
# features, a histogram for Age
for i, col in enumerate(df.columns):
    # look up the description by label rather than by column position
    title = dfColDesc.loc[i, 'colDesc'] + '\n'
    if col != 'c_age':
        df[col].value_counts().sort_index().plot(kind='bar', fontsize=10)
        if col != 'class':
            plt.xlabel('Relative Amount of Presence', fontsize=10)
        else:
            plt.xlabel('', fontsize=10)
    else:
        df[col].plot(kind='hist', bins=16)
        plt.xlabel('years old', fontsize=10)
    plt.title(title, fontsize=14)
    plt.ylabel('Frequency', fontsize=10)
    plt.show()

Univariate Visualization Summary

These plots reveal that, although every patient in the data set has been diagnosed with an erythemato-squamous disease, usually none of these diseases has been observed in the patient's family. Moreover, for most of the 11 clinical features the most common value is 0 (not present); the exceptions are erythema, scaling and definite borders.

Hence, further diagnosis is performed by examining skin samples under a microscope for the presence of histopathological features. Nevertheless, the results are similar: for most histopathological features the most common value is 0 (not present); the exceptions are exocytosis, acanthosis, parakeratosis and inflammatory mononuclear infiltrate.

In addition, most patients in this data set are between 20 and 50 years old (see the histogram for Age), and psoriasis is the most common dermatology class.
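
Statements about which features are mostly absent can be checked with a small table of the most frequent value per feature. The following is a minimal sketch on two made-up columns, not the real data; most_common_values is a hypothetical helper name.

```python
import pandas as pd


def most_common_values(df):
    """Return, per column, the most frequent value and its share of rows."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts()  # sorted by frequency, descending
        share = round(100.0 * counts.iloc[0] / len(df), 1)
        rows.append((col, counts.index[0], share))
    table = pd.DataFrame(rows, columns=["feature", "mode", "percent"])
    return table.set_index("feature")


# Toy columns, NOT the real data.
toy = pd.DataFrame({
    "c_erythema": [2, 2, 3, 1, 2],
    "c_fPapules": [0, 0, 0, 2, 0],
})

summary = most_common_values(toy)
print(summary)
```

A feature whose mode is 0 is absent more often than it takes any other single value, which is the reading used in the summary above.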

Multivariate Visualisation

Scatter Matrix for all numeric attributes

Because there are many attributes with ordinal data, this plot is quite crowded.

In [12]:
from pandas.plotting import scatter_matrix

scatter_matrix(df,alpha=0.2,figsize=(10,10),diagonal='hist')
plt.suptitle('Scatter matrix for all numerical attributes')
plt.show()
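
A possible complement to a crowded scatter matrix, not part of the original analysis, is a table of rank correlations. Spearman correlation is a reasonable assumption here because most features are ordinal. The sketch below lists the feature pairs whose absolute Spearman coefficient reaches a threshold; it uses four made-up columns (f1 to f4), and the helper name strong_pairs and the threshold of 0.7 are illustrative choices.

```python
from itertools import combinations

import pandas as pd


def strong_pairs(df, threshold=0.7):
    """List column pairs whose absolute Spearman correlation >= threshold."""
    corr = df.corr(method="spearman")
    pairs = []
    for x, y in combinations(corr.columns, 2):
        rho = corr.loc[x, y]
        if abs(rho) >= threshold:
            pairs.append((x, y, round(float(rho), 2)))
    return pairs


# Toy ordinal columns, NOT the real data.
toy = pd.DataFrame({
    "f1": [0, 1, 2, 3, 1, 2],
    "f2": [0, 1, 2, 3, 1, 2],   # identical to f1
    "f3": [3, 2, 1, 0, 2, 1],   # f1 reversed (3 - f1)
    "f4": [1, 3, 0, 2, 2, 0],   # only weakly related to f1
})

pairs = strong_pairs(toy)
for x, y, rho in pairs:
    print("%s vs %s: %+.2f" % (x, y, rho))
```

Only the pairs among f1, f2 and f3 are reported for the toy frame; f4 falls below the threshold.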

Pair-wise Scatter Plots

There will be 3 pair-wise scatter plots, one each for,

  • Clinical Attributes,
  • Histopathological Attributes, and,
  • between Age and class of dermatology.
In [13]:
import seaborn as sns

# load column names
cols = list(df)

# Clinical attributes for the pair-wise scatter plots:
# the first 11 columns plus c_age
cols_clinical = cols[0:11] + ['c_age']

# Pair-wise Scatter Plots
# (assumes seaborn 0.11 or later, where height replaces the older
#  size argument and fill replaces shade)
pp = sns.pairplot(df[cols_clinical], height=1.5, aspect=1.5,
                  plot_kws=dict(edgecolor="k", linewidth=0.4),
                  diag_kind="kde", diag_kws=dict(fill=True))
fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.2)
t = fig.suptitle('Clinical Attributes Pairwise Plots', fontsize=40)
plt.show()
In [14]:
# load column names
cols = list(df)

# Histopathological attributes for the pair-wise scatter plots:
# the 22 h_ columns plus the class column
cols_histo = cols[11:33] + ['class']

# Pair-wise Scatter Plots
# (assumes seaborn 0.11 or later, where height replaces the older
#  size argument and fill replaces shade)
pp = sns.pairplot(df[cols_histo], height=1.5, aspect=1.5,
                  plot_kws=dict(edgecolor="k", linewidth=0.4),
                  diag_kind="kde", diag_kws=dict(fill=True))
fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.2)
t = fig.suptitle('Histopathological Attributes Pairwise Plots', fontsize=60)
plt.show()
In [16]:
# Axis labels: descriptions of the Age and class columns,
# looked up by label rather than by column position
age_desc = dfColDesc.loc[33, 'colDesc']
class_desc = dfColDesc.loc[34, 'colDesc']

# Scatter plot for "Age vs class of dermatology"
plt.scatter(df['c_age'], df['class'])
plt.title('Age vs class of dermatology', fontsize=20)
plt.ylabel(class_desc, fontsize=16)
plt.xlabel(age_desc, fontsize=16)
plt.show()

Multivariate Visualization Summary

Interesting observations from the plot of Age vs class of dermatology:

  • All 6 dermatology classes appear in the 22-year-old bucket.
  • In this data set, no patient older than 22 has class 6 - pityriasis rubra pilaris.
  • Some dermatology classes do not appear until patients are older, e.g. age 16 for class 3 - lichen planus.
  • The number of dermatology classes diminishes at the age of 65 but grows to 4 classes again at the age of 70. This suggests that more data for ages 65 and above might be required.
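
Observations of this kind can be cross-checked numerically with two group-by summaries: the age range per class, and the number of distinct classes per age. The sketch below runs on a few made-up rows, not the real data, and the variable names are illustrative.

```python
import pandas as pd

# Toy rows, NOT the real data.
toy = pd.DataFrame({
    "c_age": [10, 22, 22, 40, 65, 8, 22, 16],
    "class": [6, 6, 1, 1, 1, 6, 3, 3],
})

# Youngest and oldest patient, and number of patients, per class.
age_range = toy.groupby("class")["c_age"].agg(["min", "max", "count"])

# Number of distinct classes observed at each age.
classes_per_age = toy.groupby("c_age")["class"].nunique()

print(age_range)
print(classes_per_age)
```

Applied to the real data frame, the max column of age_range would show the oldest patient in each class, and classes_per_age would show how many classes occur at each age.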

Summary

In Phase I, the data set was cleaned of missing and invalid Age values.

From the data exploration, the plots from the univariate visualisation reveal that, although every patient in the data set has been diagnosed with an erythemato-squamous disease, usually none of these diseases has been observed in the patient's family. Moreover, for most of the 11 clinical features the most common value is 0 (not present); the exceptions are erythema, scaling and definite borders.

Hence, further diagnosis is performed by examining skin samples under a microscope for the presence of histopathological features. Nevertheless, the results are similar: for most histopathological features the most common value is 0 (not present); the exceptions are exocytosis, acanthosis, parakeratosis and inflammatory mononuclear infiltrate.

In addition, most patients in this data set are between 20 and 50 years old (see the histogram for Age), and psoriasis is the most common dermatology class.

Because of the large number of ordinal attributes, the scatter matrix and scatter plots are quite crowded. Nevertheless, the multivariate visualisation of "Age vs class of dermatology" reveals some interesting facts. They are as follows.

  • All 6 dermatology classes appear in the 22-year-old bucket.
  • In this data set, no patient older than 22 has class 6 - pityriasis rubra pilaris.
  • Some dermatology classes do not appear until patients are older, e.g. age 16 for class 3 - lichen planus.
  • The number of dermatology classes diminishes at the age of 65 but grows to 4 classes again at the age of 70. This suggests that more data for ages 65 and above might be required.