Document generation date: 2015-09-18 12:04:47

Executive summary

This document presents the process of obtaining analytic data used for training and evaluation of effectivness of different aggregation operators and thresholding strategies in medical diagnosis support under data incompleteness.

Introduction

Suppose we have a patient with a diagnosed ovarian tumor. Such a patient can be described by a set of features. The physicians have developed many predictive models, which on the base of the features can predict whether the tumor is malignant or benign. The models assume working on a complete set of the features but frequently some of them are missing. A novel approach is to adjust the models to work with incomplete data and summarize the results from different models. We want to study the performance of different aggregation strategies of such models under different level of data incompleteness. In this document we present a process of obtaining such data for further evaluation.

The outline of the process is as follow. In the first phase, we sample patients from two groups of malignancy. Then we randomly obscure (remove) values of their features. Afterwards we compute the results of the adjusted models on the obscured data. We repeat whole procedure many times for each level of data incompleteness. This training set will be used to optimize aggregation operators and thresholding strategies. In the second phase, we compute the results of the adjusted models on the dataset, where incompleteness is an effect of a physician’s decision. This test set will be used to evaluate of the optimized aggregation stretegies.

For sake of readibility, some code is externalized into separate R files.

rm(list=ls())

source("config.R")
source("methods.R")
source("utils.R")

library(dplyr)
library(sampling)

Overview of supplementary files

Initial configuration

An initial setup for the process is stored in config.R file. Besides setting a seed for random procedures (SEED constant) and setting constants for input/output files and directories (*.DIR/*.FILE/*.LOCATION constants), there are several values which are important for the building process.

An input database will be splitted into two parts: the first will be used for training and the second will be used for testing. The first will be size of TRAINING.SIZE = \(200\) patients.

The PROBE.SIZE.NEGATIVE = \(75\) and PROBE.SIZE.POSITIVE = \(75\) constants are the sizes of samples of patients with benign and malignant tumors, respectively. The patients’ attributes for this samples will be obscured during the construcion of the training set.

The OBSCURE.PERCENTAGES vector consists of considered levels of obscuration of the attributes in the simulation step. The values vary from \(0\) to \(0.5\).

The OBSCURE.MAX = \(0.5\) constant indicates the maximum level of obscuration in the simulation step and maximum level of data incompleteness in the evaluation set.

The OBSUCRE.REPEAT = \(1000\) constant is a number of repeats of random attributes obscuration in the simulation step.

Uncertaintified diagnostic models

All uncertaintyfied diagnostic models are implemented in methods.R file. There are \(6\) models stored in METHODS vector. Each of them for any patient (regardless of missing data) produces an interval contained in \([0,1]\). This interval should be interpreted as a set of all possible diagnoses which can be returned by this model for a given patient. Due to missing data in patient’s description, the length of the interval grows. Single point interval indicates no missing data in the patient’s description. Values below \(0.5\) indicate that a tumor is benign; values above or equal to \(0.5\) indicate that a tumor is malignant.

The METHODS.NAME vector consists of short printable names for the models. The METHODS.COL vector contains names of patients’ attributes used by a model.

The COLS.ALL vector is an unique set of all patients’ attributes used by the models. The COLS.SURE vector has names of the attributes which are always available and will not be obscured in the simulation step. The COLS.OBSC has names of the attributes which for a given patient might be incomplete; these attributes will be obscured during the construcion of the training set.

The COLS.SURE vector looks as follows:

print(COLS.SURE)
[1] "HormoneReplacementTherapy" "Age"                       "PainAtExamination"        
[4] "AgeAfterMenopause"         "UterusRemoved"            

Auxiliary functions

Debugging functions are implemented in utils.R file.

Database

The following subsections describes the process of loading the data with patients, preprocessing and division into the training and test sets.

Loading the data

The original patient database is stored in a CSV file. Due to legal restrictions, this database cannot be published. The main objective of this document is to document the process of building the analytic dataset from this undisclosed database.

db = read.csv(DATABASE.LOCATION, header=TRUE)
db = db %>%
     select(GivenId, MalignancyCharacter, one_of(COLS.ALL)) %>%
     rename(PatientId=GivenId)

The initial database contains \(27\) variables and \(441\) cases. The basic structure of this database is following (observe that only important columns for this research are presented):

dbSummary = t(sapply(select(db, -PatientId), 
                function(x){c(
                min=min(x, na.rm=T), 
                median=median(x, na.rm=T), 
                max=max(x, na.rm=T),
                NAs=sum(is.na(x))
             )}))
print(round(dbSummary, 2))
                            min median    max NAs
MalignancyCharacter        0.00   0.00    2.0  12
OvarianCancerInFamily      0.00   0.00    1.0  88
HormoneReplacementTherapy  0.00   0.00    1.0  26
Age                       13.00  48.00   84.0   0
ADimension                10.00  79.00  280.0  13
PainAtExamination          0.00   0.00    1.0   0
Ascites                    0.00   0.00    1.0  21
PapBloodFlow               0.00   0.00    1.0 201
Solid                      0.00   0.00    1.0  92
ASolidDimension            0.00  20.00  300.0 101
InnerWall                  0.00   1.00    1.0   0
Shadow                     0.00   0.00    1.0 114
Color                      1.00   2.00    4.0 103
Septum                     0.00   1.00    1.0  37
SmEchogenicity             0.00   3.00    4.0  36
Location                   0.00   0.00    2.0  35
SmInnerWallThickness       0.00   1.00    2.0  41
TumorVolume                0.52 142.44 4186.4  23
Pap                        0.00   1.00    1.0  51
APapDimension              0.00   2.00   51.0 177
SeptumThickness            0.00   2.00   11.0 153
AgeAfterMenopause          0.00   0.00   38.0   0
Ca125                      0.53  39.53 6326.0  68
Ri                         0.00   0.56    1.0 249
UterusRemoved              0.00   0.00    1.0  24
IotaQuality                1.00   4.00    6.0  28

The MalignancyCharacter feature indicates whether a patient has a benign (\(0\)), malignant (\(1\)) or borderline (\(2\)) tumor. A few attributes mentioned in COLS.SURE have missing values - this is caused by the fact that at the time of data collection physician did not know that such information should be obtained from patients.

The sonographic attributes are described in the paper: D. Timmerman, L. Valentin, et al (2000). Terms, definitions and measurements to describe the sonographic features of adnexal tumors: a consensus opinion from the International Ovarian Tumor Analysis (IOTA) Group. Ultrasound in Obstetrics & Gynecology, 16(5), 500-505.

Preprocessing

In the first step we combine malignant with borderline tumors.

Secondly, the database does not contain some values which can be inferred from other ones. An example of such value is a dimension of papilary projections (APapDimension), which is missing in case of no papilary projections in the tumor (Pap==0). Such values are inserted in database.

Missing Ri values are inserted either when a physician decided not to perform this examination due to its obvious result or in case of avascular endometriomas.

db[with(db, which(MalignancyCharacter == 2)), "MalignancyCharacter"] = 1

db[with(db, which(is.na(PapBloodFlow)    & Pap == 0)),    "PapBloodFlow"]    = 0
db[with(db, which(is.na(APapDimension)   & Pap == 0)),    "APapDimension"]   = 0
db[with(db, which(is.na(SeptumThickness) & Septum == 0)), "SeptumThickness"] = 0
db[with(db, which(grepl("^Sms", PatientId) & is.na(Ri))), "Ri"] = 1
db[with(db, which(grepl("^Sz",  PatientId) & is.na(Ri) & Color == 1)), "Ri"] = 1

Division into training and test sets

The database is splitted into two sets:

  1. db.training - it consists of \(200\) patients with the complete attributes; this data will be used to construct the training set,
  2. db.test - it consists of patients with no more than \(50\%\) incomplete attribues and a few patients with complete attributes; this data will be used to construct the test set.

The choose of the patients with complete attributes is done by use of a stratified simple random sampling without replacement.

db.completeCases   = filter(db, complete.cases(db))
db.incompleteCases = filter(db, !complete.cases(db) &
                                complete.cases(db[, c("MalignancyCharacter", COLS.SURE)]))

inTrainingSet.benignSize    = round(nrow(filter(db.completeCases, MalignancyCharacter==0)) *
                                    TRAINING.SIZE/nrow(db.completeCases))
inTrainingSet.malignantSize = round(nrow(filter(db.completeCases, MalignancyCharacter==1)) *
                                    TRAINING.SIZE/nrow(db.completeCases))

db.completeCases = arrange(db.completeCases, MalignancyCharacter) # sort before strata

inTrainingSet = strata(db.completeCases, "MalignancyCharacter",
                       c(inTrainingSet.benignSize, inTrainingSet.malignantSize),
                       "srswor")$ID_unit

inTestSet = apply(select(db.incompleteCases, one_of(COLS.OBSC)), 1,
                  function(row) {sum(is.na(row))/length(row) <= OBSCURE.MAX})

db.training = db.completeCases[inTrainingSet, ]
db.test = bind_rows(db.completeCases[-inTrainingSet, ], # add a few remaining compl. cases
                    db.incompleteCases[inTestSet, ])

The following figure shows a distribution of patients with regard to the percentage of missing values in the dataset:

Sets summary

The resulting db.training has \(109\) benign and \(91\) malignant cases.

fdSummary = t(sapply(select(db.training, -PatientId), 
                function(x){c(
                min=min(x, na.rm=T), 
                median=median(x, na.rm=T), 
                max=max(x, na.rm=T),
                NAs=sum(is.na(x))
             )}))
print(round(fdSummary, 2))
                            min median     max NAs
MalignancyCharacter        0.00   0.00    1.00   0
OvarianCancerInFamily      0.00   0.00    1.00   0
HormoneReplacementTherapy  0.00   0.00    1.00   0
Age                       13.00  47.00   84.00   0
ADimension                25.00  83.00  200.00   0
PainAtExamination          0.00   0.00    1.00   0
Ascites                    0.00   0.00    1.00   0
PapBloodFlow               0.00   0.00    1.00   0
Solid                      0.00   0.00    1.00   0
ASolidDimension            0.00  16.50  150.00   0
InnerWall                  0.00   1.00    1.00   0
Shadow                     0.00   0.00    1.00   0
Color                      1.00   3.00    4.00   0
Septum                     0.00   1.00    1.00   0
SmEchogenicity             0.00   3.00    4.00   0
Location                   0.00   0.00    2.00   0
SmInnerWallThickness       0.00   1.00    2.00   0
TumorVolume                4.40 205.19 4186.40   0
Pap                        0.00   1.00    1.00   0
APapDimension              0.00   1.00   51.00   0
SeptumThickness            0.00   2.00    8.00   0
AgeAfterMenopause          0.00   0.00   33.00   0
Ca125                      0.53  57.84 4909.19   0
Ri                         0.21   0.65    1.00   0
UterusRemoved              0.00   0.00    1.00   0
IotaQuality                1.00   4.00    6.00   0

The resulting db.test has \(123\) benign and \(52\) malignant cases.

odSummary = t(sapply(select(db.test, -PatientId), 
                function(x){c(
                min=min(x, na.rm=T), 
                median=median(x, na.rm=T), 
                max=max(x, na.rm=T),
                NAs=sum(is.na(x))
             )}))
print(round(odSummary, 2))
                           min median    max NAs
MalignancyCharacter        0.0   0.00    1.0   0
OvarianCancerInFamily      0.0   0.00    1.0  39
HormoneReplacementTherapy  0.0   0.00    1.0   0
Age                       14.0  44.00   82.0   0
ADimension                28.0  70.00  230.0   0
PainAtExamination          0.0   0.00    1.0   0
Ascites                    0.0   0.00    1.0   1
PapBloodFlow               0.0   0.00    1.0  65
Solid                      0.0   0.00    1.0  49
ASolidDimension            0.0  20.00  300.0  53
InnerWall                  0.0   1.00    1.0   0
Shadow                     0.0   0.00    1.0  58
Color                      1.0   1.00    4.0  56
Septum                     0.0   0.00    1.0   3
SmEchogenicity             0.0   3.00    4.0   3
Location                   0.0   0.00    1.0   2
SmInnerWallThickness       0.0   0.50    2.0   5
TumorVolume                4.1 121.60 4186.4   0
Pap                        0.0   1.00    1.0  11
APapDimension              0.0   0.00   51.0  26
SeptumThickness            0.0   0.00   11.0  11
AgeAfterMenopause          0.0   0.00   38.0   0
Ca125                      4.3  29.35 3126.0  42
Ri                         0.0   1.00    1.0  56
UterusRemoved              0.0   0.00    1.0   0
IotaQuality                1.0   3.00    6.0   2

The following figure shows a distribution of classes in the training and test sets:

Building the datasets

Because we can not share the initial database, we will create and provide simplified preprocessed datasets. From this moment, by the training and test sets we mean the calculated outputs of the uncertaintified diagnostic models.

The same structure is used for both the training and test sets (training.data and test.data data frames, respectively). First \(4\) columns of a data frame indicate the following:

  1. PatientId - a patient’s identifier in the original database,
  2. ObscureLevel - a percentage of a patient’s attributes with missing data,
  3. ObscureRepeat - a number of the iteration for a given ObscureLevel (in the simulation step),
  4. MalignancyCharacter - an actual diagnosis for a given patient.

Next 12 columns contain lower and upper bounds of a given uncertaintified diagnostic model.

Training set

The code below fills the training.data data frame with values obtained by the application of the diagnostic models to the obscured patient data. For each obscuration level and each repeat, a subset of the database is randomly obscured. Then the output is calculated for the uncertaintified diagnostic models. The results are stored sequentialy in the data frame.

Observe that dataset contain also data without obscuration (ObscureLevel==0 for the top rows). Although some missing values can be filled like in Preprocessing step, for the simplicity and clarity of the simulation process we leave them as they are.

training.data = data.frame()

for (obsc.lvl in OBSCURE.PERCENTAGES)
{
    for (obsc.rep in 1:OBSUCRE.REPEAT)
    {
        printDebug(paste(obsc.lvl, obsc.rep))

        # sample the database to obtain balanced dataset
        db.training.sample = bind_rows(sample_n(filter(db.training, MalignancyCharacter==0),
                                                PROBE.SIZE.NEGATIVE),
                                       sample_n(filter(db.training, MalignancyCharacter==1),
                                                PROBE.SIZE.POSITIVE))

        if (obsc.lvl > 0)
        {
            naMat = matrix(0, nrow=nrow(db.training.sample), ncol=length(COLS.OBSC))

            # put NA values into the matrix to obtain required percentage
            # of missing values
            naMat[sample(1:length(naMat), floor(obsc.lvl*length(naMat)))] = NA

            # set NA value in random places in patient database
            # (NA added to any value results in NA)
            db.training.sample[, COLS.OBSC] = db.training.sample[, COLS.OBSC] + naMat
        }

        results = do.call(cbind,
                          sapply(seq_along(METHODS),
                                 function(m)
                                 {
                                   method = METHODS[[m]]
                                   name   = METHODS.NAME[[m]]
                                   cols   = METHODS.COL[[m]]

                                   t(apply(db.training.sample[cols], 1, method))
                                 },
                                 simplify=FALSE))

        training.data = bind_rows(training.data,
                                  with(db.training.sample,
                                       data.frame(PatientId,
                                                  ObscureLevel=obsc.lvl,
                                                  ObscureRepeat=obsc.rep,
                                                  MalignancyCharacter,
                                                  results)))
    }
}

The sample structure of the resulting simulated dataset for the uncertaintified diagnostic models is following:

head(as.data.frame(training.data), n=3)
  PatientId ObscureLevel ObscureRepeat MalignancyCharacter   LR1.min   LR1.max   LR2.min   LR2.max SM.min
1    Sms409            0             1                   0 0.3256547 0.3256547 0.3199750 0.3199750  0.125
2    Sms413            0             1                   0 0.3545262 0.3545262 0.3353151 0.3353151  0.125
3    Sms087            0             1                   0 0.4717678 0.4717678 0.4552104 0.4552104  0.375
  SM.max    Tim.min    Tim.max Alc.min Alc.max  RMI.min  RMI.max
1  0.125 0.04860564 0.04860564       0       0 0.119275 0.119275
2  0.125 0.03753158 0.03753158       0       0 0.000000 0.000000
3  0.375 0.15890362 0.15890362       0       0 0.000000 0.000000

Test set

Similarly, the code below fills the test.data data frame with values obtained by the application of the diagnostic models to the patients with the incomplete features. In this case lack of data is an effect of a physician’s decision.

ObscureLevel is calculated as the real percentage of missing data. ObscureRepeat is fixed and set to \(1\).

results = do.call(cbind,
                  sapply(seq_along(METHODS),
                         function(m)
                         {
                            method = METHODS[[m]]
                            name   = METHODS.NAME[[m]]
                            cols   = METHODS.COL[[m]]

                            t(apply(db.test[cols], 1, method))
                         },
                         simplify=FALSE))

obscLevels = apply(select(db.test, one_of(COLS.OBSC)), 1,
                   function(row){round(sum(is.na(row))/length(row), 2)})

test.data = with(db.test, data.frame(PatientId,
                                     ObscureLevel=obscLevels,
                                     ObscureRepeat=1,
                                     MalignancyCharacter,
                                     results))

The sample structure of the resulting evaluation dataset for the uncertaintified diagnostic models is following:

tail(test.data, n=3)
    PatientId ObscureLevel ObscureRepeat MalignancyCharacter   LR1.min   LR1.max    LR2.min   LR2.max SM.min
173     Sz421         0.15             1                   1 0.7101402 0.7101402 0.82713077 0.8271308 0.7000
174     Sz443         0.10             1                   1 0.4557590 0.4557590 0.40919018 0.4091902 0.5500
175     Sz304         0.35             1                   0 0.2008450 0.6862746 0.08283644 0.6065683 0.1875
    SM.max   Tim.min   Tim.max   Alc.min   Alc.max RMI.min   RMI.max
173  0.850 0.1196401 0.7786164 0.3333333 0.5000000 0.00000 1.0000000
174  0.650 0.0394265 0.3611866 0.0000000 0.1666667 0.09675 0.0967500
175  0.375 0.0000000 0.7271850 0.0000000 0.5000000 0.00000 0.5348837

Saving the results

The constructed datasets are stored as separate CSV files.

write.csv(training.data, TRAINING.LOCATION, row.names=FALSE)
write.csv(test.data,     TEST.LOCATION,     row.names=FALSE)