Document generation date: 2016-03-19 18:54:38

Executive summary

This document presents the process of training and evaluating different aggregation operators and thresholding strategies for medical diagnosis support under data incompleteness.

Introduction

In the Analytic datasets construction document we described how to obtain two datasets, each consisting of lists of interval-valued diagnoses contained in \([0,1]\).

In this document we describe how to choose and evaluate aggregation strategies in a two-step scheme. Firstly, we utilize the dataset with simulated missing values to optimize the different aggregation strategies. Secondly, we use the dataset with truly missing values for performance evaluation.

For the sake of readability, some code is externalized into separate R files.

rm(list=ls())

source("config.R")
source("stats.R")
source("methods.R")
source("aggregators.R")
source("aggregators-optimize.R")
source("mcn-test.R")
source("utils.R")

library(parallel)
library(reshape2)
library(dplyr)
library(boot)

Overview of supplementary files

Initial configuration

An initial setup for the process is stored in the config.R file. Besides setting a seed for random procedures (the SEED constant) and constants for input/output files and directories (the *.DIR/*.FILE/*.LOCATION constants), there are some values which are important for the evaluation process.

As the measure of classification effectiveness we choose PERFORMANCE.MEASURE = Cost matrix. Since lower values of this measure mean better performance, PERFORMANCE.MEASURE.DESC = FALSE.

To parallelize the calculation process, THREADS = 32 threads will be used.

To estimate the uncertainty of the final models' performance, we perform stratified bootstrapping with BOOTSTRAP_R = 500 replications.
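
For orientation, a minimal sketch of the evaluation-related part of config.R might look as follows (the values shown are illustrative; the actual file also defines the remaining constants):

# illustrative sketch of the evaluation-related constants in config.R
SEED                     = 1234           # seed for random procedures (illustrative value)
PERFORMANCE.MEASURE      = "Cost matrix"  # statistic used to compare methods
PERFORMANCE.MEASURE.DESC = FALSE          # lower values mean better performance
THREADS                  = 32             # number of parallel workers
BOOTSTRAP_R              = 500            # bootstrap replications for uncertainty estimation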

Diagnostic models

The uncertaintified diagnostic models are implemented in the methods.R file and they are briefly described in the Analytic datasets construction document.

Aggregation operators and thresholding strategies

Aggregation operators and thresholding strategies are implemented in aggregators.R, aggregators-helpers.R and aggregators-optimize.R files. A description of the operators is beyond the technical scope of this document.

Statistics

All statistical indicators are implemented in the stats.R file. Most of them are standard statistics: accuracy, sensitivity and specificity. Those with the suffix .all are adjusted variants that extend a statistic to three cases: malignant, benign and no diagnosis.

The statistics operate on six possible outcomes of a diagnosis:

  • TP - true positive,
  • TN - true negative,
  • FN - false negative,
  • FP - false positive,
  • N0 - no prediction when tumor was benign,
  • N1 - no prediction when tumor was malignant.

Remark: in reality, a classifier or an aggregation method may only output NA in case of no prediction; the distinction between N0 and N1 is introduced in the experiment only for the purpose of the performance evaluation.

Decisiveness means the ability to make a diagnosis (\(0\) - a model never gives a prediction, \(1\) - a model always gives a prediction). The cost matrix is the sum of the counts of the six possible outcomes weighted with weights provided by an expert.
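
As an illustration of these two indicators, given the counts of the six outcomes for a single method, decisiveness and the cost matrix statistic could be computed as below (both the counts and the weights are placeholders, not the values used in the experiment):

# illustrative counts of the six outcomes for one method
counts = c(TP=50, TN=80, FP=10, FN=5, N0=3, N1=2)

# decisiveness: fraction of cases for which any diagnosis was made
decisiveness = unname(1 - (counts["N0"] + counts["N1"]) / sum(counts))

# cost matrix: weighted sum of the outcome counts
# (placeholder weights; the actual weights are provided by an expert)
costs = c(TP=0, TN=0, FP=1, FN=5, N0=0.5, N1=2.5)
cost.matrix = sum(costs * counts[names(costs)])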

Efficacy calculations

Functions for efficacy calculation of the models and the aggregators are implemented in utils.R. This calculation process is described later in the document.

Auxiliary functions

Debugging functions are implemented in utils.R file. A function for McNemar’s test is stored in mcn-test.R file.
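
The exact implementation of mcn.test is not reproduced here; a minimal sketch of a paired comparison of two methods' outcome vectors, assuming each vector is first reduced to correct (TP/TN) versus incorrect decisions and passed to stats::mcnemar.test, could be:

# hedged sketch of a McNemar-type comparison of two outcome vectors
mcn.test.sketch = function(outcomes.a, outcomes.b)
{
    # reduce each method's outcome to a correct (TP/TN) vs incorrect decision
    correct.a = factor(outcomes.a %in% c("TP", "TN"), levels=c(FALSE, TRUE))
    correct.b = factor(outcomes.b %in% c("TP", "TN"), levels=c(FALSE, TRUE))

    # paired 2x2 contingency table and McNemar's chi-squared test
    mcnemar.test(table(correct.a, correct.b))$p.value
}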

Loading the data

The datasets are stored in CSV files.

printDebug("read datasets")

colClass = c("factor", "numeric", "integer", "integer",
             rep("numeric", times=2*length(METHODS)))

ds.training = read.csv(TRAINING.LOCATION, colClasses=colClass)
ds.test     = read.csv(TEST.LOCATION,     colClasses=colClass)

As described in the Analytic datasets construction document, the same structure is used for both the training and test sets (the ds.training and ds.test data frames, respectively; in the previous step these data were stored in the training.data and test.data variables). The first \(4\) columns of a data frame indicate the following:

  1. PatientId - a patient’s identifier in the original database,
  2. ObscureLevel - a percentage of a patient’s attributes with missing data,
  3. ObscureRepeat - a number of the iteration for a given ObscureLevel (only valid in simulation phase),
  4. MalignancyCharacter - an actual diagnosis for a given patient.

The remaining columns contain the lower and upper bounds produced by each uncertaintified diagnostic model.
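
For illustration, the bounds of a single patient can be rearranged into the 2-row matrix consumed later by the aggregation operators (assuming METHODS.NAME holds the model names in the same order as the columns):

# illustrative: bounds of all models for the first training patient, arranged as
# the 2-row matrix used by the aggregation operators (row 1 = lower, row 2 = upper)
bounds = matrix(unlist(ds.training[1, 5:(5 + 2*length(METHODS) - 1)]), nrow=2)
colnames(bounds) = METHODS.NAME
rownames(bounds) = c("min", "max")
bounds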

A sample overview of the datasets follows:

head(ds.training, n=3)
  PatientId ObscureLevel ObscureRepeat MalignancyCharacter   LR1.min   LR1.max   LR2.min   LR2.max SM.min
1    Sms409            0             1                   0 0.3256547 0.3256547 0.3199750 0.3199750  0.125
2    Sms413            0             1                   0 0.3545262 0.3545262 0.3353151 0.3353151  0.125
3    Sms087            0             1                   0 0.4717678 0.4717678 0.4552104 0.4552104  0.375
  SM.max    Tim.min    Tim.max Alc.min Alc.max  RMI.min  RMI.max
1  0.125 0.04860564 0.04860564       0       0 0.119275 0.119275
2  0.125 0.03753158 0.03753158       0       0 0.000000 0.000000
3  0.375 0.15890362 0.15890362       0       0 0.000000 0.000000
tail(ds.test, n=3)
    PatientId ObscureLevel ObscureRepeat MalignancyCharacter   LR1.min   LR1.max    LR2.min   LR2.max SM.min
173     Sz421         0.15             1                   1 0.7101402 0.7101402 0.82713077 0.8271308 0.7000
174     Sz443         0.10             1                   1 0.4557590 0.4557590 0.40919018 0.4091902 0.5500
175     Sz304         0.35             1                   0 0.2008450 0.6862746 0.08283644 0.6065683 0.1875
    SM.max   Tim.min   Tim.max   Alc.min   Alc.max RMI.min   RMI.max
173  0.850 0.1196401 0.7786164 0.3333333 0.5000000 0.00000 1.0000000
174  0.650 0.0394265 0.3611866 0.0000000 0.1666667 0.09675 0.0967500
175  0.375 0.0000000 0.7271850 0.0000000 0.5000000 0.00000 0.5348837

Setup for parallel calculations

The whole procedure is very time-consuming. In order to get results in a reasonable time, we make use of parallel calculations. In this step we prepare the environment to be capable of such calculations (export variables to workers, etc.). All time-consuming calculations are done by means of the usedLapply function (either single-threaded or parallel).

printDebug("paralell init")

if (THREADS > 1)
{
    CL = makeCluster(THREADS, outfile="")

    clusterExport(CL, c("SEED", "OBSUCRE.REPEAT"))
    clusterCall(CL, function(){ set.seed(SEED) })

    clusterExport(cl=CL, list('AGGREGATORS', 'AGGREGATORS.NAME', 'METHODS',
                              'METHODS.NAME', 'ds.training', 'ds.test',
                              'diagnosisToOutcome', 'CUTOFF.CRISP', 'W.AUC',
                              'COMMON.PART', 'INTERVAL.INTERSECTION'))

    usedLapply = function(...){ parLapply(CL, ...) }
} else {
    usedLapply = lapply
}

Training phase

In this step the ds.training dataset is processed.

Models

For reference purposes we check how the diagnostic models, both original and uncertaintified, perform on the simulated training set.

Firstly, for each patient we calculate the outputs of each single diagnostic model. This is achieved in several consecutive steps:

  • *.models.outcomes for each row (patient) returns the result of each of the 6 diagnostic models; if the upper bound is less than \(0.5\), then a model predicts a benign tumor; if the lower bound is greater than or equal to \(0.5\), then a model predicts a malignant tumor; for each particular case the output is one of the outcomes mentioned above (TP, TN, etc.), as illustrated in the sketch after this list.
  • aggregate.outcomes combines the results of *.models.outcomes, namely for each obscuration level and obscuration repeat it gathers outcomes of each model,
  • calculate.stats takes the results of aggregate.outcomes and calculates performance statistics with respect to the obscuration level and models.
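
A minimal sketch of this decision rule and of the mapping to the six outcomes (the actual implementation resides in methods.R and utils.R and may differ in details):

# hedged sketch of the interval thresholding rule and the outcome mapping
interval.to.decision = function(lower, upper)
{
    if (upper < 0.5)  return(0)   # benign predicted
    if (lower >= 0.5) return(1)   # malignant predicted
    return(NA)                    # interval straddles 0.5: no prediction
}

decision.to.outcome = function(decision, actual)
{
    if (is.na(decision)) return(ifelse(actual == 1, "N1", "N0"))
    if (decision == 1)   return(ifelse(actual == 1, "TP", "FP"))
    return(ifelse(actual == 1, "FN", "TN"))
}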

Afterwards the results are converted into a long format using the melt function. Finally, a column is renamed to be more descriptive and three columns are added to later distinguish the diagnostic models from the aggregation operators.

The following code produces results for the original models:

printDebug("training statistics models original")

training.stats.models.orig = melt(
        calculate.stats(
            aggregate.outcomes(
                orig.models.outcomes(ds.training)
            )
        ),
        id.vars=c("ObscureLevel"),
        variable.name="Method",
        value.name="Value") %>%
    rename(Measure=L1) %>%
    mutate(Class="Model", Subclass="Original", Subsubclass=NA)

training.stats.models.orig$Method = paste("orig.", training.stats.models.orig$Method)

The following code produces results for the uncertaintified models:

printDebug("training statistics models uncertaintified")

training.stats.models.uncer = melt(
        calculate.stats(
            aggregate.outcomes(
                unc.models.outcomes(ds.training)
            )
        ),
        id.vars=c("ObscureLevel"),
        variable.name="Method",
        value.name="Value") %>%
    rename(Measure=L1) %>%
    mutate(Class="Model", Subclass="Uncertaintified", Subsubclass=NA)

training.stats.models.uncer$Method = paste("unc.", training.stats.models.uncer$Method)

Aggregation operators and thresholding strategies

Secondly, a similar procedure is performed for the aggregation operators:

  • the outcomes.aggrs data frame is constructed by applying each aggregation operator to the simulation data; diags is a vector of zeros and ones, which are converted to the six possible outcomes mentioned above (TP, TN, etc.),
  • aggregate.outcomes combines the results of outcomes.aggrs,
  • calculate.stats takes the results of aggregate.outcomes and calculates performance statistics with respect to the obscuration level and aggregation operators.

Afterwards the results are converted into a long format using the melt function. Finally, a column is renamed to be more descriptive and three columns are added to later identify the aggregation operators.

printDebug("training statistics aggregators")

outcomes.aggrs = usedLapply(1:length(AGGREGATORS), function(i){

    aggr = AGGREGATORS[[i]]
    # selection of appropriate columns
    diags = apply(ds.training[, 5:(5+length(METHODS)*2-1)], 1, function(row) {
        # matrix in format required by aggregation method is created and passed into aggr
        return(aggr(matrix(row, nrow=2)))
    })
    converted = apply(cbind(diags, ds.training$MalignancyCharacter),
                      1, diagnosisToOutcome)
    return(converted)
})
outcomes.aggrs        = data.frame(ds.training[, 1:3], outcomes.aggrs)
names(outcomes.aggrs) = c(names(ds.training)[1:3], AGGREGATORS.NAME)

training.stats.aggrs = melt(calculate.stats(
                            aggregate.outcomes(outcomes.aggrs)
                        ),
                        id.vars = "ObscureLevel" ,
                        variable.name = "Method",
                        value.name = "Value") %>%
                   rename(Measure=L1)

training.stats.aggrs = suppressWarnings( # suppress different factor levels warning
                            left_join(training.stats.aggrs,
                                      AGGREGATORS.BINDED.DESCRIPTION,
                                      by="Method")
                       )

Performance calculation

The simulation results for the models and aggregation operators are combined into one data frame.

printDebug("training statistics bind")

training.stats.all = bind_rows(training.stats.models.orig,
                               training.stats.models.uncer,
                               training.stats.aggrs)

For a given statistic, the performance of each model and aggregation operator is calculated as a mean value of the statistic over all considered levels of missing data.

Finally, three columns are added to identify the models and aggregation operators.

printDebug("training statistics performance calculation")

training.stats.all.perf = aggregate(training.stats.all$Value,
                                    list(Method=training.stats.all$Method,
                                         Measure=training.stats.all$Measure),
                                    mean) %>%
                          rename(Value=x)

training.stats.all.perf = left_join(training.stats.all.perf,
                                    distinct(select(training.stats.all,
                                              Method, Class, Subclass, Subsubclass)),
                                    by="Method")

The aggregation operators and thresholding strategies are optimised with regard to their numerical parameters. As the last step of the training phase, all statistics for the models and aggregation strategies are combined.
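
The selection itself is performed by getOptimizedAggregators (aggregators-optimize.R). Its exact logic is not reproduced here; one plausible sketch, assuming that parametrised variants of a strategy share the same Subclass/Subsubclass description and that the best variant is the one with the lowest mean cost, is:

# hedged sketch: per family of parametrised aggregators, keep the variant
# with the best mean value of the chosen performance measure
# (dplyr is already loaded at the top of this document)
getOptimizedAggregators.sketch = function(perf, measure)
{
    best = perf %>%
        filter(Measure == measure, Class == "Aggregation") %>%
        group_by(Subclass, Subsubclass) %>%
        slice(which.min(Value)) %>%   # lower cost is better (PERFORMANCE.MEASURE.DESC = FALSE)
        ungroup()
    as.character(best$Method)
}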

printDebug("select optimized aggregators")

optimizedAggregatorsNames = getOptimizedAggregators(training.stats.all.perf, PERFORMANCE.MEASURE)

training.stats.all.perf = subset(training.stats.all.perf,
                                 Method %in% c(optimizedAggregatorsNames,
                                             unique(as.character(training.stats.models.orig$Method)),
                                             unique(as.character(training.stats.models.uncer$Method))))

Test phase

In this step the ds.test dataset is processed.

After optimisation of the aggregation strategies on the training data, we will evaluate the models and the aggregation operators on the data where patients’ features are missing due to decisions of a physician.

The procedure of performance calculation is very similar, hence the same functions will be used. Because the ds.test dataset is relatively small, this time we do not distinguish between different obscuration levels; we calculate the statistics over all cases. To perform this operation, we set the same missing data level for all patients (this step only adjusts the data to the interface of the functions and has no impact on the final results).

printDebug("test combine obscuration levels")

ds.test$ObscureLevel = 0

Models

For reference purposes we check how the diagnostic models, both original and uncertaintified, perform on the test set.

As in the training phase, for each patient we calculate the outputs of each single diagnostic model. This is achieved in several consecutive steps:

  • *.models.outcomes for each row (patient) returns the result of each of the 6 diagnostic models; this can be one of the outcomes mentioned above (TP, TN, etc.).
  • aggregate.outcomes combines the results of *.models.outcomes,
  • calculate.stats takes the results of aggregate.outcomes and calculates performance statistics with respect to the models.

Afterwards the results are converted into a long format using the melt function. A column is renamed to be more descriptive and three columns are added to distinguish the diagnostic models from the aggregation operators. Finally, the ObscureLevel column is removed, since it is not used in the evaluation step.

The following code produces results for the original models:

printDebug("test statistics models original")

outcomes.orig.models = orig.models.outcomes(ds.test)

test.stats.orig.models = melt(
                            calculate.stats(
                                aggregate.outcomes(
                                    outcomes.orig.models
                                )
                            ),
                            id.vars="ObscureLevel",
                            variable.name="Method",
                         value.name="Value") %>%
                         rename(Measure=L1) %>%
                         mutate(Class="Model", Subclass="Original", Subsubclass=NA) %>%
                         select(-ObscureLevel)

test.stats.orig.models$Method = paste("orig.", test.stats.orig.models$Method)
colnames(outcomes.orig.models)[4:(4+length(METHODS)-1)] =
    paste("orig.", colnames(outcomes.orig.models))[4:(4+length(METHODS)-1)]

boot.test.stats.orig.models = boot(outcomes.orig.models, bootStat, BOOTSTRAP_R,
                                   strata=ds.test$MalignancyCharacter)

test.stats.orig.models$CI = sapply(1:nrow(test.stats.orig.models),
   function(i){getBootCI(boot.test.stats.orig.models, i)})

The following code produces results for the uncertaintified models:

printDebug("test statistics models uncertaintified")

outcomes.models.unc = unc.models.outcomes(ds.test)

test.stats.uncer.models = melt(
                            calculate.stats(
                                aggregate.outcomes(
                                    outcomes.models.unc
                                )
                            ),
                            id.vars="ObscureLevel",
                            variable.name="Method",
                            value.name="Value") %>%
                          rename(Measure=L1) %>%
                          mutate(Class="Model", Subclass="Uncertaintified", Subsubclass=NA) %>%
                          select(-ObscureLevel)

test.stats.uncer.models$Method = paste("unc.", test.stats.uncer.models$Method)
colnames(outcomes.models.unc)[4:(4+length(METHODS)-1)] =
    paste("unc.", colnames(outcomes.models.unc))[4:(4+length(METHODS)-1)]

boot.test.stats.uncer.models = boot(outcomes.models.unc, bootStat, BOOTSTRAP_R,
                                    strata=ds.test$MalignancyCharacter)

test.stats.uncer.models$CI = sapply(1:nrow(test.stats.uncer.models),
    function(i){getBootCI(boot.test.stats.uncer.models, i)})

Both steps perform stratified bootstrapping to estimate performance uncertainty.
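
The bootStat and getBootCI helpers are defined in utils.R. A rough sketch of their expected shape, assuming bootStat recomputes the statistics on a resampled copy of the outcome table and getBootCI formats a percentile confidence interval for a single statistic, could be:

# hedged sketch of the bootstrap helpers used with boot()
bootStat.sketch = function(outcomes, indices)
{
    # recompute all statistics on the resampled outcomes and return them
    # as a flat numeric vector (one element per Method x Measure pair)
    resampled = outcomes[indices, ]
    melted = melt(calculate.stats(aggregate.outcomes(resampled)),
                  id.vars="ObscureLevel", variable.name="Method", value.name="Value")
    melted$Value
}

getBootCI.sketch = function(boot.out, i)
{
    # 95% percentile confidence interval of the i-th bootstrapped statistic,
    # formatted so it can be stored in a single data frame column
    ci = boot.ci(boot.out, type="perc", index=i)
    paste0("[", round(ci$percent[4], 3), ", ", round(ci$percent[5], 3), "]")
}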

Aggregation operators and thresholding strategies

A similar procedure is performed for the aggregation strategies:

  • the outcomes.aggrs data frame is constructed by applying each aggregation operator to the test data; diags is a vector of zeros and ones, which are converted to the six possible outcomes mentioned above (TP, TN, etc.),
  • aggregate.outcomes combines the results of outcomes.aggrs,
  • calculate.stats takes the results of aggregate.outcomes and calculates performance statistics with respect to the aggregation operators.

Afterwards the results are converted into a long format using the melt function. A column is renamed to be more descriptive and the ObscureLevel column is removed, since it is not used in the evaluation step. Finally, three columns are added to identify the aggregation operators.

printDebug("test statistics aggregators")

aggregators.from.training = unique(subset(training.stats.all.perf, Class=="Aggregation")$Method)

if (THREADS > 1)
    clusterExport(CL, c("aggregators.from.training"))

outcomes.aggrs = usedLapply(1:length(aggregators.from.training), function(j){
    i = which(aggregators.from.training[j] == AGGREGATORS.NAME)
    aggr = AGGREGATORS[[i]]
    # selection of appropriate columns
    diags = apply(ds.test[, 5:(5+length(METHODS)*2-1)], 1, function(row) {
        # matrix in format required by aggregation method is created and passed into aggr
        return(aggr(matrix(row, nrow=2)))
    })
    converted = apply(cbind(diags, ds.test$MalignancyCharacter),
                      1, diagnosisToOutcome)
    return(converted)
})
outcomes.aggrs        = data.frame(ds.test[, 1:3], outcomes.aggrs)
names(outcomes.aggrs) = c(names(ds.test)[1:3], aggregators.from.training)

test.stats.aggrs = melt(calculate.stats(
                                aggregate.outcomes(outcomes.aggrs)
                            ),
                            id.vars = "ObscureLevel" ,
                            variable.name = "Method",
                            value.name = "Value") %>%
                       rename(Measure=L1) %>%
                       select(-ObscureLevel)

test.stats.aggrs = suppressWarnings( # suppress different factor levels warning
                        left_join(test.stats.aggrs,
                                  AGGREGATORS.BINDED.DESCRIPTION,
                                  by="Method")
                   )

boot.test.stats.aggrs = boot(outcomes.aggrs, bootStat, BOOTSTRAP_R,
                             strata=ds.test$MalignancyCharacter)

test.stats.aggrs$CI = sapply(1:nrow(test.stats.aggrs),
    function(i){getBootCI(boot.test.stats.aggrs, i)})

This step also performs stratified bootstrapping to estimate performance uncertainty.

Binding results

Eventually, we combine the statistics of the models and aggregation strategies obtained on the test set.

printDebug("test statistics bind")

test.stats.all = bind_rows(test.stats.orig.models,
                           test.stats.uncer.models,
                           test.stats.aggrs)

Converting results for summary

To achieve a better overview of the results, several statistics are bound together.

To the statistics obtained on the test set we add those achieved on the training set.

printDebug("test statistics performance bind with training")

binded.stats.all.perf = left_join(select(training.stats.all.perf, Method, Measure, Value),
                                  test.stats.all,
                                  by=c("Method", "Measure")) %>%
                        rename(Value.training=Value.x, Value.test=Value.y)

Secondly, we convert statistics from long to wide format.

printDebug("convert statistics performance to wide format")

training.stats.all.perf.wide = dcast(training.stats.all.perf,
                                     Method + Class + Subclass + Subsubclass ~ Measure,
                                     value.var="Value")

test.stats.all.wide          = dcast(test.stats.all,
                                     Method + Class + Subclass + Subsubclass ~ Measure,
                                     value.var="Value")

As the last step of the test phase, we select a collection of promising aggregation strategies which might satisfy the medical criteria and be implemented for further performance validation. We also perform McNemar's test among the selected aggregation strategies and in relation to the uncertaintified models.

printDebug("aggregators selection and statistical tests")

selected.aggrs = subset(test.stats.all.wide,
                        Class=="Aggregation" &
                            Decisiveness>=0.95 &
                            Decisiveness<1.0 &
                            Sensitivity>Specificity &
                            Sensitivity>=0.90 &
                            Specificity>0.8)

perf.selected.aggrs = subset(binded.stats.all.perf,
                             Measure==PERFORMANCE.MEASURE &
                                 Method %in% selected.aggrs$Method)


perf.all = rbind(perf.selected.aggrs,
                 subset(binded.stats.all.perf,
                        Measure==PERFORMANCE.MEASURE & Class=="Model" & Subclass=="Uncertaintified"))

outcomes.all = bind_cols(select(outcomes.models.unc, -(PatientId:ObscureRepeat)),
                         select(outcomes.aggrs,      -(PatientId:ObscureRepeat)))

pvals = sapply(subset(perf.all, Class=="Aggregation")$Method, function(a1) {
    sapply(perf.all$Method, function(a2) {
        mcn.test(outcomes.all[a1], outcomes.all[a2])
    })
})

pvals[upper.tri(pvals, diag=TRUE)] = NA

pvals = matrix(p.adjust(pvals, method = "BH",
                        n=sum(!is.na(pvals)|is.nan(pvals))),
               nrow=nrow(pvals),
               dimnames=list(rownames(pvals), colnames(pvals)))

Clean up parallel calculations

When all calculations are done, we can close all connections to parallel workers.

printDebug("parallel shutdown")

if (THREADS > 1)
    stopCluster(CL)

Saving the results

Lastly, we save all results (and the environment) to a binary R file.

printDebug("save evaluation")

save.image(EVALUATION.OUTPUT.LOCATION)