Document generation date: 2016-03-19 18:54:38
This document presents the process of training and evaluation of different aggregation operators and thresholding strategies in medical diagnosis support under data incompleteness.
In the Analytic datasets construction document we described how to obtain two sets, which consist of lists of interval-valued diagnoses contained in \([0,1]\):
In the following document we describe how to choose and evaluate aggregation strategies in a two-step scheme. Firstly, we utilize the dataset with simulated missing values to optimize different aggregation strategies. Secondly, we use the dataset with truly missing values to performance evaluation.
For sake of readibility, some code is externalized into separate R files.
rm(list=ls())
source("config.R")
source("stats.R")
source("methods.R")
source("aggregators.R")
source("aggregators-optimize.R")
source("mcn-test.R")
source("utils.R")
library(parallel)
library(reshape2)
library(dplyr)
library(boot)
An initial setup for the process is stored in config.R
file. Besides setting a seed for random procesures (SEED
constant) and setting constants for input/output files and directories (*.DIR
/*.FILE
/*.LOCATION
constants), there are some values which are important for the evaluation process.
As a measure of classification effectiveness we choose PERFORMANCE.MEASURE
= Cost matrix. Since the lower values of this measure mean better performance, PERFORMANCE.MEASURE.DESC
= FALSE.
To parallelize calculation process there will be used THREADS
= 32 threads.
To estimate uncertainty performance of final models, we perform stratified bootstraping with BOOTSTRAP_R
= 500 replications.
The uncertaintified diagnostic models are implemented in methods.R
file and they are breifly described in the Analytic datasets construction document.
Aggregation operators and thresholding strategies are implemented in aggregators.R
, aggregators-helpers.R
and aggregators-optimize.R
files. A description of the operators is beyond the technical scope of this document.
All statistical indicators are implemented in stats.R
file. Most of them are standard statistics: accuracy, sensitivity and specificity. These with suffix .all
are adjusted variations to extend statistic to three cases: malignant, benign, no diagnosis.
The statistic operates on six possible outcomes of the diagnosis:
TP
- true positive,TN
- true negative,FN
- false negative,FP
- false positive,N0
- no prediction when tumor was benign,N1
- no prediction when tumor was malignant.Remark: in reality, a classifier or an aggregation method may only output NA
in case of no prediction; the distinction between N0
and N1
is introduced in the experiment only for the purpose of the performance evaluation.
A decisiveness means the ability to make a diagnosis (\(0\) - a model never gives a prediction, \(1\) - a model always gives a prediction). A cost matrix results a sum of the six possible outcomes with a given weights provided by an expert.
Functions for efficacy calculation of the models and the aggregators are implementd in utils.R
. This calculation process will be described later in the document.
Debugging functions are implemented in utils.R
file. A function for McNemar’s test is stored in mcn-test.R
file.
The datasets are stored in CSV files.
printDebug("read datasets")
colClass = c("factor", "numeric", "integer", "integer",
rep("numeric", times=2*length(METHODS)))
ds.training = read.csv(TRAINING.LOCATION, colClasses=colClass)
ds.test = read.csv(TEST.LOCATION, colClasses=colClass)
As it is written in the Analytic datasets construction document, the same structure is used for both the training and test sets (ds.training
and ds.test
data frames, respectively; in the previous step these data were stored in training.data
and test.data
variables). First \(4\) columns of a data frame indicate the following:
PatientId
- a patient’s identifier in the original database,ObscureLevel
- a percentage of a patient’s attributes with missing data,ObscureRepeat
- a number of the iteration for a given ObscureLevel
(only valid in simulation phase),MalignancyCharacter
- an actual diagnosis for a given patient.Remaining columns contain lower and upper bounds of a given uncertaintified diagnostic model.
A sample overview of the datasets is following:
head(ds.training, n=3)
PatientId ObscureLevel ObscureRepeat MalignancyCharacter LR1.min LR1.max LR2.min LR2.max SM.min
1 Sms409 0 1 0 0.3256547 0.3256547 0.3199750 0.3199750 0.125
2 Sms413 0 1 0 0.3545262 0.3545262 0.3353151 0.3353151 0.125
3 Sms087 0 1 0 0.4717678 0.4717678 0.4552104 0.4552104 0.375
SM.max Tim.min Tim.max Alc.min Alc.max RMI.min RMI.max
1 0.125 0.04860564 0.04860564 0 0 0.119275 0.119275
2 0.125 0.03753158 0.03753158 0 0 0.000000 0.000000
3 0.375 0.15890362 0.15890362 0 0 0.000000 0.000000
tail(ds.test, n=3)
PatientId ObscureLevel ObscureRepeat MalignancyCharacter LR1.min LR1.max LR2.min LR2.max SM.min
173 Sz421 0.15 1 1 0.7101402 0.7101402 0.82713077 0.8271308 0.7000
174 Sz443 0.10 1 1 0.4557590 0.4557590 0.40919018 0.4091902 0.5500
175 Sz304 0.35 1 0 0.2008450 0.6862746 0.08283644 0.6065683 0.1875
SM.max Tim.min Tim.max Alc.min Alc.max RMI.min RMI.max
173 0.850 0.1196401 0.7786164 0.3333333 0.5000000 0.00000 1.0000000
174 0.650 0.0394265 0.3611866 0.0000000 0.1666667 0.09675 0.0967500
175 0.375 0.0000000 0.7271850 0.0000000 0.5000000 0.00000 0.5348837
Whole procedure is very time-consuming. In order to get results in reasonable time, we make use of parallel calculations. In this step we prepare the environment to be capable of such calculations (export variables to workers, etc.). All time-consuming calculations are done by use of usedLapply
function (either single-threaded or parallel).
printDebug("paralell init")
if (THREADS > 1)
{
CL = makeCluster(THREADS, outfile="")
clusterExport(CL, c("SEED", "OBSUCRE.REPEAT"))
clusterCall(CL, function(){ set.seed(SEED) })
clusterExport(cl=CL, list('AGGREGATORS', 'AGGREGATORS.NAME', 'METHODS',
'METHODS.NAME', 'ds.training', 'ds.test',
'diagnosisToOutcome', 'CUTOFF.CRISP', 'W.AUC',
'COMMON.PART', 'INTERVAL.INTERSECTION'))
usedLapply = function(...){ parLapply(CL, ...) }
} else {
usedLapply = lapply
}
In this step ds.training
dataset is processed.
For the reference purposes we check how the diagnostic models, both original and uncertaintified, perform on the simulated training set.
Firstly, for each patient we calculate outputs of each single diagnostic model. This is achieved by serveral consequtive steps:
*.models.outcomes
for each row (patient) returns results of each 6 diagnostic model; if an upper bound is less than \(0.5\), then a model predicts a benign tumor; if lowe bound is greater or equal to \(0.5\), then a model predicts a malignant tumor; for each particular case an output can be one of the mentioned above outcomes (TP
, TN
, etc.).aggregate.outcomes
combines the results of *.models.outcomes
, namely for each obscuration level and obscuration repeat it gathers outcomes of each model,calculate.stats
takes the results of aggregate.outcomes
and calculates performance statistics with respect to the obscuration level and models.Afterwards the results are converted into a long format by use of melt
function. Finally, a column is renamed to be more descriptive and three columns are added to latterly identify the diagnostic models from aggregation operators.
The following code produces results for the original models:
printDebug("training statistics models original")
training.stats.models.orig = melt(
calculate.stats(
aggregate.outcomes(
orig.models.outcomes(ds.training)
)
),
id.vars=c("ObscureLevel"),
variable.name="Method",
value.name="Value") %>%
rename(Measure=L1) %>%
mutate(Class="Model", Subclass="Original", Subsubclass=NA)
training.stats.models.orig$Method = paste("orig.", training.stats.models.orig$Method)
The following code produces results for the uncertaintified models:
printDebug("training statistics models uncertaintified")
training.stats.models.uncer = melt(
calculate.stats(
aggregate.outcomes(
unc.models.outcomes(ds.training)
)
),
id.vars=c("ObscureLevel"),
variable.name="Method",
value.name="Value") %>%
rename(Measure=L1) %>%
mutate(Class="Model", Subclass="Uncertaintified", Subsubclass=NA)
training.stats.models.uncer$Method = paste("unc.", training.stats.models.uncer$Method)
Secondly, similar procedure is performed for the aggregation operators:
outcomes.aggrs
data frame is constructed by the application of each aggregation operator to the simulation data; diags
is a vector of zeros and ones, which are converted to the mentioned above six possible outcomes (TP
, TN
, etc.),aggregate.outcomes
combines the results of outcomes.aggrs
,calculate.stats
takes the results of aggregate.outcomes
and calculates performance statistics with respect to the obscuration level and aggregation operators.Afterwards the results are converted into a long format by use of melt
function. Finally, a column is renamed to be more descriptive and three columns are added to latterly identify the aggregation operators.
printDebug("training statistics aggregators")
outcomes.aggrs = usedLapply(1:length(AGGREGATORS), function(i){
aggr = AGGREGATORS[[i]]
# selection of appropriate columns
diags = apply(ds.training[, 5:(5+length(METHODS)*2-1)], 1, function(row) {
# matrix in format required by aggregation method is created and passed into aggr
return(aggr(matrix(row, nrow=2)))
})
converted = apply(cbind(diags, ds.training$MalignancyCharacter),
1, diagnosisToOutcome)
return(converted)
})
outcomes.aggrs = data.frame(ds.training[, 1:3], outcomes.aggrs)
names(outcomes.aggrs) = c(names(ds.training)[1:3], AGGREGATORS.NAME)
training.stats.aggrs = melt(calculate.stats(
aggregate.outcomes(outcomes.aggrs)
),
id.vars = "ObscureLevel" ,
variable.name = "Method",
value.name = "Value") %>%
rename(Measure=L1)
training.stats.aggrs = suppressWarnings( # suppress different factor levels warning
left_join(training.stats.aggrs,
AGGREGATORS.BINDED.DESCRIPTION,
by="Method")
)
The simulation results for the models and aggregation operators are combined into one data frame.
printDebug("training statistics bind")
training.stats.all = bind_rows(training.stats.models.orig,
training.stats.models.uncer,
training.stats.aggrs)
For a given statistic, the performance of each model and aggregation operator is calculated as a mean value of the statistic over all considered levels of missing data.
Finally, three columns are added to identify the models and aggregation operators.
printDebug("training statistics performance calculation")
training.stats.all.perf = aggregate(training.stats.all$Value,
list(Method=training.stats.all$Method,
Measure=training.stats.all$Measure),
mean) %>%
rename(Value=x)
training.stats.all.perf = left_join(training.stats.all.perf,
distinct(select(training.stats.all,
Method, Class, Subclass, Subsubclass)),
by="Method")
The aggregation operators and thresholding strategies are optimised with regards to numerical parameters. As the last step of the training phase, all statistics for the models and aggregation strategies are combined.
printDebug("select optimized aggregators")
optimizedAggregatorsNames = getOptimizedAggregators(training.stats.all.perf, PERFORMANCE.MEASURE)
training.stats.all.perf = subset(training.stats.all.perf,
Method %in% c(optimizedAggregatorsNames,
unique(as.character(training.stats.models.orig$Method)),
unique(as.character(training.stats.models.uncer$Method))))
In this step ds.test
datasets is processed.
After optimisation of the aggregation strategies on the training data, we will evaluate the models and the aggregation operators on the data where patients’ features are missing due to decisions of a physician.
The procedure of performance calculation is very similar, hence the same functions will be used. Because of relatively small ds.test
dataset, this time we will not distinguish different obscuration level - we will calculate statistics over all cases. To perform this operation, we set the same missing data level for all patients (this step is made to achieve data adjustment to the interface of the functions and it has no impact on final results).
printDebug("test combine obscuration levels")
ds.test$ObscureLevel = 0
For the reference purposes we check how the diagnostic models, both original and uncertaintified, perform on the test set.
As in the training phase, for each patient we calculate outputs of each single diagnostic model. This is achieved by serveral consequtive steps:
*.models.outcomes
for each row (patient) returns results of each 6 diagnostic model; this can be one of the mentioned above outcomes (TP
, TN
, etc.).aggregate.outcomes
combines the results of *.models.outcomes
,calculate.stats
takes the results of aggregate.outcomes
and calculates performance statistics with respect to the models.Afterwards the results are converted into a long format by use of melt
function. A column is renamed to be more descriptive and three columns are added to identify the diagnostic models from aggregation operators. Finally, the ObscureLevel
column is removed, since it is not used in the evaluation step.
The following code produces results for the original models:
printDebug("test statistics models original")
outcomes.orig.models = orig.models.outcomes(ds.test)
test.stats.orig.models = melt(
calculate.stats(
aggregate.outcomes(
outcomes.orig.models
)
),
id.vars="ObscureLevel",
variable.name="Method",
value.name="Value") %>%
rename(Measure=L1) %>%
mutate(Class="Model", Subclass="Original", Subsubclass=NA) %>%
select(-ObscureLevel)
test.stats.orig.models$Method = paste("orig.", test.stats.orig.models$Method)
colnames(outcomes.orig.models)[4:(4+length(METHODS)-1)] =
paste("orig.", colnames(outcomes.orig.models))[4:(4+length(METHODS)-1)]
boot.test.stats.orig.models = boot(outcomes.orig.models, bootStat, BOOTSTRAP_R,
strata=ds.test$MalignancyCharacter)
test.stats.orig.models$CI = sapply(1:nrow(test.stats.orig.models),
function(i){getBootCI(boot.test.stats.orig.models, i)})
The following code produces results for the uncertaintified models:
printDebug("test statistics models uncertaintified")
outcomes.models.unc = unc.models.outcomes(ds.test)
test.stats.uncer.models = melt(
calculate.stats(
aggregate.outcomes(
outcomes.models.unc
)
),
id.vars="ObscureLevel",
variable.name="Method",
value.name="Value") %>%
rename(Measure=L1) %>%
mutate(Class="Model", Subclass="Uncertaintified", Subsubclass=NA) %>%
select(-ObscureLevel)
test.stats.uncer.models$Method = paste("unc.", test.stats.uncer.models$Method)
colnames(outcomes.models.unc)[4:(4+length(METHODS)-1)] =
paste("unc.", colnames(outcomes.models.unc))[4:(4+length(METHODS)-1)]
boot.test.stats.uncer.models = boot(outcomes.models.unc, bootStat, BOOTSTRAP_R,
strata=ds.test$MalignancyCharacter)
test.stats.uncer.models$CI = sapply(1:nrow(test.stats.uncer.models),
function(i){getBootCI(boot.test.stats.uncer.models, i)})
Both steps perform stratified bootstraping to estimate performance uncertainty.
Similar procedure is performed for the aggregation strategies:
outcomes.aggrs
data frame is constructed by the application of each aggregation operator to the simulation data; diags
is a vector of zeros and ones, which are converted to the mentioned above six possible outcomes (TP
, TN
, etc.),aggregate.outcomes
combines the results of outcomes.aggrs
,calculate.stats
takes the results of aggregate.outcomes
and calculates performance statistics with respect to the aggregation operators.Afterwards the results are converted into a long format by use of melt
function. A column is renamed to be more descriptive and the ObscureLevel
column is removed, since it is not used in the evaluation step. Finally, three columns are added to identify the aggregation operators.
printDebug("test statistics aggregators")
aggregators.from.training = unique(subset(training.stats.all.perf, Class=="Aggregation")$Method)
if (THREADS > 1)
clusterExport(CL, c("aggregators.from.training"))
outcomes.aggrs = usedLapply(1:length(aggregators.from.training), function(j){
i = which(aggregators.from.training[j] == AGGREGATORS.NAME)
aggr = AGGREGATORS[[i]]
# selection of appropriate columns
diags = apply(ds.test[, 5:(5+length(METHODS)*2-1)], 1, function(row) {
# matrix in format required by aggregation method is created and passed into aggr
return(aggr(matrix(row, nrow=2)))
})
converted = apply(cbind(diags, ds.test$MalignancyCharacter),
1, diagnosisToOutcome)
return(converted)
})
outcomes.aggrs = data.frame(ds.test[, 1:3], outcomes.aggrs)
names(outcomes.aggrs) = c(names(ds.test)[1:3], aggregators.from.training)
test.stats.aggrs = melt(calculate.stats(
aggregate.outcomes(outcomes.aggrs)
),
id.vars = "ObscureLevel" ,
variable.name = "Method",
value.name = "Value") %>%
rename(Measure=L1) %>%
select(-ObscureLevel)
test.stats.aggrs = suppressWarnings( # suppress different factor levels warning
left_join(test.stats.aggrs,
AGGREGATORS.BINDED.DESCRIPTION,
by="Method")
)
boot.test.stats.aggrs = boot(outcomes.aggrs, bootStat, BOOTSTRAP_R,
strata=ds.test$MalignancyCharacter)
test.stats.aggrs$CI = sapply(1:nrow(test.stats.aggrs),
function(i){getBootCI(boot.test.stats.aggrs, i)})
This step also performs stratified bootstraping to estimate performance uncertainty.
Eventually, we combine the statistics of the models and aggregation strategies obtained on the test set.
printDebug("test statistics bind")
test.stats.all = bind_rows(test.stats.orig.models,
test.stats.uncer.models,
test.stats.aggrs)
To achieve better outlook on the results, several statistics are bound.
To statistics obtained on the test set we add those achieved on the training set.
printDebug("test statistics performance bind with training")
binded.stats.all.perf = left_join(select(training.stats.all.perf, Method, Measure, Value),
test.stats.all,
by=c("Method", "Measure")) %>%
rename(Value.training=Value.x, Value.test=Value.y)
Secondly, we convert statistics from long to wide format.
printDebug("convert statistics performance to wide format")
training.stats.all.perf.wide = dcast(training.stats.all.perf,
Method + Class + Subclass + Subsubclass ~ Measure,
value.var="Value")
test.stats.all.wide = dcast(test.stats.all,
Method + Class + Subclass + Subsubclass ~ Measure,
value.var="Value")
As the last step of the test phase, we select a collection of promising aggregation strategies which might be suitable to the medical criteria and implemented for further performance validation. We also perform the McNemar’s test among the selected aggregation strategies and with relation to the uncertaintified models.
printDebug("aggregators selection and statistical tests")
selected.aggrs = subset(test.stats.all.wide,
Class=="Aggregation" &
Decisiveness>=0.95 &
Decisiveness<1.0 &
Sensitivity>Specificity &
Sensitivity>=0.90 &
Specificity>0.8)
perf.selected.aggrs = subset(binded.stats.all.perf,
Measure==PERFORMANCE.MEASURE &
Method %in% selected.aggrs$Method)
perf.all = rbind(perf.selected.aggrs,
subset(binded.stats.all.perf,
Measure==PERFORMANCE.MEASURE & Class=="Model" & Subclass=="Uncertaintified"))
outcomes.all = bind_cols(select(outcomes.models.unc, -(PatientId:ObscureRepeat)),
select(outcomes.aggrs, -(PatientId:ObscureRepeat)))
pvals = sapply(subset(perf.all, Class=="Aggregation")$Method, function(a1) {
sapply(perf.all$Method, function(a2) {
mcn.test(outcomes.all[a1], outcomes.all[a2])
})
})
pvals[upper.tri(pvals, diag=TRUE)] = NA
pvals = matrix(p.adjust(pvals, method = "BH",
n=sum(!is.na(pvals)|is.nan(pvals))),
nrow=nrow(pvals),
dimnames=list(rownames(pvals), colnames(pvals)))
When all calculations are done, we can close all connections to parallel workers.
printDebug("parallel shutdown")
if (THREADS > 1)
stopCluster(CL)
Lastly, we save all results (and envirionment) to a binary R file.
printDebug("save evaluation")
save.image(EVALUATION.OUTPUT.LOCATION)