Document generation date: 2015-09-18 12:04:47
This document presents the process of obtaining analytic data used for training and evaluation of effectivness of different aggregation operators and thresholding strategies in medical diagnosis support under data incompleteness.
Suppose we have a patient with a diagnosed ovarian tumor. Such a patient can be described by a set of features. The physicians have developed many predictive models, which on the base of the features can predict whether the tumor is malignant or benign. The models assume working on a complete set of the features but frequently some of them are missing. A novel approach is to adjust the models to work with incomplete data and summarize the results from different models. We want to study the performance of different aggregation strategies of such models under different level of data incompleteness. In this document we present a process of obtaining such data for further evaluation.
The outline of the process is as follow. In the first phase, we sample patients from two groups of malignancy. Then we randomly obscure (remove) values of their features. Afterwards we compute the results of the adjusted models on the obscured data. We repeat whole procedure many times for each level of data incompleteness. This training set will be used to optimize aggregation operators and thresholding strategies. In the second phase, we compute the results of the adjusted models on the dataset, where incompleteness is an effect of a physician’s decision. This test set will be used to evaluate of the optimized aggregation stretegies.
For sake of readibility, some code is externalized into separate R files.
rm(list=ls())
source("config.R")
source("methods.R")
source("utils.R")
library(dplyr)
library(sampling)
An initial setup for the process is stored in config.R
file. Besides setting a seed for random procedures (SEED
constant) and setting constants for input/output files and directories (*.DIR
/*.FILE
/*.LOCATION
constants), there are several values which are important for the building process.
An input database will be splitted into two parts: the first will be used for training and the second will be used for testing. The first will be size of TRAINING.SIZE
= \(200\) patients.
The PROBE.SIZE.NEGATIVE
= \(75\) and PROBE.SIZE.POSITIVE
= \(75\) constants are the sizes of samples of patients with benign and malignant tumors, respectively. The patients’ attributes for this samples will be obscured during the construcion of the training set.
The OBSCURE.PERCENTAGES
vector consists of considered levels of obscuration of the attributes in the simulation step. The values vary from \(0\) to \(0.5\).
The OBSCURE.MAX
= \(0.5\) constant indicates the maximum level of obscuration in the simulation step and maximum level of data incompleteness in the evaluation set.
The OBSUCRE.REPEAT
= \(1000\) constant is a number of repeats of random attributes obscuration in the simulation step.
All uncertaintyfied diagnostic models are implemented in methods.R
file. There are \(6\) models stored in METHODS
vector. Each of them for any patient (regardless of missing data) produces an interval contained in \([0,1]\). This interval should be interpreted as a set of all possible diagnoses which can be returned by this model for a given patient. Due to missing data in patient’s description, the length of the interval grows. Single point interval indicates no missing data in the patient’s description. Values below \(0.5\) indicate that a tumor is benign; values above or equal to \(0.5\) indicate that a tumor is malignant.
The METHODS.NAME
vector consists of short printable names for the models. The METHODS.COL
vector contains names of patients’ attributes used by a model.
The COLS.ALL
vector is an unique set of all patients’ attributes used by the models. The COLS.SURE
vector has names of the attributes which are always available and will not be obscured in the simulation step. The COLS.OBSC
has names of the attributes which for a given patient might be incomplete; these attributes will be obscured during the construcion of the training set.
The COLS.SURE
vector looks as follows:
print(COLS.SURE)
[1] "HormoneReplacementTherapy" "Age" "PainAtExamination"
[4] "AgeAfterMenopause" "UterusRemoved"
Debugging functions are implemented in utils.R
file.
The following subsections describes the process of loading the data with patients, preprocessing and division into the training and test sets.
The original patient database is stored in a CSV file. Due to legal restrictions, this database cannot be published. The main objective of this document is to document the process of building the analytic dataset from this undisclosed database.
db = read.csv(DATABASE.LOCATION, header=TRUE)
db = db %>%
select(GivenId, MalignancyCharacter, one_of(COLS.ALL)) %>%
rename(PatientId=GivenId)
The initial database contains \(27\) variables and \(441\) cases. The basic structure of this database is following (observe that only important columns for this research are presented):
dbSummary = t(sapply(select(db, -PatientId),
function(x){c(
min=min(x, na.rm=T),
median=median(x, na.rm=T),
max=max(x, na.rm=T),
NAs=sum(is.na(x))
)}))
print(round(dbSummary, 2))
min median max NAs
MalignancyCharacter 0.00 0.00 2.0 12
OvarianCancerInFamily 0.00 0.00 1.0 88
HormoneReplacementTherapy 0.00 0.00 1.0 26
Age 13.00 48.00 84.0 0
ADimension 10.00 79.00 280.0 13
PainAtExamination 0.00 0.00 1.0 0
Ascites 0.00 0.00 1.0 21
PapBloodFlow 0.00 0.00 1.0 201
Solid 0.00 0.00 1.0 92
ASolidDimension 0.00 20.00 300.0 101
InnerWall 0.00 1.00 1.0 0
Shadow 0.00 0.00 1.0 114
Color 1.00 2.00 4.0 103
Septum 0.00 1.00 1.0 37
SmEchogenicity 0.00 3.00 4.0 36
Location 0.00 0.00 2.0 35
SmInnerWallThickness 0.00 1.00 2.0 41
TumorVolume 0.52 142.44 4186.4 23
Pap 0.00 1.00 1.0 51
APapDimension 0.00 2.00 51.0 177
SeptumThickness 0.00 2.00 11.0 153
AgeAfterMenopause 0.00 0.00 38.0 0
Ca125 0.53 39.53 6326.0 68
Ri 0.00 0.56 1.0 249
UterusRemoved 0.00 0.00 1.0 24
IotaQuality 1.00 4.00 6.0 28
The MalignancyCharacter
feature indicates whether a patient has a benign (\(0\)), malignant (\(1\)) or borderline (\(2\)) tumor. A few attributes mentioned in COLS.SURE
have missing values - this is caused by the fact that at the time of data collection physician did not know that such information should be obtained from patients.
The sonographic attributes are described in the paper: D. Timmerman, L. Valentin, et al (2000). Terms, definitions and measurements to describe the sonographic features of adnexal tumors: a consensus opinion from the International Ovarian Tumor Analysis (IOTA) Group. Ultrasound in Obstetrics & Gynecology, 16(5), 500-505.
In the first step we combine malignant with borderline tumors.
Secondly, the database does not contain some values which can be inferred from other ones. An example of such value is a dimension of papilary projections (APapDimension
), which is missing in case of no papilary projections in the tumor (Pap==0
). Such values are inserted in database.
Missing Ri
values are inserted either when a physician decided not to perform this examination due to its obvious result or in case of avascular endometriomas.
db[with(db, which(MalignancyCharacter == 2)), "MalignancyCharacter"] = 1
db[with(db, which(is.na(PapBloodFlow) & Pap == 0)), "PapBloodFlow"] = 0
db[with(db, which(is.na(APapDimension) & Pap == 0)), "APapDimension"] = 0
db[with(db, which(is.na(SeptumThickness) & Septum == 0)), "SeptumThickness"] = 0
db[with(db, which(grepl("^Sms", PatientId) & is.na(Ri))), "Ri"] = 1
db[with(db, which(grepl("^Sz", PatientId) & is.na(Ri) & Color == 1)), "Ri"] = 1
The database is splitted into two sets:
db.training
- it consists of \(200\) patients with the complete attributes; this data will be used to construct the training set,db.test
- it consists of patients with no more than \(50\%\) incomplete attribues and a few patients with complete attributes; this data will be used to construct the test set.The choose of the patients with complete attributes is done by use of a stratified simple random sampling without replacement.
db.completeCases = filter(db, complete.cases(db))
db.incompleteCases = filter(db, !complete.cases(db) &
complete.cases(db[, c("MalignancyCharacter", COLS.SURE)]))
inTrainingSet.benignSize = round(nrow(filter(db.completeCases, MalignancyCharacter==0)) *
TRAINING.SIZE/nrow(db.completeCases))
inTrainingSet.malignantSize = round(nrow(filter(db.completeCases, MalignancyCharacter==1)) *
TRAINING.SIZE/nrow(db.completeCases))
db.completeCases = arrange(db.completeCases, MalignancyCharacter) # sort before strata
inTrainingSet = strata(db.completeCases, "MalignancyCharacter",
c(inTrainingSet.benignSize, inTrainingSet.malignantSize),
"srswor")$ID_unit
inTestSet = apply(select(db.incompleteCases, one_of(COLS.OBSC)), 1,
function(row) {sum(is.na(row))/length(row) <= OBSCURE.MAX})
db.training = db.completeCases[inTrainingSet, ]
db.test = bind_rows(db.completeCases[-inTrainingSet, ], # add a few remaining compl. cases
db.incompleteCases[inTestSet, ])
The following figure shows a distribution of patients with regard to the percentage of missing values in the dataset:
The resulting db.training
has \(109\) benign and \(91\) malignant cases.
fdSummary = t(sapply(select(db.training, -PatientId),
function(x){c(
min=min(x, na.rm=T),
median=median(x, na.rm=T),
max=max(x, na.rm=T),
NAs=sum(is.na(x))
)}))
print(round(fdSummary, 2))
min median max NAs
MalignancyCharacter 0.00 0.00 1.00 0
OvarianCancerInFamily 0.00 0.00 1.00 0
HormoneReplacementTherapy 0.00 0.00 1.00 0
Age 13.00 47.00 84.00 0
ADimension 25.00 83.00 200.00 0
PainAtExamination 0.00 0.00 1.00 0
Ascites 0.00 0.00 1.00 0
PapBloodFlow 0.00 0.00 1.00 0
Solid 0.00 0.00 1.00 0
ASolidDimension 0.00 16.50 150.00 0
InnerWall 0.00 1.00 1.00 0
Shadow 0.00 0.00 1.00 0
Color 1.00 3.00 4.00 0
Septum 0.00 1.00 1.00 0
SmEchogenicity 0.00 3.00 4.00 0
Location 0.00 0.00 2.00 0
SmInnerWallThickness 0.00 1.00 2.00 0
TumorVolume 4.40 205.19 4186.40 0
Pap 0.00 1.00 1.00 0
APapDimension 0.00 1.00 51.00 0
SeptumThickness 0.00 2.00 8.00 0
AgeAfterMenopause 0.00 0.00 33.00 0
Ca125 0.53 57.84 4909.19 0
Ri 0.21 0.65 1.00 0
UterusRemoved 0.00 0.00 1.00 0
IotaQuality 1.00 4.00 6.00 0
The resulting db.test
has \(123\) benign and \(52\) malignant cases.
odSummary = t(sapply(select(db.test, -PatientId),
function(x){c(
min=min(x, na.rm=T),
median=median(x, na.rm=T),
max=max(x, na.rm=T),
NAs=sum(is.na(x))
)}))
print(round(odSummary, 2))
min median max NAs
MalignancyCharacter 0.0 0.00 1.0 0
OvarianCancerInFamily 0.0 0.00 1.0 39
HormoneReplacementTherapy 0.0 0.00 1.0 0
Age 14.0 44.00 82.0 0
ADimension 28.0 70.00 230.0 0
PainAtExamination 0.0 0.00 1.0 0
Ascites 0.0 0.00 1.0 1
PapBloodFlow 0.0 0.00 1.0 65
Solid 0.0 0.00 1.0 49
ASolidDimension 0.0 20.00 300.0 53
InnerWall 0.0 1.00 1.0 0
Shadow 0.0 0.00 1.0 58
Color 1.0 1.00 4.0 56
Septum 0.0 0.00 1.0 3
SmEchogenicity 0.0 3.00 4.0 3
Location 0.0 0.00 1.0 2
SmInnerWallThickness 0.0 0.50 2.0 5
TumorVolume 4.1 121.60 4186.4 0
Pap 0.0 1.00 1.0 11
APapDimension 0.0 0.00 51.0 26
SeptumThickness 0.0 0.00 11.0 11
AgeAfterMenopause 0.0 0.00 38.0 0
Ca125 4.3 29.35 3126.0 42
Ri 0.0 1.00 1.0 56
UterusRemoved 0.0 0.00 1.0 0
IotaQuality 1.0 3.00 6.0 2
The following figure shows a distribution of classes in the training and test sets:
Because we can not share the initial database, we will create and provide simplified preprocessed datasets. From this moment, by the training and test sets we mean the calculated outputs of the uncertaintified diagnostic models.
The same structure is used for both the training and test sets (training.data
and test.data
data frames, respectively). First \(4\) columns of a data frame indicate the following:
PatientId
- a patient’s identifier in the original database,ObscureLevel
- a percentage of a patient’s attributes with missing data,ObscureRepeat
- a number of the iteration for a given ObscureLevel
(in the simulation step),MalignancyCharacter
- an actual diagnosis for a given patient.Next 12 columns contain lower and upper bounds of a given uncertaintified diagnostic model.
The code below fills the training.data
data frame with values obtained by the application of the diagnostic models to the obscured patient data. For each obscuration level and each repeat, a subset of the database is randomly obscured. Then the output is calculated for the uncertaintified diagnostic models. The results are stored sequentialy in the data frame.
Observe that dataset contain also data without obscuration (ObscureLevel==0
for the top rows). Although some missing values can be filled like in Preprocessing step, for the simplicity and clarity of the simulation process we leave them as they are.
training.data = data.frame()
for (obsc.lvl in OBSCURE.PERCENTAGES)
{
for (obsc.rep in 1:OBSUCRE.REPEAT)
{
printDebug(paste(obsc.lvl, obsc.rep))
# sample the database to obtain balanced dataset
db.training.sample = bind_rows(sample_n(filter(db.training, MalignancyCharacter==0),
PROBE.SIZE.NEGATIVE),
sample_n(filter(db.training, MalignancyCharacter==1),
PROBE.SIZE.POSITIVE))
if (obsc.lvl > 0)
{
naMat = matrix(0, nrow=nrow(db.training.sample), ncol=length(COLS.OBSC))
# put NA values into the matrix to obtain required percentage
# of missing values
naMat[sample(1:length(naMat), floor(obsc.lvl*length(naMat)))] = NA
# set NA value in random places in patient database
# (NA added to any value results in NA)
db.training.sample[, COLS.OBSC] = db.training.sample[, COLS.OBSC] + naMat
}
results = do.call(cbind,
sapply(seq_along(METHODS),
function(m)
{
method = METHODS[[m]]
name = METHODS.NAME[[m]]
cols = METHODS.COL[[m]]
t(apply(db.training.sample[cols], 1, method))
},
simplify=FALSE))
training.data = bind_rows(training.data,
with(db.training.sample,
data.frame(PatientId,
ObscureLevel=obsc.lvl,
ObscureRepeat=obsc.rep,
MalignancyCharacter,
results)))
}
}
The sample structure of the resulting simulated dataset for the uncertaintified diagnostic models is following:
head(as.data.frame(training.data), n=3)
PatientId ObscureLevel ObscureRepeat MalignancyCharacter LR1.min LR1.max LR2.min LR2.max SM.min
1 Sms409 0 1 0 0.3256547 0.3256547 0.3199750 0.3199750 0.125
2 Sms413 0 1 0 0.3545262 0.3545262 0.3353151 0.3353151 0.125
3 Sms087 0 1 0 0.4717678 0.4717678 0.4552104 0.4552104 0.375
SM.max Tim.min Tim.max Alc.min Alc.max RMI.min RMI.max
1 0.125 0.04860564 0.04860564 0 0 0.119275 0.119275
2 0.125 0.03753158 0.03753158 0 0 0.000000 0.000000
3 0.375 0.15890362 0.15890362 0 0 0.000000 0.000000
Similarly, the code below fills the test.data
data frame with values obtained by the application of the diagnostic models to the patients with the incomplete features. In this case lack of data is an effect of a physician’s decision.
ObscureLevel
is calculated as the real percentage of missing data. ObscureRepeat
is fixed and set to \(1\).
results = do.call(cbind,
sapply(seq_along(METHODS),
function(m)
{
method = METHODS[[m]]
name = METHODS.NAME[[m]]
cols = METHODS.COL[[m]]
t(apply(db.test[cols], 1, method))
},
simplify=FALSE))
obscLevels = apply(select(db.test, one_of(COLS.OBSC)), 1,
function(row){round(sum(is.na(row))/length(row), 2)})
test.data = with(db.test, data.frame(PatientId,
ObscureLevel=obscLevels,
ObscureRepeat=1,
MalignancyCharacter,
results))
The sample structure of the resulting evaluation dataset for the uncertaintified diagnostic models is following:
tail(test.data, n=3)
PatientId ObscureLevel ObscureRepeat MalignancyCharacter LR1.min LR1.max LR2.min LR2.max SM.min
173 Sz421 0.15 1 1 0.7101402 0.7101402 0.82713077 0.8271308 0.7000
174 Sz443 0.10 1 1 0.4557590 0.4557590 0.40919018 0.4091902 0.5500
175 Sz304 0.35 1 0 0.2008450 0.6862746 0.08283644 0.6065683 0.1875
SM.max Tim.min Tim.max Alc.min Alc.max RMI.min RMI.max
173 0.850 0.1196401 0.7786164 0.3333333 0.5000000 0.00000 1.0000000
174 0.650 0.0394265 0.3611866 0.0000000 0.1666667 0.09675 0.0967500
175 0.375 0.0000000 0.7271850 0.0000000 0.5000000 0.00000 0.5348837
The constructed datasets are stored as separate CSV files.
write.csv(training.data, TRAINING.LOCATION, row.names=FALSE)
write.csv(test.data, TEST.LOCATION, row.names=FALSE)