Introduction
This page describes the files that make up the research implementation for supporting medical diagnosis under incomplete data. The approach combines interval modeling of incomplete data, uncertaintification of classical models, and aggregation of incomplete results. The approach is evaluated on medical data for ovarian tumor diagnosis, where missing data is a common problem.
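To make the three ingredients more concrete, here is a minimal sketch in R. It is not the implementation used in this research: the feature domain [0, 1], the linear scoring models, their weights, and all function names are assumptions chosen only for illustration.

```r
# Illustrative only: names, feature domains, and weights are hypothetical.

# Interval modeling: a known value x becomes [x, x]; a missing value (NA)
# becomes the whole assumed domain of the feature.
to_interval <- function(x, domain_min, domain_max) {
  lower <- ifelse(is.na(x), domain_min, x)
  upper <- ifelse(is.na(x), domain_max, x)
  cbind(lower = lower, upper = upper)
}

# "Uncertaintified" linear score: with non-negative weights the score is
# monotone in each feature, so its bounds are reached at the interval bounds.
interval_score <- function(intervals, weights) {
  stopifnot(all(weights >= 0), nrow(intervals) == length(weights))
  c(lower = sum(weights * intervals[, "lower"]),
    upper = sum(weights * intervals[, "upper"]))
}

# Aggregation: combine several interval results by averaging their endpoints.
aggregate_intervals <- function(results) {
  m <- do.call(rbind, results)
  c(lower = mean(m[, "lower"]), upper = mean(m[, "upper"]))
}

# One patient, three features scaled to [0, 1]; the second one is missing.
patient <- c(0.2, NA, 0.9)
iv <- to_interval(patient, domain_min = 0, domain_max = 1)

score_a <- interval_score(iv, weights = c(0.5, 0.3, 0.2))
score_b <- interval_score(iv, weights = c(0.2, 0.2, 0.6))
combined <- aggregate_intervals(list(score_a, score_b))
print(combined)
```

The width of the resulting interval reflects how much the missing feature could change each score; a patient with no missing values would yield intervals of zero width.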
Technical details
All scripts are written in R 3.1.2. The RStudio project is managed by packrat to keep the R package versions compatible. Documents are generated with knitr.
Experiment at a glance
The research consists of 3 steps:
--< datasets/db-2015-04-30.csv
|
|              STEP 1                                  STEP 2                                STEP 3
|
|  ##############################     #########################################     #########################
|  #     make-datasets.Rmd      #     #      training-and-evaluation.Rmd      #     # results-overview.Rmd  #
|  ##############################     #########################################     #########################
|  #                            #     #                                       #     #                       #
----> make-datasets.R           #  ----> training-and-evaluation.R            #  ----> results-overview.R   #
   #  |                         #  |  #  |                                    #  |  #                       #
   #  -> datasets/training.csv  >---  #  -> datasets/evaluation-output.RData  >---  #                       #
   #  -> datasets/test.csv      >---  #                                       #     #                       #
   #                            #     #                                       #     #                       #
   ##############################     #########################################     #########################
      |                                  |                                             |
      -> make-datasets.html              -> training-and-evaluation.html               -> results-overview.html
Downloading the results
To view the outputs of the experiment, run the download-data.R script. It downloads the CSV datasets and the binary RData output: datasets/training.csv, datasets/test.csv, datasets/evaluation-output.Rdata.
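Once the files are in place, they can be inspected from an R session. The sketch below assumes that the working directory is the project root, that download-data.R takes no arguments, and that the CSV files are comma-separated; none of this is confirmed here. The helper read_if_present is hypothetical.

```r
# Assumption: the working directory is the project root and
# download-data.R can simply be sourced.
if (file.exists("download-data.R")) {
  source("download-data.R")
}

# Hypothetical helper: read a CSV only if it is present.
# Assumes comma-separated values with a header row.
read_if_present <- function(path) {
  if (!file.exists(path)) {
    message("Missing file: ", path)
    return(NULL)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

training <- read_if_present("datasets/training.csv")
test     <- read_if_present("datasets/test.csv")

# load() restores the objects stored in the RData file and returns their names.
# The letter case of the extension is written here as in the list above.
rdata_path <- "datasets/evaluation-output.Rdata"
if (file.exists(rdata_path)) {
  restored <- load(rdata_path)
  print(restored)
}
```

Printing the value returned by load() is a quick way to see which objects the evaluation step stored, since their names are not listed on this page.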
Reproducing the research
To prepare the software environment for the experiment, open the ovarian-tumor-aggregation.Rproj file in RStudio. This launches packrat, which downloads the necessary libraries. The installation may take from a few to several minutes.
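If packrat does not start on its own, or if the project is used outside RStudio, the library can usually be restored by hand. This is a sketch under assumptions: it expects the standard packrat layout (a packrat/packrat.lock file in the project root), and the helper has_packrat_lockfile is hypothetical. packrat::status() and packrat::restore() are part of packrat's public API.

```r
# Hypothetical helper: TRUE when `root` looks like a packrat-managed project.
has_packrat_lockfile <- function(root = ".") {
  file.exists(file.path(root, "packrat", "packrat.lock"))
}

cat("Running", R.version.string, "\n")  # the scripts were written for R 3.1.2

if (has_packrat_lockfile() && requireNamespace("packrat", quietly = TRUE)) {
  packrat::status()   # report differences between the lockfile and the library
  packrat::restore()  # install the package versions recorded in the lockfile
} else {
  message("No packrat lockfile here, or packrat is not installed.")
}
```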
Due to legal restrictions, the initial database datasets/db-2015-04-30.csv cannot be published. Therefore, the first step is not reproducible. The remaining steps can be reproduced in two ways (A or B, see the sections below). To reflect the whole experiment, the non-reproducible step is also mentioned.
A. Creating datasets and final results
To create only the datasets and results, which can then be investigated further, execute the following scripts: make-datasets.R (not reproducible), training-and-evaluation.R.
Caution: running training-and-evaluation.R is very time-consuming and resource-intensive. An environment with 32 cores at 2.0 GHz and at least 200 GB of RAM is recommended; in such a setting, the computation should take approximately 18 hours.
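One way to run the two scripts from an R session while guarding against missing inputs is sketched below. The helper run_step is hypothetical, and the sketch assumes that each script can be sourced from the project root without arguments. The input files for each step are taken from the diagram in "Experiment at a glance".

```r
# Hypothetical helper: source a script only if it and its inputs exist.
run_step <- function(script, inputs) {
  required <- c(script, inputs)
  absent <- required[!file.exists(required)]
  if (length(absent) > 0) {
    message("Skipping ", script, "; not found: ", paste(absent, collapse = ", "))
    return(invisible(FALSE))
  }
  started <- Sys.time()
  source(script)
  message(script, " finished in ",
          format(round(difftime(Sys.time(), started, units = "hours"), 2)))
  invisible(TRUE)
}

# Step 1 needs the unpublished database, so it is normally skipped.
run_step("make-datasets.R", inputs = "datasets/db-2015-04-30.csv")

# Step 2 reads the datasets produced by step 1 (or downloaded beforehand).
# This is the long-running computation described in the caution above.
run_step("training-and-evaluation.R",
         inputs = c("datasets/training.csv", "datasets/test.csv"))
```

Checking the inputs first means that a missing file is reported immediately instead of surfacing as an error partway through a long run.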
B. Creating datasets, final results and documents
To create the datasets and the results and additionally generate the documentation (which explains the implementation of the experiment and the results), knit the following .Rmd files with knitr: make-datasets.Rmd (not reproducible), training-and-evaluation.Rmd, results-overview.Rmd.
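The documents can also be produced from the R console instead of the RStudio interface. The sketch below assumes that rmarkdown::render() is an acceptable way to knit the files to HTML; this page only states that the files are launched in knitr. The helper render_if_present is hypothetical.

```r
# Assumption: rmarkdown::render() is used to knit each document to HTML.
rmd_files <- c("make-datasets.Rmd",            # not reproducible without the database
               "training-and-evaluation.Rmd",
               "results-overview.Rmd")

# Hypothetical helper: render a document only if it can be rendered here.
render_if_present <- function(rmd) {
  if (!requireNamespace("rmarkdown", quietly = TRUE) || !file.exists(rmd)) {
    message("Skipping ", rmd)
    return(invisible(NULL))
  }
  rmarkdown::render(rmd, output_format = "html_document")
}

# Skip the first file unless the original database is available.
to_render <- if (file.exists("datasets/db-2015-04-30.csv")) rmd_files else rmd_files[-1]
outputs <- lapply(to_render, render_if_present)
```

Since the diagram shows each .Rmd file wrapping its .R script, knitting training-and-evaluation.Rmd presumably repeats the long computation from option A, so the same hardware caution likely applies.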