SOCR ≫ | BPAD Website ≫ | BPAD GitHub ≫ |
This BPAD/SOCR case-study is focused on understanding clinical data (spreadsheets), imaging data (MRI), and extraction of imaging-derived biomarkers (morphometric measures). The joint clinical and imaging data can be used for prediction, classification, segmentation, modeling, and forecasting of kidney stones and tumors using imaging and clinical data.
This kidney case-study can be used for building some simple models (e.g., correlational or linear-models using the meta-data, or imaging models computing the size, volume, diameter, surface area etc. of the kidney mask volumes, which are small, see attached 3D NII.gz kidney stone segmentation-mask volume).
Technical Details
We’ll start with importing the clinical meta-data, which is available in a .csv file format on Canvas.
Kidney_dataset <- read.csv("https://umich.instructure.com/files/29594707/download?download_frd=1",
header = TRUE)
Below we describe the types of each variable in the clinical meta-data table and provide appropriate feature summaries for these quantitative and qualitative variables.
## Kidney_dataset
##
## 63 Variables 210 Observations
## --------------------------------------------------------------------------------
## case_id
## n missing distinct
## 210 0 210
##
## lowest : case_00000 case_00001 case_00002 case_00003 case_00004
## highest: case_00205 case_00206 case_00207 case_00208 case_00209
## --------------------------------------------------------------------------------
## age_at_nephrectomy
## n missing distinct Info Mean Gmd .05 .10
## 210 0 59 0.999 58.35 15.77 35.00 39.80
## .25 .50 .75 .90 .95
## 50.00 61.00 68.00 74.00 77.55
##
## lowest : 1 11 12 17 21, highest: 82 83 84 86 90
## --------------------------------------------------------------------------------
## gender
## n missing distinct
## 210 0 2
##
## Value female male
## Frequency 87 123
## Proportion 0.414 0.586
## --------------------------------------------------------------------------------
## body_mass_index
## n missing distinct Info Mean Gmd .05 .10
## 210 0 198 1 31.17 7.293 22.05 23.72
## .25 .50 .75 .90 .95
## 26.47 30.07 35.61 40.07 42.87
##
## lowest : 16.2 18.92 19.96 20.13 20.16, highest: 45.6 46.62 48.24 48.42 49.61
## --------------------------------------------------------------------------------
## comorbidities.myocardial_infarction
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 201 9
## Proportion 0.957 0.043
## --------------------------------------------------------------------------------
## comorbidities.congestive_heart_failure
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 204 6
## Proportion 0.971 0.029
## --------------------------------------------------------------------------------
## comorbidities.peripheral_vascular_disease
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 201 9
## Proportion 0.957 0.043
## --------------------------------------------------------------------------------
## comorbidities.cerebrovascular_disease
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 206 4
## Proportion 0.981 0.019
## --------------------------------------------------------------------------------
## comorbidities.dementia
## n missing distinct value
## 210 0 1 false
##
## Value false
## Frequency 210
## Proportion 1
## --------------------------------------------------------------------------------
## comorbidities.copd
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 199 11
## Proportion 0.948 0.052
## --------------------------------------------------------------------------------
## comorbidities.connective_tissue_disease
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 209 1
## Proportion 0.995 0.005
## --------------------------------------------------------------------------------
## comorbidities.peptic_ulcer_disease
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 209 1
## Proportion 0.995 0.005
## --------------------------------------------------------------------------------
## comorbidities.uncomplicated_diabetes_mellitus
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 172 38
## Proportion 0.819 0.181
## --------------------------------------------------------------------------------
## comorbidities.diabetes_mellitus_with_end_organ_damage
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 208 2
## Proportion 0.99 0.01
## --------------------------------------------------------------------------------
## comorbidities.chronic_kidney_disease
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 194 16
## Proportion 0.924 0.076
## --------------------------------------------------------------------------------
## comorbidities.hemiplegia_from_stroke
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 209 1
## Proportion 0.995 0.005
## --------------------------------------------------------------------------------
## comorbidities.leukemia
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 209 1
## Proportion 0.995 0.005
## --------------------------------------------------------------------------------
## comorbidities.malignant_lymphoma
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 208 2
## Proportion 0.99 0.01
## --------------------------------------------------------------------------------
## comorbidities.localized_solid_tumor
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 181 29
## Proportion 0.862 0.138
## --------------------------------------------------------------------------------
## comorbidities.metastatic_solid_tumor
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 198 12
## Proportion 0.943 0.057
## --------------------------------------------------------------------------------
## comorbidities.mild_liver_disease
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 206 4
## Proportion 0.981 0.019
## --------------------------------------------------------------------------------
## comorbidities.moderate_to_severe_liver_disease
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 209 1
## Proportion 0.995 0.005
## --------------------------------------------------------------------------------
## comorbidities.aids
## n missing distinct value
## 210 0 1 false
##
## Value false
## Frequency 210
## Proportion 1
## --------------------------------------------------------------------------------
## smoking_history
## n missing distinct
## 210 0 3
##
## Value current_smoker never_smoked previous_smoker
## Frequency 34 100 76
## Proportion 0.162 0.476 0.362
## --------------------------------------------------------------------------------
## age_when_quit_smoking
## n missing distinct
## 188 22 36
##
## lowest : 22 24 25 26 29
## highest: 60 64 71 73 not_applicable
## --------------------------------------------------------------------------------
## pack_years
## n missing distinct Info Mean Gmd .05 .10
## 143 67 23 0.658 9.909 16.58 0 0
## .25 .50 .75 .90 .95
## 0 0 8 34 45
##
## Value 0.0 4.2 6.3 8.4 14.7 16.8 18.9 23.1 29.4 33.6 35.7
## Frequency 101 4 2 3 2 1 6 2 7 2 1
## Proportion 0.706 0.028 0.014 0.021 0.014 0.007 0.042 0.014 0.049 0.014 0.007
##
## Value 37.8 39.9 44.1 48.3 63.0 73.5 113.4 210.0
## Frequency 1 2 3 2 1 1 1 1
## Proportion 0.007 0.014 0.021 0.014 0.007 0.007 0.007 0.007
##
## For the frequency table, variable is rounded to the nearest 2.1
## --------------------------------------------------------------------------------
## chewing_tobacco_use
## n missing distinct
## 210 0 2
##
## Value never_or_not_in_last_3mo quit_in_last_3mo
## Frequency 209 1
## Proportion 0.995 0.005
## --------------------------------------------------------------------------------
## alcohol_use
## n missing distinct
## 210 0 3
##
## Value more_than_two_daily never_or_not_in_last_3mo
## Frequency 13 95
## Proportion 0.062 0.452
##
## Value two_or_less_daily
## Frequency 102
## Proportion 0.486
## --------------------------------------------------------------------------------
## intraoperative_complications.blood_transfusion
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 207 3
## Proportion 0.986 0.014
## --------------------------------------------------------------------------------
## intraoperative_complications.injury_to_surrounding_organ
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 207 3
## Proportion 0.986 0.014
## --------------------------------------------------------------------------------
## intraoperative_complications.cardiac_event
## n missing distinct value
## 210 0 1 false
##
## Value false
## Frequency 210
## Proportion 1
## --------------------------------------------------------------------------------
## hospitalization
## n missing distinct
## 210 0 15
##
## lowest : 0 1 10 11 2
## highest: 6 7 8 9 died_before_discharge
## --------------------------------------------------------------------------------
## ischemia_time
## n missing distinct
## 200 10 39
##
## lowest : 0 10 11 12 13
## highest: 5 50 54 8 not_applicable
## --------------------------------------------------------------------------------
## radiographic_size
## n missing distinct Info Mean Gmd .05 .10
## 210 0 84 0.999 4.772 3.211 1.678 2.000
## .25 .50 .75 .90 .95
## 2.500 4.000 5.600 9.611 11.455
##
## lowest : 1.2 1.4 1.5 1.6 1.66, highest: 13.2 13.9 14.1 15 16.2
## --------------------------------------------------------------------------------
## pathologic_size
## n missing distinct Info Mean Gmd .05 .10
## 210 0 77 0.999 4.716 3.217 1.500 1.990
## .25 .50 .75 .90 .95
## 2.400 3.900 5.975 9.200 11.065
##
## lowest : 0.8 1 1.2 1.3 1.4 , highest: 13 13.6 15 16 22.5
## --------------------------------------------------------------------------------
## malignant
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 18 192
## Proportion 0.086 0.914
## --------------------------------------------------------------------------------
## pathology_t_stage
## n missing distinct
## 210 0 7
##
## Value 0 1a 1b 2a 2b 3 4
## Frequency 18 92 41 9 3 44 3
## Proportion 0.086 0.438 0.195 0.043 0.014 0.210 0.014
## --------------------------------------------------------------------------------
## pathology_n_stage
## n missing distinct
## 210 0 3
##
## Value 0 1 X
## Frequency 90 6 114
## Proportion 0.429 0.029 0.543
## --------------------------------------------------------------------------------
## pathology_m_stage
## n missing distinct
## 210 0 3
##
## Value 0 1 X
## Frequency 93 17 100
## Proportion 0.443 0.081 0.476
## --------------------------------------------------------------------------------
## tumor_histologic_subtype
## n missing distinct
## 210 0 12
##
## lowest : angiomyolipoma chromophobe clear_cell_papillary_rcc clear_cell_rcc mest
## highest: papillary rcc_unclassified spindle_cell_neoplasm urothelial wilms
## --------------------------------------------------------------------------------
## tumor_necrosis
## n missing distinct
## 187 23 2
##
## Value false true
## Frequency 144 43
## Proportion 0.77 0.23
## --------------------------------------------------------------------------------
## tumor_isup_grade
## n missing distinct Info Mean Gmd
## 172 38 4 0.826 2.238 0.8374
##
## Value 1.00 1.99 2.98 4.00
## Frequency 26 93 39 14
## Proportion 0.151 0.541 0.227 0.081
##
## For the frequency table, variable is rounded to the nearest 0.03
## --------------------------------------------------------------------------------
## clavien_surgical_complications
## n missing distinct
## 209 1 7
##
## Value 0 1 2 3a 3b 4 5
## Frequency 149 27 15 6 7 4 1
## Proportion 0.713 0.129 0.072 0.029 0.033 0.019 0.005
## --------------------------------------------------------------------------------
## er_visit
## n missing distinct
## 208 2 2
##
## Value false true
## Frequency 191 17
## Proportion 0.918 0.082
## --------------------------------------------------------------------------------
## readmission
## n missing distinct
## 208 2 2
##
## Value false true
## Frequency 187 21
## Proportion 0.899 0.101
## --------------------------------------------------------------------------------
## estimated_blood_loss
## n missing distinct Info Mean Gmd .05 .10
## 209 1 43 0.994 314 339.9 25 50
## .25 .50 .75 .90 .95
## 75 200 400 800 1000
##
## lowest : 0 5 10 20 25, highest: 1200 1300 1500 1800 2000
## --------------------------------------------------------------------------------
## surgery_type
## n missing distinct
## 210 0 3
##
## Value laparoscopic open robotic
## Frequency 28 60 122
## Proportion 0.133 0.286 0.581
## --------------------------------------------------------------------------------
## surgical_procedure
## n missing distinct
## 210 0 2
##
## Value partial_nephrectomy radical_nephrectomy
## Frequency 140 70
## Proportion 0.667 0.333
## --------------------------------------------------------------------------------
## surgical_approach
## n missing distinct
## 210 0 2
##
## Value Retroperitoneal Transperitoneal
## Frequency 39 171
## Proportion 0.186 0.814
## --------------------------------------------------------------------------------
## operative_time
## n missing distinct Info Mean Gmd .05 .10
## 208 2 151 1 243 106.6 120.4 139.4
## .25 .50 .75 .90 .95
## 174.8 229.0 283.2 372.3 437.8
##
## lowest : 44 47 64 91 97, highest: 530 553 586 592 613
## --------------------------------------------------------------------------------
## cytoreductive
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 193 17
## Proportion 0.919 0.081
## --------------------------------------------------------------------------------
## positive_resection_margins
## n missing distinct
## 210 0 2
##
## Value false true
## Frequency 199 11
## Proportion 0.948 0.052
## --------------------------------------------------------------------------------
## last_preop_egfr.value
## n missing distinct
## 153 57 49
##
## lowest : >=90 17 38 41 42 , highest: 86 87 88 89 90
## --------------------------------------------------------------------------------
## last_preop_egfr.days_before_nephrectomy
## n missing distinct Info Mean Gmd .05 .10
## 153 57 54 0.998 23.39 23.93 1.0 2.0
## .25 .50 .75 .90 .95
## 6.0 15.0 36.0 58.2 69.0
##
## lowest : 1 2 3 4 5, highest: 75 76 77 83 90
## --------------------------------------------------------------------------------
## first_postop_egfr.value
## n missing distinct
## 157 53 59
##
## lowest : >=90 25 27 28 29 , highest: 88 89 9 90 age<16
## --------------------------------------------------------------------------------
## first_postop_egfr.days_before_nephrectomy
## n missing distinct Info Mean Gmd .05 .10
## 157 53 112 1 437.6 526.5 78.0 82.2
## .25 .50 .75 .90 .95
## 103.0 154.0 500.0 1162.2 2078.4
##
## lowest : 76 77 78 79 80, highest: 2317 2380 2393 2473 2503
## --------------------------------------------------------------------------------
## last_postop_egfr.value
## n missing distinct
## 157 53 60
##
## lowest : >=90 27 28 30 32 , highest: 88 89 9 90 age<16
## --------------------------------------------------------------------------------
## last_postop_egfr.days_before_nephrectomy
## n missing distinct Info Mean Gmd .05 .10
## 157 53 147 1 1048 818.1 149.8 197.6
## .25 .50 .75 .90 .95
## 442.0 932.0 1487.0 2256.8 2426.0
##
## lowest : 76 77 93 100 114, highest: 2499 2503 2584 2604 3070
## --------------------------------------------------------------------------------
## vital_status
## n missing distinct
## 210 0 2
##
## Value censored dead
## Frequency 189 21
## Proportion 0.9 0.1
## --------------------------------------------------------------------------------
## vital_days_after_surgery
## n missing distinct Info Mean Gmd .05 .10
## 210 0 185 1 956.8 875.2 15.45 23.80
## .25 .50 .75 .90 .95
## 271.25 790.50 1427.50 2288.10 2431.60
##
## lowest : 0 2 8 9 11, highest: 2539 2607 2611 2900 3070
## --------------------------------------------------------------------------------
## voxel_spacing.x_spacing
## n missing distinct Info Mean Gmd .05 .10
## 210 0 125 0.999 0.7966 0.1216 0.6565 0.6832
## .25 .50 .75 .90 .95
## 0.7183 0.7812 0.8682 0.9631 0.9766
##
## lowest : 0.4375 0.451172 0.518 0.546875 0.550781
## highest: 0.970703 0.976562 0.976562 0.984 1.041
## --------------------------------------------------------------------------------
## voxel_spacing.y_spacing
## n missing distinct Info Mean Gmd .05 .10
## 210 0 125 0.999 0.7966 0.1216 0.6565 0.6832
## .25 .50 .75 .90 .95
## 0.7183 0.7812 0.8682 0.9631 0.9766
##
## lowest : 0.4375 0.451172 0.518 0.546875 0.550781
## highest: 0.970703 0.976562 0.976562 0.984 1.041
## --------------------------------------------------------------------------------
## voxel_spacing.z_spacing
## n missing distinct Info Mean Gmd .05 .10
## 210 0 10 0.92 3.162 1.93 0.50 0.50
## .25 .50 .75 .90 .95
## 1.25 3.00 5.00 5.00 5.00
##
## Value 0.500 0.995 1.220 1.490 1.985 2.480 2.975 3.740 3.965 5.000
## Frequency 31 18 5 2 4 23 38 1 2 86
## Proportion 0.148 0.086 0.024 0.010 0.019 0.110 0.181 0.005 0.010 0.410
##
## For the frequency table, variable is rounded to the nearest 0.045
## --------------------------------------------------------------------------------
At the initial stages, it may be useful to ignore the comorbidity indicator variables (too many Boolean characteristics of comorbid conditions). A deeper analysis may need to explore these interdependencies.
Next, we perform some simple exploratory data analytics (EDA). These graphs are just examples of univariate and multivariate relations.
radiographic_size
by surgery_type
# pl <- ggplot(Kidney_dataset, aes(x=surgery_type, y=radiographic_size)) +
# geom_violin() + ggtitle("Distribution of radiographic_size by surgery_type") +
# labs(x = "Surgery Type", y = "Tumor Size") + theme(legend.position = "bottom") # +
# # geom_text(stat = 'count', aes(label = ..count..), vjust = 1)
# ggplotly(pl)
plot_ly(Kidney_dataset, x = ~surgery_type, y = ~radiographic_size, split = ~surgery_type,
type = 'violin', box = list(visible = T), meanline = list(visible = T)) %>%
layout(title="Distribution of radiographic_size by surgery_type",
xaxis=list(title="Surgery Type"), yaxis=list(title="Tumor Size"),
legend = list(title="Surgery Type", orientation = 'h'))
vital_days_after_surgery
by surgical_approach
and gender
plot_ly(Kidney_dataset, type = 'violin') %>%
add_trace(x = ~surgical_approach[Kidney_dataset$gender == 'female'],
y = ~vital_days_after_surgery[Kidney_dataset$gender == 'female'],
legendgroup = 'Yes', scalegroup = 'Yes', name = 'Female',
side='negative', box=list(visible=T), meanline=list(visible=T),
color = I("blue"), points = 'all', jitter = 0.5, pointpos = -1) %>%
add_trace(x = ~surgical_approach[Kidney_dataset$gender == 'male'],
y = ~vital_days_after_surgery[Kidney_dataset$gender == 'male'],
legendgroup = 'Yes', scalegroup = 'Yes', name = 'Male',
side='positive', box=list(visible=T), meanline=list(visible=T),
color = I("yellow"), points = 'all', jitter = 0.5, pointpos = +1) %>%
layout(title="Distribution of vital_days_after_surgery by surgical_approach and `ender",
xaxis=list(title="Surgical Approach"), yaxis=list(title="Vital days after surgery"),
legend = list(title="Surgical Approach", orientation = 'h')) #,
vital_days_after_surgery
and
body_mass_index
2D, 3D, and higher dimensional distributions provide more more holistic depiction of the underlying phenomena, see pub 1 and pub 2.
If the marginal components \(\{X_i\}\) are independent, computing the joint multivariate distribution involves simply the multiplication of the individual marginal probability distributions, i.e., \[f_{{\bf{X}}}({\bf{x}})=\prod_{i=1}^n{f_{X_i}(x_i)}\ .\]
When there are inter-dependencies between variables, computing the joint probability distribution \(f_{{\bf{X}}}({\bf{x}})\) from the corresponding marginal univariate distribution models, \(f_{X_i}(x_i)\), requires linking the marginal distributions via connecting copula function models. For a pair of density functions, \(f_X (x)\) and \(f_Y (y)\), and their corresponding cumulative probability distribution functions, \(F_X (x)\) and \(F_Y (y)\), the joint bivariate cumulative distribution function model, \(F_{XY} (x,y)\) can be expressed via a distribution copula \(C(∙,∙)\) function. Similarly, the bivariate copula density function model, \(f_{XY} (x,y)\) can be explicated in terms of some appropriate probability density copula function, \(c(∙,∙)\).
\[F_{XY} (x,y)=F_X (x)\times F_Y (y)\times
\underbrace{C(F_X (x),F_Y (y))}_{distribution\ copula},\]
\[f_{XY} (x,y)=f_X (x)\times f_Y (y)\times
\underbrace{c(f_X (x),f_Y (y))}_{density\ copula}.\]
Here is an example illustrating the joint bivariate distribution of
vital_days_after_surgery
and
body_mass_index
.
library(extraDistr)
library(pracma)
# marginal domains, # summary(Kidney_dataset$vital_days_after_surgery)
len <- 300
t <- 1:len
x_dom <- seq(0, 3070, length=len) # range of vital_days_after_surgery
y_dom <- seq(16, 50, length=len) # range of body_mass_index
rho <- 0.3 # initial dependence (correlation), -1 < rho = Corr(t,phi) < 1
# Base function definitions
# f(t), Kidney_dataset$vital_days_after_surgery
pdf_x <- density(Kidney_dataset$vital_days_after_surgery, bw = "sj")
f <- approxfun(pdf_x, rule=1:2) # density as an abstract function
fMax <- max(abs(pdf_x$y))
F1 <- function (x) { # CDF as an abstract function
# message(paste('input to F1 =',x))
if (x < 1) { val = 0 }
else if (x > len) { val = 1 }
else { val=integral(f, xmin=x_dom[1]-200, xmax=x_dom[x],
reltol=1e-12, method="Kronrod")
}
# val=integral(f, xmin=x_dom[1]-200, xmax=x, reltol=1e-12, method="Kronrod")
return(val)
} # F1(100)
# g(phi), Kidney_dataset$body_mass_index
pdf_y <- density(Kidney_dataset$body_mass_index)
g <- approxfun(pdf_y, rule=1:2) # density as an abstract function
gMax <- max(abs(pdf_y$y))
G1 <- function (y) {
if (y < 1) { val = 0 }
else if (y > len) { val = 1 }
else { val = integral(g, y_dom[1], y_dom[y], method = "Kronrod") }
return(val)
} # G1(100)
# Gauss-copula
# C_{XY}^{Gauss} (x,y)=\frac{1}{\sqrt{1-ρ^2}} \times \exp \left (-\frac{ρ^2 (a^2+b^2 )-2abρ)}{2(1-ρ^2 )} \right ),
# where the pairwise correlation $-1 < \rho < 1$ and
# a(x)= \sqrt{2}\times erf^{-1} (2\times F_X (x)-1),
# b(y)= \sqrt{2}\times erf^{-1} (2\times F_Y (y)-1),
# and the error function is
# $$erf^{-1}(s)= \frac{2}{\sqrt{\pi}}\int_{0}^{\infty}{e^{-t^2}dt}.$$
CGauss <- function (t=len/2, s=len/2) {
if (t < 1) { A= -1e+9 }
else if (t >= len) { A= 1e+9 }
else { A = sqrt(2) * erfinv(2*(F1(t))-1) }
if (s < 1) { B= -1e+9 }
else if (s > len) { B= 1e+9 }
else { B = sqrt(2) * erfinv(2*(G1(s))-1) }
# message(paste('A=', A, 'B=', B))
val = (1/(sqrt(1-rho^2))) *
exp(-(rho^2*(A^2+B^2) - 2*A*B*rho)/(2*(1-rho^2)))
if (is.finite(val)) return (val)
else return (0)
} # CGauss(1, 17)
# F(t)
fJoint <- function (t=len/2, s=len/2, rho = rho) {
return ( f(x_dom[t]) * g(y_dom[s]) * CGauss(t, s) )
} # fJoint(1,18)
### Part 1: Classical Time Dynamics: $\phi=0$.
# plot(s, f(s)/fMax)
# plot(s, g(s)/gMax)
# # plot(s, 2*g(rep(0, length(s)))/gMax)
# # plot(s, log(F(t=s, phi=0, rho=0.1)))
# plot(s, log(F(t=s, phi=rep(0, length(s)), rho=0.1)))
# plot(s, CGauss(s))
library(plotly)
y1 <- 1000*f(x_dom) # f: first marginal
y2 <- 10* g(y_dom) # g: second marginal
y3 <- c() # Gaussian Copula (. , s=sMax/2)
for (i in 1:len) y3[i]=CGauss(i, 55) # or y_dom[len/2]
y4 <- c() # joint density f(t,t)
for (i in 1:len) y4[i] <- 10000*fJoint(t=i, s=55, rho)
plot_ly() %>%
add_trace(x=~t, y=~y1, type='scatter', mode='markers+lines',
name="f: first marginal") %>%
add_trace(x=~t, y=~y2, type='scatter', mode='markers+lines',
name="g: second marginal", opacity=1) %>%
add_trace(x=~t, y=~y3, type='scatter', mode='markers+lines',
name="C_Gauss(t,s): Gaussian Copula", opacity=0.5) %>%
add_trace(x=~t, y=~y4, type='scatter',
mode='markers+lines',
name="f(t,s, rho)): Joint Density") %>%
layout(title = "Copula-based Modeling of Joint Distribution",
xaxis = list(title = "time"),
yaxis = list(title = "Functional Value"), # showlegend = FALSE,
legend = list(orientation = 'h'),
scene = list(aspectmode = "manual",
aspectratio = list(x=1, y=1)))
Next we can display the raw and smoothed versions of the joint distribution as flat 2D images.
#### Path Kynamics over curves \gamma:\mathbb{R}\to \mathbb{C}
### Part 2: Path Kynamics over curves \gamma:\mathbb{R}\to \mathbb{C}
### Kime tube gamma(k1,k2) on torus ########## CARTESIAN
#### Parametric curve on torus
# define kime-surface
xsurf <- x_dom[t] # longitudinal trajectory
ysurf <- y_dom[t] # phase trajectory
zsurf <- mat<-matrix(0, nrow=len, ncol=len) # joint density f(t,s)
for (i in 1:len) {
for (j in 1:len) {
zsurf[i,j] <- 100000*fJoint(t=i, s=j, rho)
}
if (i %% 30 ==0) message(paste0('Progress: ... ', round((100*i)/len, 0), '%'))
}
# image(zsurf)
plot_ly(z= zsurf, type="heatmap") # %>% hide_colorbar()
## Multi-core parallelization
# library(foreach)
# library(doParallel)
# nc = detectCores()
# cl = makeCluster(nc-2, typr = 'FORK')
# registerDoParallel(cl)
# zsurf1 <- mat<-matrix(0, nrow=len, ncol=len)
# zsurf1 <- foreach (i = 1:len, .combine = 'c') %dopar% {
# for (j in 1:len) {
# 100000*fJoint(t=i, s=j, rho)
# }
# # if (i %% 30 ==0) message(paste0('Progress: ... ', round((100*i)/len, 0), '%'))
# }
# stopImplicitCluster()
# stopCluster(cl)
# image(zsurf)
# Clean zsurf, as %> table(is.na(zsurf)) # FALSE TRUE # 9911 89
# and table(is.infinite(zsurf)) # FALSE TRUE 9920 80
# replace missing values in zsurf by column (MARGIN=2) average
# zsurfSmooth <- sweep(zsurf, MARGIN = 2,
# STATS = apply(zsurf, MARGIN = 2, median, na.rm=TRUE),
# FUN = function(x,s) ifelse(is.na(x), s, x))
# zsurfSmooth2 <- sweep(zsurfSmooth, MARGIN = 2,
# STATS = apply(zsurfSmooth, MARGIN = 2, median, na.rm=TRUE),
# FUN = function(x,s) ifelse(is.infinite(x), s, x))
# image(zsurfSmooth2)
The next plot renders the joint distribution in the 2D Cartesian
plane. The display shows the 2D density surface overlayed with a pair of
canonical curves corresponding to a fixed
vital_days_after_surgery
, \(t=100\) (orange) and a fixed time
body_mass_index
, \(s=s(=22)\) (green).
# 3D Kime-surface Plot
f <- list(family = "Courier New, monospace", size = 18, color = "black")
xlab <- list(title = "post surgery vitality", titlefont = f)
ylab <- list(title = "BMI", titlefont = f)
zlab <- list(title = "Density surface", titlefont = f)
plot_ly(x=~xsurf, y=~ysurf, z=~t(zsurf), type="surface", opacity=0.95,
showlegend=FALSE, name="Joint Density-Surface") %>%
# add vital days after surgery projection curve, for a fixed BMI=22
add_trace(x=~xsurf, y=ysurf[len/2], z=~zsurf[, len/2], type="scatter3d",
mode="lines", line=list(width=20), name="BMI=22") %>%
# add BMI projection curve vital days after surgery=1000
add_trace(x=xsurf[100], y=~ysurf, z=~zsurf[100,], type="scatter3d",
mode="lines", line=list(width=20), name="1000 days post surgery") %>%
hide_colorbar() %>%
layout(title = "Joint Bivariate Density (Gaussian copula, Rho=0.3)",
scene = list(aspectmode = "manual",
xaxis = xlab, yaxis = ylab, zaxis = zlab,
aspectratio = list(x=1, y=1, z=1))) # 1:1:1 aspect ratio
Quantitative data analytics (QDA) represent are next in the natural order of extracting actionable information from complex data. Below are several illustrations of using a data-frame to store the clinical kidney data. Specifically, we will fit classical statistical models (e.g., GLM with binary or quantitative outcomes). Mode advanced analytical strategies employ modern ML/AI for clustering, classification, prediction, forecasting, or segmentation of the study participants. Clearly these demonstrations can be significantly expanded to reflect specific hypothesis-driven research inquiries, exploratory studies, confirmatory analytics, or other inferential approaches to extract actionable information of the observed kidney clinical information.
The participant survival (censored or dead indicators) are stored in
the factor variable Kidney_dataset$vital_status
, which
takes two possible values (censored=alive or
dead).
After randomly splitting the data into two sets (training
and testing), we fit a logistic (glm
) model using
the training data, assess the performance of the model on the
independent testing dataset, compute the receiver operating
characteristic (ROC) curve, and report a variety of performance metrics,
including area under the ROC curve (AUC).
# Model 1 -- may need to split 80:20 the data into training set (for model fitting) and
# testing set (for independent valuation)
Kidney_dataset$vital_status <- as.factor(Kidney_dataset$vital_status)
table(Kidney_dataset$vital_status)
##
## censored dead
## 189 21
# 80% training + 20% testing
set.seed(1234)
subset_int <- sample(nrow(Kidney_dataset), floor(nrow(Kidney_dataset)*0.8))
Kidney_dataset_train <- Kidney_dataset[subset_int, ]
Kidney_dataset_test <- Kidney_dataset[-subset_int, ]
# Build logistic regression model (based on training set)
Log.Reg.Model <- glm(vital_status ~ age_at_nephrectomy+tumor_isup_grade+estimated_blood_loss+
operative_time+pathologic_size+body_mass_index,
data = Kidney_dataset_train, family = "binomial")
summary(Log.Reg.Model)
##
## Call:
## glm(formula = vital_status ~ age_at_nephrectomy + tumor_isup_grade +
## estimated_blood_loss + operative_time + pathologic_size +
## body_mass_index, family = "binomial", data = Kidney_dataset_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.925453 2.573797 -2.302 0.02132 *
## age_at_nephrectomy 0.015919 0.023118 0.689 0.49108
## tumor_isup_grade 1.154501 0.396964 2.908 0.00363 **
## estimated_blood_loss -0.001089 0.001182 -0.921 0.35680
## operative_time 0.001753 0.003293 0.532 0.59453
## pathologic_size 0.081326 0.081302 1.000 0.31717
## body_mass_index -0.019959 0.052095 -0.383 0.70163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.771 on 136 degrees of freedom
## Residual deviance: 77.690 on 130 degrees of freedom
## (31 observations deleted due to missingness)
## AIC: 91.69
##
## Number of Fisher Scoring iterations: 6
## $N
## [1] 137
##
## $R2
## [1] 0.2776233
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -38.8450735 -49.3853113 21.0804755 0.2134286 0.1426183 0.2776233
# Prediction (using testing data)
Log.Reg.Predict <- predict(Log.Reg.Model, Kidney_dataset_test, type = "response")
# Check Accuracy
library(caret)
# Confusion Matrix
confusionMatrix(data = factor(as.factor(Log.Reg.Predict <= 0.02), labels=c("censored", "dead")),
reference = Kidney_dataset_test$vital_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction censored dead
## censored 27 2
## dead 2 1
##
## Accuracy : 0.875
## 95% CI : (0.7101, 0.9649)
## No Information Rate : 0.9062
## P-Value [Acc > NIR] : 0.8239
##
## Kappa : 0.2644
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9310
## Specificity : 0.3333
## Pos Pred Value : 0.9310
## Neg Pred Value : 0.3333
## Prevalence : 0.9062
## Detection Rate : 0.8438
## Detection Prevalence : 0.9062
## Balanced Accuracy : 0.6322
##
## 'Positive' Class : censored
##
# print(paste('Accuracy is ', 1 - MisClassificError))
# Area Under Curve
library(ROCR)
# replace NA predictions to 0 (trivial) probabilities
Log.Reg.Predict <- Log.Reg.Predict %>% replace(is.na(.), 0)
TestDataDependentVar <- Kidney_dataset_test$vital_status
pred <- ROCR::prediction(Log.Reg.Predict, TestDataDependentVar)
perf <- ROCR::performance(prediction.obj = pred, measure = "tpr", x.measure = "fpr")
plot(perf, col = "black", lty = 2, lwd = 2)
abline(a = 0, b = 1, col = "red")
# Calculating AUC
auc <- ROCR::performance(prediction.obj = pred, measure = "auc")
auc <- as.numeric(auc@y.values)
minauc <- (round(auc, digits = 2))
minauct <- paste(c("min (AUC) = "), minauc, sep = "")
legend(0.52, 0.5, c(minauct, "\n"), cex = 1, box.col = "Red")
Now we can bring in the 3D volumetric (MRI) data. We will import and display the 3D imaging data for just one participant. Note that these 3D volumes are rather large (approximate tensor dimensions are \(602\times 512\times 512\)). The imaging data includes the abdominal 3D high-resolution MRI volume and the corresponding kidney mask and kidney-tumor mask. The imaging data can be used to obtained imaging-derived morphometric measures of the masks. These 3D volumetric masks can capture the overall kidney morphology, anatomy, physiology and to some extend function, along with any expert-delineated kidney-tumors, nephrologist-identified kidney-stones, etc. In practice such derived morphometrics (e.g., mask diameter, volume, surface area, boundary shape, complexity, etc.) can be further associated with the remaining clinical data phenotypes. Such translational studies are commonly referred to as “phenomics studies”.
Let’s explore the raw kidney volume of the first participant by displaying some 2D cardinal-projection panes, specially chosen to cut the 3D abdominal MRI volume through the kidneys.
# load the raw kidney image for the first participant
library(oro.nifti)
Kidney_image_3D <- "https://umich.instructure.com/files/29648559/download?download_frd=1"
Kidney_image_3D_File <- file.path(tempdir(), "Kidney_image_3D.nii.gz")
download.file(Kidney_image_3D, dest=Kidney_image_3D_File, mode = "wb", quiet=F)
Kidney_image_3D_vol <- readNIfTI(Kidney_image_3D_File, reorient=FALSE) # oro.nifti::readNIfTI()
# some simple 3D MRI volume descriptors
hist(Kidney_image_3D_vol)
## Formal class 'nifti' [package "oro.nifti"] with 46 slots
## ..@ .Data : num [1:602, 1:512, 1:512] -1024 -1024 -1024 -1024 -1024 ...
## ..@ sizeof_hdr : int 348
## ..@ data_type : chr ""
## ..@ db_name : chr ""
## ..@ extents : int 0
## ..@ session_error : int 0
## ..@ regular : chr ""
## ..@ dim_info : chr ""
## ..@ dim_ : int [1:8] 3 602 512 512 1 1 1 1
## ..@ intent_p1 : num 0
## ..@ intent_p2 : num 0
## ..@ intent_p3 : num 0
## ..@ intent_code : int 0
## ..@ datatype : int 64
## ..@ bitpix : int 64
## ..@ slice_start : int 0
## ..@ pixdim : num [1:8] 1 0.5 0.799 0.799 1 ...
## ..@ vox_offset : num 352
## ..@ scl_slope : int 1
## ..@ scl_inter : int 0
## ..@ slice_end : int 0
## ..@ slice_code : int 0
## ..@ xyzt_units : int 0
## ..@ cal_max : num 1393
## ..@ cal_min : num -1024
## ..@ slice_duration: num 0
## ..@ toffset : num 0
## ..@ glmax : num 1393
## ..@ glmin : num -1024
## ..@ descrip : chr ""
## ..@ aux_file : chr ""
## ..@ qform_code : int 0
## ..@ sform_code : int 2
## ..@ quatern_b : num -0.707
## ..@ quatern_c : num 0
## ..@ quatern_d : num 0.707
## ..@ qoffset_x : num 0
## ..@ qoffset_y : num 0
## ..@ qoffset_z : num 0
## ..@ srow_x : num [1:4] 0 0 -0.799 0
## ..@ srow_y : num [1:4] 0 -0.799 0 0
## ..@ srow_z : num [1:4] -0.5 0 0 0
## ..@ intent_name : chr ""
## ..@ magic : chr "n+1"
## ..@ extender : int [1:4] 0 0 0 0
## ..@ reoriented : logi FALSE
## ..$ dim: int [1:3] 602 512 512
## [1] 602 512 512
# Load the tumor mask volume
Kidney_mask_3D <- "https://umich.instructure.com/files/29642168/download?download_frd=1"
Kidney_mask_3D_File <- file.path(tempdir(), "Kidney_mask_3D.nii.gz")
download.file(Kidney_mask_3D, dest=Kidney_mask_3D_File, mode = "wb", quiet=F)
Kidney_mask_3D_vol <- readNIfTI(Kidney_mask_3D_File, reorient=FALSE)
dim(Kidney_mask_3D_vol) # 602 512 512
## [1] 602 512 512
# vol-dimensions
xdim <- dim(Kidney_image_3D_vol)[1] # Z
ydim <- dim(Kidney_image_3D_vol)[2] # Y
zdim <- dim(Kidney_image_3D_vol)[3] # X
# cross-hairs of 2D cardinal projection planes, mind the negative (reversed) step-sizes
x = 376 #as.integer(xdim/2) # 602-380
y = 224 # as.integer(ydim/2)
z = 334 # as.integer(zdim/2)
library(oro.nifti)
orthographic(Kidney_image_3D_vol, Kidney_mask_3D_vol, zlim.y=c(0.5, 2.5),
xyz=c(xdim-x,ydim-y,zdim-z), zlim=range(Kidney_image_3D_vol)*0.9)
# z slice
# img <- (1200+Kidney_image_3D_vol[as.integer(dim(Kidney_image_3D_vol)[1]/2),,])/10
# mode(img) <- "integer"; str(img)
# plot_ly(z=~img, type="heatmap")
z2D_img <- (1200+Kidney_image_3D_vol[z,,])/10 # 380
mode(z2D_img) <- "integer"
z2D_mask <- Kidney_mask_3D_vol[z,,]
# z2D_mask <- (Kidney_mask_3D_vol[as.integer(dim(Kidney_mask_3D_vol)[1]/2),,])
# vertical flip of 2D arrays (matrices) to show images in standard anatomical space
# z2D_img <- z2D_img[c((dim(z2D_img)[2]):1),,drop = FALSE]
# z2D_mask <- z2D_mask[c((dim(z2D_mask)[2]):1),,drop = FALSE]
# Each mask (Kidney_mask_3D_vol, segmentation.nii.gz) has background (0), kidney-mask (1), and tumor-mask(2)
table(z2D_mask)
## z2D_mask
## 0 1 2
## 253733 8189 222
# 0 1 2
# 258312 3657 175
# zslice <- ggplotly(ggplot() + geom_hline(yintercept = y, color = "red") +
# geom_vline(xintercept = x, color = "red") + xlim(0, xdim) + ylim(0, ydim)) %>%
# # add_contour(z = ~(255*z2D_mask), name="Axial Mask") %>%
# add_contour(z = ~z2D_img, name="Axial Image", opacity=0.3) %>% hide_colorbar()
# # x slice
# x2D_img <- t(Kidney_image_3D_vol[x, , ])
# xslice <- ggplotly(ggplot() + geom_hline(yintercept = z, color = "red") +
# geom_vline(xintercept = y, color = "red") + xlim(0, ydim) + ylim(0, zdim)) %>%
# add_contour(z = ~x2D_img, name="Sagittal Plane") %>% hide_colorbar()
# # y slice
# y2D_img <- t(Kidney_image_3D_vol[, y, ])
# yslice <- ggplotly(ggplot() + geom_hline(yintercept = z, color = "red") +
# geom_vline(xintercept = x, color = "red") + xlim(0, xdim) + ylim(0, zdim)) %>%
# add_contour(z = ~y2D_img, name="Coronal Plane") %>% hide_colorbar()
# # combine all 4 plots:
# subplot(zslice, xslice, yslice, nrows = 3)
# Define crosshairs
hline <- function(y = 0, color = "red") {
list(type = "line", x0 = 0, x1 = 1, xref = "paper", y0 = y, y1 = y,
line = list(color = color, dash="dot"))
}
vline <- function(x = 0, color = "red") {
list(type = "line", y0 = 0, y1 = 1, yref = "paper", x0 = x, x1 = x, line = list(color = color, dash="dot"))
}
# Axial View
plot_ly() %>%
add_trace(z=~z2D_img, type="contour", colorscale = "Greys", name="Axial Image", opacity=0.3) %>%
add_trace(z=~z2D_mask, type="heatmap", colorscale = "Reds", name="Axial Mask") %>%
# add_segments(x=x, y=0, xend=x, yend=dim(z2D_img)[1]) %>%
# add_segments(x=0, y=y, xend=dim(z2D_img)[2], yend=y) %>%
layout(title="Axial MRI Image and Kidney Mask", scene = list(aspectratio = list(x=1, y=1, z=1))) %>%
#, shapes=list(hline(198), vline(168)))
hide_colorbar()
# Sagittal View
z2D_img <- (1200+Kidney_image_3D_vol[,,x])/10
mode(z2D_img) <- "integer"
z2D_mask <- Kidney_mask_3D_vol[,,x]
plot_ly() %>%
add_trace(z=~z2D_img, type="contour", colorscale = "Greys", name="Sagittal Image", opacity=0.3) %>%
# add_trace(z=~z2D_mask, type="heatmap", colorscale = "Reds", name="Sagittal Mask") %>%
layout(title="Sagittal MRI Image and Kidney Mask") %>% hide_colorbar()
# shapes = list(hline(242), vline(305))) %>% hide_colorbar()
# Coronal View
z2D_img <- (1200+Kidney_image_3D_vol[,y,])/10
mode(z2D_img) <- "integer"
z2D_mask <- Kidney_mask_3D_vol[,y,]
plot_ly() %>%
add_trace(z=~z2D_img, type="contour", colorscale = "Greys", name="Coronal Image", opacity=0.3) %>%
add_trace(z=~z2D_mask, type="heatmap", colorscale = "Reds", name="Coronal Mask") %>%
layout(title="Coronal MRI Image and Kidney Mask") %>% hide_colorbar()
Let’s illustrate some simple examples of deriving imaging morphometry measures from the 3D imaging volumes. In essence, we will just compute the kidney and tumor volumes. More advanced metrics, such as regional boundary surface measures (e.g., surface area, curvature, fractal dimension, shape index, etc.), may also be computed, see some of the LONI/SOCR team morphometric analysis publications. Once the imaging morphometrics are computed, they can be added to the clinical data spreadsheet and used for subsequent exploratory or quantitative (confirmatory/hypothesis-driven) analytics. Recall that the 3D imaging MASKS data are available in NII.gz format on GitHub e.g., mask case 00001.
# Morphometrics calculating function, an example is the volume of the kidney and tumor
morphometrix <- function (volume) {
morpho_voxels <- table(volume) # Voxel counts: background, kidney-voxels, and tumor-voxels
# Volume (mm^3): background, kidney-volume, and tumor volume
pix_dim <- volume@pixdim[c(2:4)] # extract the x, y, and z step-sizes
morpho_volume <- morpho_voxels * prod(pix_dim) # Volumes: background, kidney-volume, and tumor-volume
return(morpho_volume)
}
# Test the morphometrix function
morphometrix(Kidney_mask_3D_vol)
## volume
## 0 1 2
## 49869481.630 474833.982 7265.388
# Download the kidney and tumor mask volume for a single participant
maskDownloader <- function (path) {
Kidney_mask_3D_File <- file.path(tempdir(), "Kidney_mask_3D.nii.gz")
download.file(path, dest=Kidney_mask_3D_File, mode = "wb", quiet=F)
return(readNIfTI(Kidney_mask_3D_File, reorient=FALSE))
}
library(stringr)
files <- paste0("https://github.com/neheller/kits19/blob/master/data/case_",
str_pad(0:209, width = 5, side = "left", pad = 0),
"/segmentation.nii.gz?raw=true")
# https://github.com/neheller/kits19/blob/master/data/case_00209/segmentation.nii.gz?raw=true
# Run the morphometrix for all 210 study participants
for (i in 1:dim(Kidney_dataset)[1]) { # 1:210 participants
# Download the mask
Kidney_mask_3D_vol <- maskDownloader(files[i])
# Compute the kidney-volume and tumor volume (mm^3) and augment the clinical meta-data: Kidney_dataset
morpho <- morphometrix(Kidney_mask_3D_vol)
Kidney_dataset$KidneyVolume[i] <- (morpho[2] + morpho[3]) # kidney + tumor volume!
Kidney_dataset$TumorVolume[i] <- morpho[3] # just the (smaller) kidney-tumor volume
}
# Save augmented data
write.csv(Kidney_dataset, file="C:/Users/dinov/Desktop/kits21_kidney_dataset_NI_augmented.csv")
# Try some data analytics using the clinical data augmented with imaging-derived morphometric data ...
Try some simple model-based or model-free data analytics using the clinical data augmented with imaging-derived morphometric data.