

1 How to navigate the DataSifter site

This is a dynamic webpage that allows interactive navigation via the ToC (top-left) and tabs (horizontal and vertical hide/expand segments). By clicking on each blue hyperlink in the ToC, the reader can navigate between different sections. By clicking on the tabs, the reader can navigate within the many parts of each section. The video below illustrates the main features of the DataSifter protocol.


2 Motivation

Currently, there are no practical, scientifically reliable, and effective mechanisms for sharing real clinical data containing no clearly identifiable personal health information (PHI) without either compromising the value of the data (by excessively scrambling/encoding the information) or introducing a substantial risk of re-identification of individuals through various stratification techniques.


2.1 Background

We developed a novel method and a protocol for on-the-fly de-identification of sensitive information, e.g., structured Clinical/Epic/PHI data. This approach provides complete administrative control over the risk of data re-identification when sharing large clinical cohort-based medical data. At the extremes, the data-governor may specify that either null data or completely identifiable data is generated and shared with the data-requester. This decision may be based on data-governor-determined criteria about access level, research needs, etc. For instance, to stimulate innovative pilot studies, the data office may dial up the level of protection (which may naturally devalue the information content in the data), whereas for more established and trusted investigators, the data governors may provide a more egalitarian dataset that balances preservation of information content and protection of sensitive information.

In a nutshell, responding to requests by researchers interested in examining specific healthcare, biomedical, or translational characteristics of multivariate clinical data, the DataSifter allows data governors, such as healthcare systems, to filter, export, package, and share sensitive clinical and medical data for large population cohorts.


2.2 Technology

The DataSifter protocol is based on an algorithm that involves (data-governor controlled) iterative data manipulation that stochastically identifies candidate entries from the cases (e.g., subjects, participants, units) and features (e.g., variables or data elements) and subsequently selects, nullifies, and imputes the information. This process relies heavily on statistical multivariate imputation to preserve the joint distributions of the complex structured data archive. At each step, the algorithm generates a complete dataset that in aggregate closely resembles the intrinsic characteristics of the original cohort; however, at the individual level (e.g., rows), the data are substantially obfuscated. This procedure drastically reduces the risk of subject re-identification by stratification, as metadata for all subjects are repeatedly and lossily encoded. A number of techniques, including mathematical modeling, statistical inference, probabilistic (re)sampling, and imputation methods, are embedded in the DataSifter information obfuscation protocol.
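To make the iterative nullify-and-impute idea concrete, here is a conceptual sketch of a single pass using the missForest package on a toy numeric data frame (an illustration only, not the package's internal implementation):

library(missForest)
set.seed(2024)
df <- data.frame(a = rnorm(200), b = rnorm(200), c = runif(200))
df_holes   <- prodNA(df, noNA = 0.25)     # stochastically nullify 25% of the entries
df_imputed <- missForest(df_holes)$ximp   # re-impute the holes from the remaining joint structure
colMeans(df) - colMeans(df_imputed)       # aggregate summaries stay close to the original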

A detailed flow chart illustrating the imputation and obfuscation steps of the DataSifter protocol is shown in Figure 1.

Figure 1: Diagram of Data Sifter protocol for Imputation and Obfuscation


2.3 Applications

The main applications of the DataSifter include:

  • Sharing electronic medical or health records (EMR/EHR): This gives researchers and non-clinical investigators access to sensitive data and promotes scientific modeling, rapid interrogation, exploratory and confirmatory analytics, as well as discovery of complex biomedical processes, health conditions, and biomedical phenotypes.

  • Sharing Biosocial Data: Allowing engineers and data scientists to examine sensitive CMS/Medicare/Census/HRS data without incurring a substantial risk of participant re-identification.

  • Other government data (e.g., IRS) may similarly be shared as DataSifter outputs without privacy concerns.


2.4 Advantages

The new method has some advantages:

  • This algorithm permits computationally efficient implementations - a feature that is attractive for data governors that deal with large amounts of data, receive many data requests, and need innovative strategies for data interrogation.

  • On the individual level, the DataSifter scrambles sensitive information that could be used for stratification-based data re-identification attacks, thus reducing security risks. At the same time, it preserves the joint distribution of the entire cohort-based data, thus addressing the urgent need to expedite data interpretation, derive actionable knowledge, and enable rapid decision support.

  • As the data governors can keep their mapping between the native subject identifiers (e.g., EMR, SSN) and the study-specific subject IDs (sequential or random), the size and complexity of the data collection may easily be extended to append additional longitudinal data augmenting previously generated DataSifter output datasets. This provides a mechanism for meaningful aggregation of obfuscated data.


2.5 Software/materials

In the implementation of DataSifter, we used SOCR libraries (http://socr.umich.edu/html/SOCR_CitingLicense.html) and R software (https://www.r-project.org/Licenses/). More information is provided at our GitHub repository.

To install DataSifter lite, run the following two commands in your R/RStudio shell:

# install.packages("devtools")  # just in case you don't already have the R devtools package installed.
library(devtools)
install_github("SOCR/DataSifter")

The DataSifter.lite demo() function may be called to run a quick simulation that will illustrate the typical inputs, processing steps, intermediate and final outputs, numerical results, and visualization plots.

# Make sure you first install the `DataSifter.lite` packages as shown in the Installation section
library(DataSifter.lite)
# to check that the DataSifter.lite demo function, `DataSifter_func()`, is there try:
demo(package=.packages(all.available = TRUE))
demo(DataSifter_func)

## OUTPUT OF THE DEMO
    demo(DataSifter_func)
    ---- ~~~~~~~~~~~~~~~

Type  <Return>   to start : 

> ### DataSifter small application
> library(DataSifter.lite)

> #Generate original data
> set.seed(1234)

> x1<-runif(1000)

> x2<-runif(1000) 

> x3<-runif(1000)

> x4<-runif(1000)

> x5<-runif(1000)

> data1<-data.frame(x_1=x1,x_2=x2,x_3=x3,x_4=x4,x_5=x5)

> data1$y=1+x1+x2-0.5*x3-2*x4+0.5*x5

> #Generate "sifted" data with "medium" level of obfuscation
> sifteddata <- dataSifter(level = "medium",data=data1,nomissing = TRUE)
  missForest iteration 1 in progress...done!
  missForest iteration 1 in progress...done!
[1] "Artifical missingness and imputation done"
[1] "Obfuscation step done"

> #Calculate Percent Identical Feature Values(PIFV)
> PIFV <- pctMatch(data1,sifteddata)

> summary(PIFV)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.3333  0.5000  0.4678  0.6667  0.8333 

> #Linear model for testing data utility
> model <- lm(y~x1+x2+x3+x4+x5,data = sifteddata)

> summary(model)$coefficients
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept)  1.0726348 0.03079286  34.83388 2.113269e-174
x1           0.7638591 0.02637227  28.96448 3.079890e-134
x2           0.7736551 0.02730057  28.33842 5.757082e-130
x3          -0.3468092 0.02688633 -12.89909  2.605116e-35
x4          -1.7138182 0.02647332 -64.73756  0.000000e+00
x5           0.3520868 0.02630641  13.38407  1.107224e-37

2.6 Notes

  • In terms of implementation, repeated user requests for the same cohort data (e.g., different \(\eta\) values) should be discouraged and viewed with suspicion. Multiple copies of different iterations of DataSifter-generated data of the same cohort may be cleverly mined and merged in an effort to reconstruct the original (sensitive) data.
  • The GitHub DataSifter repository provides additional information about this new technology.
  • Handling unstructured data, strings or non-ordinal data: By default, the DataSifter randomly swaps such features between cases/participants, subject to determining pairs of cases that are close to each other using appropriate distance metrics defined on the structured quantitative features in the data. More advanced techniques using NLP and ML methods may be employed to transform unstructured data into structured data elements that can be jointly utilized in the DataSifting process.
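The last note above can be illustrated with a minimal sketch: compute case-to-case distances on the structured numeric features and then swap a free-text column between nearest neighbors (the toy data and the helper column note_sifted are our assumptions, not the package's internals):

set.seed(1)
df <- data.frame(age = rnorm(6, 50, 10), bmi = rnorm(6, 27, 3),
                 note = paste("clinical note", 1:6), stringsAsFactors = FALSE)
D <- as.matrix(dist(scale(df[, c("age", "bmi")])))  # distances on structured features only
diag(D) <- Inf                                      # exclude self-matches
nn <- apply(D, 1, which.min)                        # nearest case for each record
df$note_sifted <- df$note[nn]                       # swap the unstructured feature between close cases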

3 Data Sifter Technical details

We outline below some basic features for the user of the Data Sifter protocol. Other sections will illustrate more technical specifications of the Data Sifter algorithm as well as its mathematical formulation. For all the details on the Data Sifter algorithm see the published manuscript.


3.1 The obfuscation slider \(\eta\)

The Data Sifter can obfuscate any dataset based on inputs from the users. We devised a slider \(\eta\) that varies between 0 and 1. If the users want no obfuscation, they can set \(\eta = 0\). If the users want a completely obfuscated dataset (completely synthetic dataset), they can set \(\eta = 1\). Anything in between will adjust the cases/features obfuscation accordingly. The closer \(\eta\) is to 1, the more obfuscation will be generated.

The DataSifter package provides 5 pre-built obfuscation levels: none, small, medium, large, and indep (independent, i.e., a completely random/synthetic dataset).

Below is the complete and detailed description of the dataSifter() function.

Description

The full version of the dataSifter() function creates an informative privacy-preserving dataset that guarantees subjects’ privacy while preserving the information contained in the original dataset.

Usage

dataSifter(level = "indep", data, unstructured.names = NULL, subjID = NULL, batchsubj = 1000, missingback = FALSE, usecore = 2, col_preserve = 0.5, col_pct = 0.7, nomissing = FALSE, k0 = NA, k1 = NA, k2 = NA, k3 = NA, k4 = NA, maxiter = 1)

Arguments

  • level: takes a value among (“none”, “small”, “medium”, “large”, “indep”): the user-defined level of obfuscation for the DataSifter algorithm. A greater value represents a higher level of obfuscation. The default value is “indep”, with the most obfuscation, which produces independent variables that follow the empirical distributions of the original data.

  • data: original data to be processed.

  • unstructured.names: A character vector indicating the unstructured (rich text) variable names in the data.

  • subjID: A character vector indicating the subject ID variables. These variables will be removed for privacy protection.

  • batchsubj: Number of subjects per batch for parallel computing.

  • missingback: Indicator for whether to put the original missingness back into the sifted data. The default is FALSE.

  • usecore: Number of cores to use for the parallel imputation step. The default is 2 cores.

  • col_preserve: The maximum percentage of columns that can be deleted due to massive missingness.

  • col_pct: Criterion for column deletion due to massive missingness. If a column’s missing percentage is larger than this threshold, the column is deleted.

  • nomissing: Indicator of missingness in the original dataset. If nomissing=TRUE, there are no missing values in the original data.

  • k0: (optional user-defined level of privacy) specifying k0-k4 overrides the “level” parameter. k0 is a binary option [0,1] determining whether or not to swap/obfuscate the unstructured features defined in List4. It is not relevant for computing distances between cases.

  • k1: (optional user-defined level of privacy) specifying k0-k4 overrides the “level” parameter. k1 defines the percentage of artificial missing values introduced and later imputed by the algorithm. We suggest using values between 0 and 0.4.

  • k2: (optional user-defined level of privacy) specifying k0-k4 overrides the “level” parameter. k2 defines how many times to repeat the k1 step. Five options are allowed: [0,1,2,3,4].

  • k3: (optional user-defined level of privacy) specifying k0-k4 overrides the “level” parameter. k3 defines the fraction of features (among all the features except the unstructured ones) to swap/obfuscate on ALL the cases. For each case, the swapping partner is chosen within a certain radius (distance, see the fifth component k4). Values between 0 and 1 are allowed.

  • k4: (optional user-defined level of privacy) specifying k0-k4 overrides the “level” parameter. The minimum fraction of closest subject pairs to be swapped. This implies that we swap the selected structured and unstructured feature values between each pair among the top k4 percent of the closest-distance subject pairs.

Details

When level=“indep”, each variable in the sifted dataset is independently generated from its empirical distribution in the original data. On the other hand, level=“none” returns the original dataset. When a factor contains a level with an empty value (" "), the function will likely produce an “out of bounds” error.

The process could take a while to run on large datasets. Two messages indicate the progress of the process: “Artifical missingness and imputation done” and “Obfuscation step done”.

Output Value

Returns the sifted dataset.


3.2 Five components of \(\eta\)

The slider \(\eta\) is generated by combining five obfuscation steps, each one with a different impact on the data. The first component of \(\eta\) in the DataSifter is k0.

  • k0: binary option [0,1] determining whether or not to swap/obfuscate the unstructured features defined in List4. It is not relevant for computing distances between cases.

The second and third components of \(\eta\) in the DataSifter are k1 and k2. They are linked since the obfuscation step k1 is repeated as many times as specified in k2.

  • k1: percentage of artificial missing values introduced and subsequently imputed. In this example we use values between 0% and 40% in increments of 5%.

  • k2: how many times to repeat k1. In this example we use 5 options: [0,1,2,3,4].

The fourth element of \(\eta\) in the DataSifter is k3.

  • k3: fraction of features (among all the features except the unstructured ones) to swap/obfuscate on ALL the cases. For each case, the swapping partner is chosen within a certain radius (distance, see the fifth component k4). In this example we use values between 0% and 100% in increments of 5%.

The fifth element of \(\eta\) in the DataSifter is k4.

  • k4: the swapping step k3 is performed on each case by sampling among the k4-percentile of its neighbors (the fraction of closest neighbors from which to select the case to be swapped). In this example we use values between 1% and 100% in increments of 1%. We are not computing the contribution of k4 to the Data Sifter slider \(\eta\) yet.
## Pick a value from each of the K_options below and type it into the k_raw
k0_options <- c(0,1)
k1_options <- c(0,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40)
k2_options <- c(0,1,2,3,4)
k3_options <- c(0,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1)
k4_options <- seq(0.01,1,0.01)

The level options were based on the following combination of \(k_0-k_4\)

Level.of.Obfuscation   k0    k1     k2    k3    k4
none                    0    0      0     0     0
small                   0    0.05   1     0.1   0.01
medium                  1    0.25   2     0.6   0.05
large                   1    0.4    5     0.8   0.2
indep                   -    -      -     -     -
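If the arguments behave as documented above, requesting a pre-built level should be roughly equivalent to passing the corresponding \(k_0-k_4\) values explicitly (a hypothetical illustration using the data1 demo data from the Software/materials section; user-defined k0-k4 override “level”):

# "medium" obfuscation requested via the pre-built level
sifted_medium <- dataSifter(level = "medium", data = data1, nomissing = TRUE)
# the same obfuscation specified explicitly with the k-parameters from the table above
sifted_k <- dataSifter(data = data1, nomissing = TRUE,
                       k0 = 1, k1 = 0.25, k2 = 2, k3 = 0.6, k4 = 0.05)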

3.3 List of features to be handled

Based on the five components of the slider \(\eta\), ideally we want to know 5 lists of feature types as inputs:

  1. LIST1: List of features to remove for privacy or other reasons.

  2. LIST2: List of dates.

  3. LIST3: List of categorical features (i.e., binomial/multinomial).

  4. LIST4: List of unstructured features (e.g., medicine taken with a dose spec and type of release).

  5. LIST5: List of features with known dependencies (e.g., bmi = f(weight, height)), or temporal correlation (e.g., weight/height of a kid over time). Right now it is empty.

List1 is provided by the data governor. After the features in List1 are removed, constant features are also removed if present in the dataset.

List2 of date features is provided by the data governor. The obfuscation of date features will be done by complying with the time resolution \(\Delta{t}\) requested (e.g., years, months, weeks, days, hours, minutes,….). Right now we obfuscate with \(\Delta{t}=year\).

List3 of categorical features is automated, enforcing the following criteria for defining a categorical feature \(x\) (here \(N\) represents the total number of cases):

if \((unique(x) > [3*log(N)])\), then \(x\) is likely NOT CATEGORICAL.
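A minimal sketch of this heuristic (the helper name is ours, not part of the package):

is_likely_categorical <- function(x) {
  N <- length(x)
  length(unique(x)) <= ceiling(3 * log(N))   # few distinct values relative to 3*log(N)
}
is_likely_categorical(sample(c("A", "B", "C"), 1000, replace = TRUE))  # TRUE
is_likely_categorical(rnorm(1000))                                     # FALSE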

List4 and List5 are specified by the data governor.


3.4 Similarity/Distance metric between cases

To properly define the components k3 and k4, distances between cases must be calculated. This step can calculate a variety of dissimilarity or distance metrics between cases/subjects. The function ecodist::distance() is written for extensibility and understandability, and it may not be efficient for use with large matrices. Initially, we only select numeric features. Later we may need to expand this to select factors/categorical features and strings/unstructured features. The default distance is the Bray-Curtis distance (more stable/fewer singularities). The result of distance() is a lower-triangular distance matrix, an object of class “dist”. With \(N\) representing the total number of cases, the size of the distance vector is \(\frac{N(N-1)}{2}\).
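For instance, assuming the ecodist package is installed, the distance step might look like this on the numeric covariates of the data1 demo data from the Software/materials section:

library(ecodist)
num_feats <- as.matrix(data1[, paste0("x_", 1:5)])          # numeric covariates only
D <- distance(num_feats, method = "bray-curtis")            # Bray-Curtis dissimilarities between cases
class(D)                                                    # "dist", a lower-triangular object
length(D) == nrow(num_feats) * (nrow(num_feats) - 1) / 2    # N(N-1)/2 pairwise distances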


3.5 Example of selection for the 5 options k0-k4 of \(\eta\)

# These are the values selected for this example
k0 <- 0 # swapping of unstructured data features
k1 <- 0.5 # % of data removal from the entire dataset (except unstructured features) 
k2 <- 2   # how many times to perform imputation on the "disrupted" dataset after k1
k3 <- 1 # swapping of structured features (right now done on ALL features except the unstructured ones)
k4 <- .5 # % of closest neighbors to choose from
## [1] "Current k0-k4 options selected for the Data Sifter Slider"
## [1] 0.0 0.5 2.0 1.0 0.5
## [1] "Current value of the Data Sifter Slider"
## [1] 0.963282
## [1] "Distance Sifted Data from Raw Data"
## [1] 33316.16

3.6 Comparing Raw and Sifted Data records and distributions

We compare below 2 records across 10 features in Figure 2. This step selects only double (numeric) features for the record and the beanplot comparisons. It samples 10 numeric features from the selected features and generates a table of Raw vs. Sifted records, as well as beanplots comparing the distributions. The goal of the Data Sifter is to obfuscate single records without disrupting the overall distribution of features. The magnitude of obfuscation is set by the five options selected to generate the normalized sifter slider value \(\eta\) between 0 and 1. Sometimes, since we swap feature values between cases that are within a certain radius (set by k4), there is a possibility of swapping back and forth between the same cases. We will eventually place a control on this occurrence, if necessary.
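For example, assuming the beanplot package is available and that the sifted output preserves the original column order, a raw-vs-sifted comparison for one feature from the earlier demo (data1 and sifteddata) could be drawn as follows:

library(beanplot)
cmp <- data.frame(value  = c(data1[[1]], sifteddata[[1]]),     # first feature, assuming column order is preserved
                  source = rep(c("Raw", "Sifted"), each = nrow(data1)))
beanplot(value ~ source, data = cmp,
         main = "Feature 1: Raw vs Sifted")                    # the overall distribution shapes should remain similar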

Figure 2: Comparing 10 Feature Distributions


3.7 Handling Longitudinal and Unstructured Data

Our early prototype of the DataSifter method shows that the resulting sifted data contain non-parametric noise patterns but preserve much of the original data energy. There is a balance between measuring and controlling the noise and preserving the utility of the sifted synthetic data. For a fixed level of privacy protection, to better guarantee data utility, we aim to take a step further with model-based imputation and obfuscation techniques.

We are working to extend the DataSifter statistical obfuscation technique to handle longitudinal and unstructured (text) data elements.

Longitudinal Data

To handle correlated data, i.e., repeated measurements collected longitudinally for some features, our enhanced DataSifter v2.0 will incorporate longitudinal imputation techniques. We propose the following regression-based strategies for longitudinal imputation.

  • Fit imputation models (either parametric or semiparametric) using time-invariant features, time, and selected time-dependent variables as predictors. We will consider Generalized Linear Mixed Models (GLMM) or Generalized Additive Models (GAM) here (a minimal sketch follows this list).

  • When necessary, especially when the cluster size is small, we will apply non-parametric regression to account for correlation, which allows the correlations to be fit more flexibly. This method might in general be computationally intensive when selecting nonlinear kernels or spline terms, but should work well when the cluster size, i.e., the number of repeated measurements, is small.

  • Variable selection techniques such as LASSO, Grouped LASSO, or Elastic Net regularization will be embedded to develop more stable and robust imputation models for parametric models. More flexible methods, such as random forests or other machine learning techniques that provide feature importance rankings, can be used to select important features for both semi-parametric and non-parametric models, when data size allows.
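A minimal sketch of the GAM-based idea in the first bullet (our assumption of how such an imputation could look; long is a hypothetical long-format data frame with subject id, time, a covariate, and a partially missing outcome):

library(mgcv)
set.seed(1)
long <- data.frame(id = rep(1:50, each = 5), time = rep(1:5, times = 50), x1 = rnorm(250))
long$y <- 2 + 0.5 * long$x1 + sin(long$time) + rnorm(250, sd = 0.3)
long$y[sample(250, 25)] <- NA                          # artificially missing repeated measurements
fit <- gam(y ~ s(time, k = 4) + x1, data = long)       # smooth time trend plus a covariate
miss <- is.na(long$y)
long$y[miss] <- predict(fit, newdata = long[miss, ])   # model-based longitudinal imputation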

Unstructured (text) Data

Unstructured and semi-structured datasets present substantial problems with data representation, modeling, analysis, and interpretation. These difficulties arise from the lack of a pre-defined data model, data format complexities, incongruences of the elements, and the absence of direct homologies between features across cases. Examples of semi/un-structured data include text, biospecimens, audio, video, analogue information, and proxy measures (e.g., brain images). This complexity of unstructured data inhibits our ability to automate, extract, and interpret the information effectively. Employing semantic analysis, text mining, and natural language processing provides a mechanism to obtain derived quantitative measures and build explicit unstructured data element homologies between cases (subjects), which would facilitate knowledge extraction, data discovery, and predictive analytics.

We define unstructured data (features) as features that contain voice messages or rich text. Doctors’ notes and diagnostic notes (either oral records or written notices) are two common unstructured features in electronic health records. Because we cannot directly represent such features quantitatively, unstructured data can be hard to obfuscate with statistical procedures. We will apply technologies available in the current precision-medicine era to enrich the proposed DataSifter:

  • Conversion of voice to text: Conversation transcripts, whenever available, will be collected by one of the existing chatbot platforms (e.g., Amazon Lex). Amazon Lex, the same technology that powers Amazon Alexa, provides advanced deep-learning functionality for automatic speech recognition (ASR), converting speech to text, and natural language understanding (NLU), recognizing the intent of the text, which enables us to build applications with highly engaging user experiences and lifelike conversational interactions.

  • Text mining: We consider approaching the rich text features by expanding them into a Document Term Matrix (DTM) and extracting the most frequent or relevant terms of the feature. Text, or text converted from speech, will be analyzed to identify keywords, sentiments, entities and relations, topics, and other important text features that are useful for the downstream tasks. In the end, the algorithm creates a matrix and extracts keywords in the rich-text feature as the sifted features (a minimal sketch follows this list).

  • Linkage of text features to the modified Sifter algorithm: Modify the sifter algorithm to include the restricted DTM with high-frequency terms extracted from the original unstructured features. Since the restricted DTM cannot be reverted back to the original messages, further obfuscation of the extracted features may be skipped for simplicity, although it could also be applied for additional data protection.
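A minimal sketch of the DTM expansion described above, assuming the tm package and a toy free-text feature (not the DataSifter's actual text-processing pipeline):

library(tm)
notes <- c("patient reports mild headache and nausea",
           "severe headache, prescribed ibuprofen",
           "follow-up visit, headache resolved")
corpus <- VCorpus(VectorSource(notes))
dtm <- DocumentTermMatrix(corpus,
                          control = list(removePunctuation = TRUE, stopwords = TRUE))
findFreqTerms(dtm, lowfreq = 2)   # high-frequency terms, e.g. "headache"
as.matrix(dtm)[, "headache"]      # term counts usable as derived structured features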

3.8 Mathematical Formulation

We now show a step-by-step mathematical formulation of the DataSifter algorithm.

We use the following framework to formulate the DataSifter algorithm. Three sources of obfuscation are applied to the data by the DataSifter technique: (1) initial data imputation (in the preprocessing step), (2) artificial introduction and imputation of missingness (in the imputation step), and (3) swapping of data values within neighborhoods (in the obfuscation step). Here we define all the mappings that are employed for obfuscation.

Notation

Define \(\mathcal{X}\) as the original complete sensitive dataset for sifting, consisting of \(m\) features and \(n\) cases. Let us use \(1\leq j \leq m\) to denote features and \(1\leq i\leq n\) to denote cases:

\[\mathcal{X}=(\mathbf{X}_1,...,\mathbf{X}_j,...,\mathbf{X}_m)\in R^{n \times m},\mathbf{X}_j=(X_{1,j},...,X_{n,j})^T, 1\leq j \leq m.\] In the above expression, \(X_{i,j}\) denotes the \(i\)-th subject’s \(j\)-th feature value.

We define the utility information embedded in a dataset as the knowledge about the joint distribution of the holistic data including all variables. By DataSifting preservation of utility, we mean the relative conservation of the signal energy that suggests small deviation of the sifted-data joint distribution from the original (raw) data joint distribution. Clearly, this does not hold true for large obfuscation levels (e.g., as \(\eta \rightarrow 1\)).

Define \(F_j\) as the distribution of the \(j\)-th variable (feature) in the dataset.

\[X_{i,j}\sim F_j, i=1,...,n.\] Missing data is pervasive in almost all real-world datasets. We define the hypothetical complete \(j\)-th feature as:

\[\mathbf{X}_j=(\mathbf{X}_{j,obs},\mathbf{X}_{j,mis}),\] where \(\mathbf{X}_{j,mis}\) denotes a vector containing the actual values of the missing data portion. What we observe is denoted as \(\tilde{\mathbf{X}}_j=(\tilde{\mathbf{X}}_{j,obs},\mathbf{N}_j)\), where \(\mathbf{N}_j\) represents the missing cells. The length of \(\tilde{\mathbf{X}}_{j,obs}\) is \(n_j\) and the length of \(\mathbf{N}_j\) is \(n-n_j\).

Initial Data Imputation

This obfuscation happens in the data preprocessing step, where we impute the missing cells in the original data using missForest [23]. We define the imputation method as a mapping from the observed incomplete dataset to a complete dataset with imputed values following estimated conditional distributions:

\[MF(\cdot) : (\tilde{\mathbf{X}}_1,...,\tilde{\mathbf{X}}_m)\rightarrow G,\] \[G=\{\hat{f}_j(\cdot),\ j=1,...,m\}.\] Here \(\hat{f}_j: (X_{i,1},...,X_{i,j-1},X_{i,j+1},...,X_{i,m})\rightarrow X_{i,j},\ i=1,...,n,\ j=1,...,m,\) are the conditional distributions of \(X_{i,j}\) given all other features in the dataset. MissForest uses an iterative approach with random forest models to approximate the true \(f_j\).

Then we impute the missing data with \(G\) to obtain \(\mathcal{X}^*=(\mathbf{X}^*_1,...,\mathbf{X}_m^*)\), where \(\mathbf{X}_j^*=(\tilde{\mathbf{X}}_{j,obs},\hat{\mathbf{X}}_{j,mis})\) and \(\hat{\mathbf{X}}_{j,mis}\) follows approximately the same distribution as \(\mathbf{X}_j\).

Artificially create and impute missingness

During the imputation step we introduce artificial missing observations and subsequently employ data imputation to re-generate complete instances (chains) of the dataset. Similar to the initial imputation, we apply \(MF(\cdot)\) to approximate the true conditional distributions of the features.

We first randomly introduce missingness into the dataset after the preprocessing step:

\[\mathcal{X}^{(I)}=(\mathbf{X}_1^{(I)},...,\mathbf{X}_m^{(I)}), \, \mathbf{X}_j^{(I)}=(\tilde{\mathbf{X}}_{j,obs}^{(I)},\mathbf{N}_j^{(I)}).\] The new data possess an MCAR missing mechanism. The corresponding complete data chains following the imputation are denoted by: \[\mathcal{X}^{*(I)}=(\mathbf{X}_1^{*(I)},...,\mathbf{X}_m^{*(I)}), \,\mathbf{X}_j^{*(I)}=(\tilde{\mathbf{X}}_{j,obs}^{(I)},\mathbf{\hat{X}}_{j,mis}^{(I)}),\] where \(\hat{\mathbf{X}}_{j,mis}^{(I)}\) are obtained by \(MF(\cdot)\). Assuming we have obtained the true conditional models for all the imputations, the above two sources of obfuscation would not alter the joint distribution of all features.

Neighbour data-element Swapping

To further guarantee that obfuscation has been applied to each record, we swap data values within neighborhoods without substantially altering the joint feature distribution. We first need to determine the neighbors of each record in the dataset; we calculate the distance matrix for all cases:

\[D= \left(\begin{array}{cccc} 0 & D_{1,2} & \cdots & D_{1,n}\\ D_{2,1} & 0 & \cdots & D_{2,n}\\ \vdots & \vdots & \ddots & \vdots\\ D_{n,1} & D_{n,2} & \cdots & 0 \end{array} \right). \] Here \(D_{i,j}=D_{j,i}, \, \forall i,j\), and \(D_{i,j}\) is the distance between the \(i\)-th case and the \(j\)-th case. Letting \(D^{'}=\{D_{i,j} \mid i\leq j\}\), we define \(a_{ij}=I(D_{i,j}< min(D^{'})+sd(D^{'}))\); i.e., we use the hard threshold \(min(D^{'})+sd(D^{'})\) to restrict the neighbourhood. Hence, the neighbourhood matrix is defined as:

\[A=(A_1,...,A_n)^T=(a_{ij})_{n\times n}.\] Next, we define an index set that contains all the possible neighbours of the \(j\)-th case, \(\Omega_j=\{(i,j) \mid a_{ij}=1\}, \, j=1,...,n\). The swapping procedure can be represented by a set \(\mathcal{M}\) that contains all the mapping functions to be performed on \(\mathcal{X}^{*(I)}\).

\[\forall M\in \mathcal{M},\quad M\circ\mathcal{X}^{*(I)}=\left(\begin{array}{ccc} x_{1,1}\leftarrow x_{k_1^1,1} & \cdots & x_{1,m}\leftarrow x_{k_1^m,m}\\ \vdots & \ddots & \vdots\\ x_{n,1}\leftarrow x_{k_n^1,1} & \cdots & x_{n,m}\leftarrow x_{k_n^m,m} \end{array} \right)\circ \mathcal{X}^{*(I)},\]

where the notation \(x_{1,1}\leftarrow x_{k_1^1,1}\) suggests using the element \(x_{k^1_1,1}\) to replace \(x_{1,1}\), noting that here \(k_1^1\) depends on both the column and the row indices.

Finally, we define a random function that picks a specific neighbour from each of the neighbour sets \(\Omega_j\) generated above:

\[g(\cdot):\Omega\rightarrow \mathcal{M},\] \[g\left(\begin{array}{c} \Omega_1\\ \vdots\\ \Omega_n \end{array} \right)=M= \left(\begin{array}{ccc} x_{1,1}\leftarrow x_{i_1,1} & \cdots & x_{1,m}\leftarrow x_{i_1,m}\\ \vdots & \ddots & \vdots\\ x_{n,1}\leftarrow x_{i_n,1} & \cdots & x_{n,m}\leftarrow x_{i_n,m} \end{array} \right), \] where each pair \((i_j,j)\in \Omega_j, \, j=1,...,n\).

We define another function that selects \(k_4\%\) of the replacements to execute; it also maps into the set of swapping functions \(\mathcal{M}\), as shown below:

\[h(\cdot):\mathcal{M}\rightarrow \mathcal{M},\]

\[h\left(\left(\begin{array}{ccc} x_{1,1}\leftarrow x_{i_1,1} & \cdots & x_{1,m}\leftarrow x_{i_1,m}\\ \vdots & \ddots & \vdots\\ x_{n,1}\leftarrow x_{i_n,1} & \cdots & x_{n,m}\leftarrow x_{i_n,m} \end{array} \right)\right)=\left(\begin{array}{ccc} x_{1,1}\leftarrow x_{i_1^1,1} & \cdots & x_{1,m}\leftarrow x_{i_1^m,m}\\ \vdots & \ddots & \vdots\\ x_{n,1}\leftarrow x_{i_n^1,1} & \cdots & x_{n,m}\leftarrow x_{i_n^m,m} \end{array} \right),\]

where \[i_j^k=\begin{cases}i_j, & \text{execute the replacement}\\ j, & \text{keep the original value,} \end{cases}\] subject to the following identity:

\[\sum_{k}I(i_j^k\neq j)=k_4\%\times m,\ \forall j.\] A specific R implementation of the DataSifter method is available via the DataSifter package and the dataSifter() function. A detailed description and code are available in our GitHub repository (https://github.com/SOCR/DataSifter).


4 Data Sifter Demos

Next, we will demonstrate several complementary DataSifter applications to simulated and real datasets.


4.1 Synthetic Datasets

Simulation Experiment Design

We present three different simulation studies to demonstrate the performance of the DataSifter algorithm and assess its capability to (1) obfuscate and guard against stratification attempts for re-identification and (2) manage the overall data structure and preserve useful information in the resulting “sifted” data. In all experiments, we use a sample size of \(n=1,000\) subjects.

In the first simulation, a binary outcome \((y)\) and five covariates \((x_i,i=1,...,5)\) were simulated; \(X_1\) to \(X_4\) were independently generated by normal distributions with the following distribution specifications:

\[X_1,X_2\sim N(0,1),\; X_3\sim N(-1,1), \;\text{and } X_4 \sim N(0,2).\]

The binary variable \(X_5\) was directly dependent on \(X_1\) and \(X_2\):

\[logit(X_{5i})=0.5-4X_{1i}-X_{2i}.\] The binary outcome variable was generated as follows:

\[logit[P(y_i=1|X)]=10+10\times X_{1i}+10\times X_{2i}-5\times X_{3i}-20\times X_{4i}-15\times X_{5i}+\epsilon_i\]

where the residuals were independent and identically distributed \((iid)\), \(\epsilon_i\sim N(0,1)\), \(i=1,...,n\). Missingness for \(X_1\) and \(X_2\) was then introduced based on \(X_5\) to meet the missing-at-random (MAR) criteria, which mimicked the real data situation. Denote \(X_{i,1mis}=I(X_{i1}=NA)\) and \(X_{i,2mis}=I(X_{i2}=NA)\), where \(i\) is the subject indicator. Missingness was introduced using the following probabilities:

\[P(X_{i,1mis}=1)=P(X_{i,2mis}=1)= \begin{cases} 0.0193, & if \;X_5=y=0\\ 0.060, & if \;X_5+y=1\\ 0.003, & if \;X_5+y=2 \end{cases}\]

As mentioned in the Imputation section, we can impute the original missing values in the dataset prior to applying the subsequent DataSifter algorithmic steps. However, to handle the original missingness, we have to consider MAR missingness.
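A sketch of how this first simulation could be generated in R (our reconstruction from the description above, not the exact script used for the published experiments):

set.seed(1234)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n, mean = -1); x4 <- rnorm(n, sd = sqrt(2))  # N(0,2) read as variance 2
x5 <- rbinom(n, 1, plogis(0.5 - 4 * x1 - x2))        # binary X5 depends on X1 and X2
y  <- rbinom(n, 1, plogis(10 + 10*x1 + 10*x2 - 5*x3 - 20*x4 - 15*x5 + rnorm(n)))
# MAR missingness in X1 and X2 driven by (X5, y), per the probabilities above
p_miss <- ifelse(x5 + y == 0, 0.0193, ifelse(x5 + y == 1, 0.060, 0.003))
x1[runif(n) < p_miss] <- NA
x2[runif(n) < p_miss] <- NA
sim1 <- data.frame(y, x1, x2, x3, x4, x5)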

The second simulation demonstrates an example of count outcomes. A Poisson model was used to generate the data.

\[P(Y_i=n)=\frac{\lambda_i^n}{n!}\times e^{-\lambda_i}\]

where

\[log(\lambda_i)=0.2+0.5\times x_1+1\times x_2-0.5\times x_3-1\times x_4-1.5\times x_5+\epsilon_i\] with \(iid\) residuals (i.e., \(\epsilon_i\sim N(0,1)\)). The covariates \(x_i, i=1,2,3,4,\) were generated using uniform distributions. We constructed \(x_5\) based on \(x_1\) and \(x_2\) and used a similar strategy as in the first binary simulation to introduce missingness.

The third simulation involves continuous outcomes, where the response \(y\) is generated by a similar linear model as in the first experiment; however, it uses an identity link yielding a continuous outcome:

\[y=10+10\times x_1+10\times x_2-5\times x_3-20\times x_4-15\times x_5+\epsilon_i\] Again, the residuals were \(iid\), \(\epsilon_i\sim N(0,1)\). All covariates were generated from uniform distributions and the missing patterns were stochastically determined as in the first binary experiment.

For all simulation studies, we focused on verifying whether the “sifted” output datasets preserve a certain level of the energy that was present in the original true signals, relative to null signals. In addition, we examined the trade-offs between the level of obfuscation and the residual value (utility) of the resulting “sifted” data as a measure of the algorithm’s performance. To make all three simulations more realistic, we augmented the original outcome and the (real) five covariates, with 20 additional null features that acted as decoy or “noisy” control features. All 20 null features were uniformly distributed with various ranges and were independent of the outcome.

DataSifter Validation

For each simulation, we derived 30 “sifted” datasets under a range of privacy levels, from “none” to “indep” levels of obfuscation. To assess the privacy protection ability, we measured the Percent of Identical Feature Values (PIFV) between the “sifted” outcome and the original data for all the cases under each obfuscation level, i.e., we compared each subject’s original and “sifted” records and measured the ratio between the number of identical values and the total number of features. To determine utility preservation, we used regularized linear models, with an elastic net regularization term, to identify the salient variables. Internal 10-fold statistical cross-validation was used to validate the results of the elastic net feature selection. \(\mathbf{X}\) denotes the covariate matrix (subjects \(\times\) features = 1,000 \(\times\) 25), \(\mathbf{y}\) is the outcome, and \(\boldsymbol{\beta}\) denotes the elastic net parameter estimates obtained by optimizing the following objective function:

\[\boldsymbol{\beta}_{enet}=\arg\min_{\boldsymbol{\beta}}\ (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda\left\{ \alpha\lVert \boldsymbol{\beta} \rVert_1+(1-\alpha)\lVert\boldsymbol{\beta}\rVert_2^2\right\},\]

where \(\alpha\) is the parameter determining the blend of the LASSO and Ridge contributions to the penalty, and \(\lambda\) is the regularization penalty parameter [26]. In our experiments, we used \(\alpha=0.8\), giving a slight dominance to the LASSO penalty.

A regularization parameter tuning procedure was also performed, using the misclassification error rate for the binary simulation, the deviance for the count simulation, and the mean squared error for the continuous simulation. The largest \(\lambda\) value within one standard error of the minimum cross-validated error was selected as the optimal parameter [26]. When an estimated coefficient differed from zero, we considered this evidence that the corresponding feature represented a “true” predictor. On the other hand, zero coefficient estimates corresponded to “false” predictors. Recall that in all simulations there were five true predictors and 20 null variables. The true positives (number of true features identified) and the false positives (number of null features identified as true predictors) were recorded for all experiments and each privacy level.
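A minimal sketch of this validation step using the glmnet package (sifted_sim1 is a hypothetical sifted copy of the binary simulation with its 25 candidate predictors; not the exact analysis script):

library(glmnet)
X <- as.matrix(sifted_sim1[, setdiff(names(sifted_sim1), "y")])
cvfit <- cv.glmnet(X, sifted_sim1$y, family = "binomial", alpha = 0.8,
                   type.measure = "class", nfolds = 10)   # 10-fold CV, misclassification error
b <- as.matrix(coef(cvfit, s = "lambda.1se"))   # largest lambda within 1 SE of the minimum CV error
selected <- rownames(b)[b[, 1] != 0]
setdiff(selected, "(Intercept)")                # features flagged as "true" predictors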


4.2 Preserving Utility Information

We used simulated datasets to test the algorithm’s ability to maintain utility information. A binary outcome \((y)\) and five independently and identically distributed covariates (\(x_i, i=1,...,5\)) are simulated. All covariates are uniformly distributed between 0 and 1. The binary outcome is generated by the following formula:

\[logit[P(y=1|X)]=10+10*x_1+10*x_2-5*x_3-20*x_4+5*x_5.\]

We added an unstructured variable to the simulated data to meet the required data format. The DataSifter outputs generated for 11 \(\eta\) values, \(0\leq \eta \leq 1\), are reported below. The parameter estimates, prediction accuracy, and AUC (Area Under the ROC curve) are computed and compared with the model built from the original data (prior to obfuscation). Note that for \(\eta=0\) we have the logistic regression fitted on the original data, whereas \(\eta=1\) corresponds to null synthetic data. Moreover, the prediction accuracy is calculated using the original data as test data. We found that the DataSifter preserved a fair amount of the information energy for moderate \(\eta\) values, whereas for increasing \(\eta\) the algorithm compromises the energy of the data.

The parameter estimates and model prediction accuracy for logistic regressions are listed in Table 1. From small to large \(\eta\), the parameter estimates gradually diverge from the original data. The prediction accuracy is very high for all models.

Model            (Intercept)    x_1     x_2     x_3      x_4      x_5     AUC
Original Model      8.83        7.39    7.86   -4.45   -15.77   -12.75    1
eta = 0.1           2.94        2.35    2.09   -1.15    -4.74    -4.13    0.99
eta = 0.2           0.722       0.53    0.5    -0.033   -1.131   -0.405   0.919
eta = 0.3           1.23        0.15    0.23   -0.27    -1.01    -1.94    0.94
eta = 0.4           1.72        2.75    0.77   -0.67    -3.95    -0.84    0.93
eta = 0.5          -0.84        1.46    0.42   -1.84    -3.42     0.62    0.86
eta = 0.6           1.12        0.34    0.19   -0.2     -0.96    -0.88    0.88
eta = 0.7           0.644       0.907   0.347  -0.052   -0.846   -0.044   0.899
eta = 0.8           3.21        2.17    1.47   -2.03    -7.6     -0.99    0.85
eta = 0.9          -2.55        5.44    0.28   -1.29    -0.82     3.24    0.74
eta = 1             0.169       0.078  -0.09   -0.085    0.044   -0.164   0.454

Table 1: Model Comparisons.


4.3 The ABIDE dataset

The DataSifter privacy protection power relies heavily on the user-defined privacy level and the data structure. We measure the privacy level for each subject as the percent of identical feature values (PIFV) between the sifted and the original data. When \(\eta\) is close to one, PIFVs are close to 0% for numerical features. For categorical features, the algorithm is able to provide PIFVs similar to the lowest PIFV between any pair of different subjects in the original dataset.
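The package computes this metric via pctMatch(); a base-R equivalent of the definition (assuming equal-dimension original and sifted data frames, e.g., data1 and sifteddata from the demo) is simply:

pifv <- function(original, sifted) {
  rowMeans(original == sifted, na.rm = TRUE)  # per-subject fraction of identical feature values
}
summary(pifv(data1, sifteddata))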

We demonstrate the functionality of the DataSifter on the Autism Brain Imaging Data Exchange (ABIDE) dataset. The ABIDE dataset represents a multi-institutional effort for aggregating and sharing the imaging, clinical, and phenotypic data of 1,112 volunteers. The data include resting-state functional magnetic resonance imaging (rs-fMRI), structural MRI, and phenotypic information for 539 patients (autism spectrum disorder) and 573 age-matched asymptomatic controls. In our study, we selected a subsample of 1,098 patients including 528 autism spectrum disorder (ASD) cases and 570 controls. The dataset has 500 structural MRI biomarkers and phenotypic information such as age, sex, and IQ. It is a very challenging case-study due to the heterogeneity of the data, the format of the data elements, and the complexity of mental health phenotypes. We use the ABIDE data to showcase the performance of the DataSifter technique on a convoluted multiplex study. The ABIDE dataset comprises 1,098 patients and 506 features. We included one unstructured feature, the “image data file name” (“Data”), in the dataset to demonstrate the DataSifter’s ability to obfuscate unstructured text elements. Resembling the simulation experiments, we applied the dataSifter() function at five different levels of obfuscation to demonstrate the obfuscation-utility trade-off.

Figure 3 illustrates the relationship between median PIFVs for the ABIDE dataset and user-defined privacy levels.

The clear negative relationship between the two characteristics indicates that the data governor’s specification of \(\eta\) does provide increasing patient privacy protection levels. When a user requires higher levels of obfuscation, the PIFVs approach zero percent. In this scenario, patient privacy is highly protected: it is almost impossible for a data hacker to filter out targeted patients via known feature values, because imputed or exchanged values cannot be distinguished from the real feature values. Moreover, the untouched proportion of the data is relatively small. Although a data hacker might issue multiple queries, the overlapping untouched proportion would be small when \(\eta\) is high. The small proportion of “true” values protects patient identity while still allowing the data user to issue a small number of overlapping queries.

Figure 3: Median Percent Feature Match under Different Etas-ABIDE data

Figure 4 shows the PIFV for simulated data. Using 6 variables, the median PIFV has a gradually decreasing pattern as \(\eta \rightarrow 1\). Note that since the feature set includes a binary variable, the data simulated from the empirical distribution (\(\eta=1\)) still has a variable \(y\) matching the original dataset for the majority of cases.

Figure 4: Median Percent Feature Match under Different Etas-Simulated data


5 Acknowledgments

We thank all the co-authors of the Data Sifter manuscript, namely Yi Zhao, Lu Wang, Lu Wei and Qiuncheng Wu, for contributing to the Data Sifter project in many ways. We also thank Yi Zhao for useful insights on the DataSifter mathematical formulation. More details can be found in the manuscript: Marino, S, Zhou, N, Zhao, Y, Wang, L, Wu, Q, and Dinov, ID (2018). DataSifter: Statistical Obfuscation of Electronic Health Records and Other Sensitive Datasets. Journal of Statistical Computation and Simulation, pp. 1-23, DOI: 10.1080/00949655.2018.1545228.

A US Patent has also been filed, namely US Patent Application 16/051881 (August 1, 2018).

