SOCR: Data Value Metric (DVM)

The SOCR DVM project provides a novel measure, called the Data Value Metric (DVM), to quantify the utility and energy, or information content, of large and complex datasets. DVM can be used to determine whether appending, expanding, or otherwise augmenting the data size or complexity is likely to be beneficial in a specific application domain.
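
The question of whether more samples or more features help can be illustrated with a small grid sweep. The following is a minimal sketch under stated assumptions, not the DVM itself: it scores each (sample size, feature count) pair by the held-out accuracy of a simple KNN classifier on synthetic two-class data, as a stand-in for a data-value score. The function names (`simulate`, `knn_predict`, `holdout_accuracy`, `sweep`) and all parameter values are hypothetical and chosen only for illustration; the actual metric is defined in the referenced paper.

```python
import random


def simulate(n_samples, n_features, n_informative=2, seed=0):
    """Two-class Gaussian data; only the first n_informative features carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        row = [
            rng.gauss(2.0 * label if j < n_informative else 0.0, 1.0)
            for j in range(n_features)
        ]
        X.append(row)
        y.append(label)
    return X, y


def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def holdout_accuracy(n_samples, n_features, seed=0):
    """Stand-in score: accuracy on a held-out 30% split. This is NOT the DVM."""
    X, y = simulate(n_samples, n_features, seed=seed)
    split = int(0.7 * n_samples)
    hits = sum(
        knn_predict(X[:split], y[:split], x) == label
        for x, label in zip(X[split:], y[split:])
    )
    return hits / (n_samples - split)


def sweep(sample_sizes, feature_counts):
    """Score every (sample size, feature count) pair on the grid."""
    return {
        (n, p): holdout_accuracy(n, p)
        for n in sample_sizes
        for p in feature_counts
    }


grid = sweep(sample_sizes=[50, 100, 200], feature_counts=[2, 10, 20])

# One row per sample size, one column per feature count.
for n in (50, 100, 200):
    print(n, [round(grid[(n, p)], 2) for p in (2, 10, 20)])
```

Plotting such a grid along one axis gives a sample-size curve or a number-of-features curve, and plotting it along both gives a surface. The added features beyond the first two are pure noise here, so the sketch also shows how augmenting the data does not necessarily improve the score.
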

Table 1: Test results by signal profile (Supervised methods & Strong signals)

| Dataset | Method | Sample-size plot | Number of features plot | 3D DVM plot |
|---|---|---|---|---|
| MNIST | KNN | MNIST_KNN_Sample | MNIST_KNN_Feature | MNIST_KNN_3D_DVM_Surface |
| MNIST | Boosting | MNIST_Boosting_Sample | MNIST_Boosting_Feature | MNIST_Boosting_3D_DVM_Surface |
| MNIST | Linear Regression | MNIST_LinearRegression_Sample | MNIST_LinearRegression_Feature | MNIST_LinearRegression_3D_DVM_Surface |
| MNIST | Lasso Regression | MNIST_LassoRegression_Sample | MNIST_LassoRegression_Feature | MNIST_LassoRegression_3D_DVM_Surface |
| Simulated | KNN | Simulated_KNN_Sample | Simulated_KNN_Feature | Simulated_KNN_3D_DVM_Surface |
| Simulated | Linear Regression | Simulated_LinearRegression_Sample | Simulated_LinearRegression_Feature | Simulated_LinearRegression_3D_DVM_Surface |
| Simulated | Lasso Regression | Simulated_LassoRegression_Sample | Simulated_LassoRegression_Feature | Simulated_LassoRegression_3D_DVM_Surface |


Table 2: Test results by signal profile (Supervised methods & Weak signals)

| Dataset | Method | Sample-size plot | Number of features plot | 3D DVM plot |
|---|---|---|---|---|
| ALS | KNN | ALS_KNN_Sample | ALS_KNN_Feature | ALS_KNN_3D_DVM_Surface |
| ALS | Boosting | ALS_Boosting_Sample | ALS_Boosting_Feature | ALS_Boosting_3D_DVM_Surface |
| ALS | Random Forest | ALS_RandomForest_Sample | ALS_RandomForest_Feature | ALS_RandomForest_3D_DVM_Surface |
| ALS | Linear Regression | ALS_LinearRegression_Sample | ALS_LinearRegression_Feature | ALS_LinearRegression_3D_DVM_Surface |
| ALS | Lasso Regression | ALS_LassoRegression_Sample | ALS_LassoRegression_Feature | ALS_LassoRegression_3D_DVM_Surface |
| Simulated | KNN | Simulated_KNN_Sample | Simulated_KNN_Feature | Simulated_KNN_3D_DVM_Surface |
| Simulated | Boosting | Simulated_Boosting_Sample | Simulated_Boosting_Feature | Simulated_Boosting_3D_DVM_Surface |
| Simulated | Linear Regression | Simulated_LinearRegression_Sample | Simulated_LinearRegression_Feature | Simulated_LinearRegression_3D_DVM_Surface |
| Simulated | Lasso Regression | Simulated_LassoRegression_Sample | Simulated_LassoRegression_Feature | Simulated_LassoRegression_3D_DVM_Surface |


Table 3: Test results by signal profile (Unsupervised methods & Strong signals)

| Dataset | Method | Sample-size plot | Number of features plot | 3D DVM plot |
|---|---|---|---|---|
| MNIST | KMeans | MNIST_KMeans_Sample | MNIST_KMeans_Feature | MNIST_KMeans_3D_DVM_Surface |
| MNIST | Affinity Propagation | MNIST_AP_Sample | MNIST_AP_Feature | MNIST_AP_3D_DVM_Surface |
| MNIST | Agglomerative Clustering | MNIST_Agglomerative_Sample | MNIST_Agglomerative_Feature | MNIST_Agglomerative_3D_DVM_Surface |
| Simulated | KMeans | Simulated_KMeans_Sample | Simulated_KMeans_Feature | Simulated_KMeans_3D_DVM_Surface |
| Simulated | Affinity Propagation | Simulated_AP_Sample | Simulated_AP_Feature | Simulated_AP_3D_DVM_Surface |


Table 4: Test results by signal profile (Unsupervised methods & Weak signals)

| Dataset | Method | Sample-size plot | Number of features plot | 3D DVM plot |
|---|---|---|---|---|
| ALS | Affinity Propagation | ALS_AP_Sample | ALS_AP_Feature | ALS_AP_3D_DVM_Surface |
| ALS | Agglomerative Clustering | ALS_Agglomerative_Sample | ALS_Agglomerative_Feature | ALS_Agglomerative_3D_DVM_Surface |
| Simulated | Affinity Propagation | Simulated_AP_Sample | Simulated_AP_Feature | Simulated_AP_3D_DVM_Surface |


Reference: Noshad, M., Choi, J., Sun, Y., Hero, A., Dinov, I.D. (2021). A data value metric for quantifying information content and utility. Journal of Big Data. DOI: 10.1186/s40537-021-00446-6 (in print).