SOCR ≫ | DSPA ≫ | DSPA2 Topics ≫ |
This DSPA Appendix shows examples of dealing with extremely large datasets using compressive big data analytics and stochastic sampling from flat-files containing structured second-order tensor data.
Suppose we have an unusually large flat-file (CSV) that includes a structured tensor data with \(10^7\) rows (lines) and \(10^5\) columns (variables/features), of size over \(1TB\), which makes it difficult to impractical to process on most computers in real time. Often, we need an efficient mechanism to randomly sample \(n\) cases (rows) and \(k\) features (columns), a subset denoted by \(\mathbb{I}\), that can be extracted automatically from the large file without reading in the entire file, which may be impractical. One approach is to to skip over intermediate lines (rows) and features (columns) and only read from the file the randomly sampled indices \(\{(row_{(r)},column_{(c)})\}_{(r,c)\in\mathbb{I}}\).
Let’s outline of an approach and include an illustrative
R
code with supporting \(C/C^{++}\) backend functionality, that
allows efficient data subsampling sampling from very large CSV files
without loading the entire data into memory. The core protocol includes
the following 3 steps.
seek()
to jump
directly to the start of any chosen row.This stochastic sampling mechanism minimizes the memory demand (yet, increases the computational complexity) by avoiding reading the entire dataset or even entire lines’ worth of unneeded columns.
CSV files are line-oriented, so accessing a given row \(r\) requires knowing the byte offset of its starting position in the file. We must preprocess the file to build a line offset index. In principle, CSV files include column-delimited text. Without fixed-width fields or a precomputed column index, we can’t jump directly to a column. However, once we’ve jumped to the line, we can parse only the necessary columns by scanning through the line.
This implementation is in R
and utilizes a supplemental
C
code for efficiency.
C
routine (no full parsing, just a fast newline
count).R
.R
(reading
line-by-line once, but efficiently).R
.C
routine for speed and efficiency.This code is a template and may need adaptation for specific environments and case-studies. The CSV file is assumed to be newline-terminated lines and comma-separated. The error handling is minimal and may need to be expanded to gracefully handle run-time errors.
C
Code
(count_lines.c)This C
code defines a function to count the number
of lines in a given file by scanning through it and counting
newline characters. We will compile it into a shared library and call it
from R
using .Call
.
count_lines.c
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Rconn.h>
#include <stdio.h>
// [[register]]
SEXP count_lines(SEXP r_file_path) {
const char *file_path = CHAR(STRING_ELT(r_file_path, 0));
FILE *fp = fopen(file_path, "rb");
if (!fp) {
Rf_error("Could not open file: %s", file_path);
}
char buf[65536];
size_t nread;
long long line_count = 0;
while ((nread = fread(buf, 1, sizeof(buf), fp)) > 0) {
for (size_t i = 0; i < nread; i++) {
if (buf[i] == '\n') line_count++;
}
}
fclose(fp);
return ScalarReal((double)line_count);
}
sourceCpp(...)
without relying on
external shell scripts or R CMD SHLIB
.This approach uses Rcpp::sourceCpp(code = " ... ")
inline, so that the entire code snippet works in the R
session and is compiled on-the-fly. The code defines two exported
functions:
count_lines_cpp(file_path)
: Counts the number of lines
in a file efficiently by scanning for newline characters. This \(C++\) function
count_lines_cpp()
is compiled via
Rcpp::sourceCpp
and used to determine the total number of
lines in the CSV file without fully parsing it.Once we know the total number of lines (num_rows), we prompt
or use a user-specified number of cases (\(n\)) to randomly sample those lines. This
is accomplished by building a line offset index to allow direct random
access to the sampled lines without reading the entire file. For each
sampled line, we read in the entire row and then determine the number of
features (columns) by counting the delimiters (commas), and report the
number of columns for each sampled row. This code does not rely on
reading just the first line for column count. Instead, it obtains the
column count per each sampled row independently. The \(C++\) code that contains
count_lines_cpp()
has to be sourced into R
via
Rcpp::sourceCpp
and we assume the CSV is comma-separated
and well-formed.
The supporting parse_line.cpp file is available here.
# # Ensure Rcpp is available
# if (!requireNamespace("Rcpp", quietly = TRUE)) {
# install.packages("Rcpp")
# }
library(Rcpp)
# 1. Define and compile parse_line_cpp function, save it in external file: parse_line.cpp
# We require the user to specify both 'sep' and 'quote' at runtime.
# 'sep' is a single character delimiter, e.g. ","
# 'quote' is either empty "" for no quoting or a single character, e.g. "\""
# Rcpp::sourceCpp(code='
# #include <Rcpp.h>
# using namespace Rcpp;
#
# // [[Rcpp::export]]
# CharacterVector parse_line_cpp(String line, String sep, String quote) {
# std::string str = Rcpp::as<std::string>(line);
#
# if (sep.size() != 1) {
# Rf_error("Delimiter must be a single character.");
# }
# char delimiter = sep.get_cstring()[0];
#
# bool use_quote = !quote.empty();
# char quote_char = '\0';
# if (use_quote) {
# std::string q = Rcpp::as<std::string>(quote);
# if (q.size() != 1) {
# Rf_error("Quote must be a single character if provided.");
# }
# quote_char = q[0];
# }
#
# std::vector<std::string> fields;
# fields.reserve(10);
#
# size_t line_len = str.size();
# size_t pos = 0;
# bool in_quotes = false;
# std::string current_field;
# current_field.reserve(64);
#
# while (pos <= line_len) {
# bool end_of_line = (pos == line_len);
# char c = end_of_line ? '\0' : str[pos];
#
# if (in_quotes) {
# // Inside quoted field
# if (end_of_line) {
# // End of line but still in quotes:
# in_quotes = false;
# fields.push_back(current_field);
# current_field.clear();
# break;
# } else if (c == quote_char) {
# // Could be end of quotes or escaped quotes
# if (pos + 1 < line_len && str[pos+1] == quote_char) {
# // Escaped quote: add one quote_char
# current_field.push_back(quote_char);
# pos += 2;
# continue;
# } else {
# // End of quoted section
# in_quotes = false;
# pos++;
# continue;
# }
# } else {
# // Normal char inside quotes
# current_field.push_back(c);
# pos++;
# }
# } else {
# // Not inside quotes
# if (end_of_line || c == delimiter) {
# // End of field
# fields.push_back(current_field);
# current_field.clear();
# pos++;
# } else if (use_quote && c == quote_char) {
# // Start of quoted field
# in_quotes = true;
# pos++;
# } else {
# // Normal char outside quotes
# current_field.push_back(c);
# pos++;
# }
# }
# }
#
# CharacterVector out(fields.size());
# for (size_t i = 0; i < fields.size(); i++) {
# out[i] = fields[i];
# }
# return out;
# }
# ')
# Source the external CPP parse_line_cpp() function
Rcpp::sourceCpp(file="parse_line.cpp")
# 2. Functions to build line offsets in binary mode
# Function to scan the entire file in binary mode, finds all '\n' positions
build_line_index_binary <- function(file_path) {
con <- file(file_path, "rb")
on.exit(close(con))
chunk_size <- 65536
pos <- 0
offsets <- numeric(1e6) # initial guess, expand if needed
count <- 0
repeat {
buf <- readBin(con, what="raw", n=chunk_size)
nread <- length(buf)
if (nread == 0) break
nl_positions <- which(buf == as.raw(0x0A)) # '\n'
for (nlpos in nl_positions) {
count <- count + 1
if (count > length(offsets)) {
offsets <- c(offsets, numeric(length(offsets)))
}
offsets[count] <- pos + (nlpos - 1)
}
pos <- pos + nread
if (nread < chunk_size) break
}
offsets[1:count]
}
# Convert newline offsets to line start offsets
build_line_offsets <- function(file_path) {
nl_offsets <- build_line_index_binary(file_path)
num_lines <- length(nl_offsets)
if (num_lines == 0) return(numeric(0))
line_offsets <- numeric(num_lines)
line_offsets[1] = 0
if (num_lines > 1) {
line_offsets[2:num_lines] = nl_offsets[1:(num_lines-1)] + 1
}
line_offsets
}
# 3. Read a specific line by offset, using line_offsets[i], the byte start of line i
read_line_at <- function(fileCon, line_offsets, i, fsize) {
start_pos <- line_offsets[i]
end_pos <- if (i < length(line_offsets)) line_offsets[i+1] - 1 else (fsize - 1)
if (end_pos < start_pos) stop("Invalid line offsets, end_pos < start_pos.")
length_to_read <- end_pos - start_pos + 1
seek(fileCon, where = start_pos, origin = "start")
raw_data <- readBin(fileCon, what = "raw", n = length_to_read)
line_str <- rawToChar(raw_data)
# Normalize line endings: remove trailing \r or \r\n or \n
line_str <- sub("\r?\n$", "", line_str)
line_str
}
Using this syntax, file_path <- "/path/data.csv"
, we
can ingest a large data from local (NSF) or remote (URL) partition. In
this case we will use a test dataset that represents a 2D tensor of size
\((502236, 23960)\) saved as a flat CSV
file, \(40GB\).
Next, we randomly sample \(50\) out of the \(502,236\) rows and output a dataframe of size \((50, 23960)\), part of which will be displayed as an interactive data-table. Clearly this is just an example, which demonstrated how to quickly and efficiently generate random samples from large flat files. These smaller samples can be used for CBDA, for other statistical learning strategies, and for GAIM training.
fsize <- file.size(file_path)
line_offsets <- build_line_offsets(file_path)
num_rows <- length(line_offsets)
cat("Number of rows:", num_rows, "\n")
## Number of rows: 502236
# Let's pick some random lines:
set.seed(1234)
n <- 50 # number of rows to randomly sample
k <- 100 # of random column to sample
r <- sample(num_rows, n)
cat("Random lines:", r, "\n")
## Random lines: 237392 106390 304108 408457 295846 494468 126055 382554 345167 342900 347518 461374 249732 322088 199608 148452 426463 311930 279106 380591 501115 237360 229507 425001 181238 216904 491818 164098 311375 383281 330424 131974 46301 492081 37000 124049 444539 492778 463343 371491 30928 166742 58890 413590 59011 139698 295672 67287 376316 233641
fileCon <- file(file_path, "rb")
# Read the CSV Headers, first line
i <- 1
header_line <- read_line_at(fileCon, line_offsets, i, fsize)
# parse columns
headerLN <- parse_line_cpp(header_line, ",", "\"")
str(headerLN)
## chr [1:23960] "eid" "3-0.0" "3-1.0" "3-2.0" "3-3.0" "4-0.0" "4-1.0" ...
# Test the 10-th randomly chosen line
i <- r[10]
row_line <- read_line_at(fileCon, line_offsets, i, fsize)
# parse columns
ln <- parse_line_cpp(row_line, ",", "\"")
str(ln)
## chr [1:23960] "4431681" "439" "" "724" "" "739" "" "585" "" "451" "" "636" ...
# Notes:
# The line offsets are computed in binary mode (exactly matching file structure)
# Lines are read in binary mode and converted to text after seeking to exact positions
# The function parse_line_cpp() carefully handles quoting and delimiters.
# 4. Read in the entire rows for the sampled lines and determine number of columns per each row
columns_per_row <- integer(n)
row_line <- vector() # instantiate the vector of random lines
columns_per_row <- vector() # instantiate the vector of columns for each randomly sampled row
rowElements <- vector("list", n) # A list of lists:
# 'rowElements' is indexed by the sampled lines (or by j)
# Each element 'rowElements[[j]]' is a list of column values for that line
# rowElements is a list where each element corresponds to one of the randomly chosen lines.
# Each element in rowElements is itself a list of the column values from that line.
# Fields of the first sampled line
# rowElements[[1]]
# A particular field in that line, say field #3
# rowElements[[1]][[3]]
for (j in c(1:n)) {
i <- r[j] # get the j-th random row index, i, and read the random line
row_line[j] <- read_line_at(fileCon, line_offsets, i, fsize)
# parse columns
ln <- parse_line_cpp(row_line[j], ",", "\"")
columns_per_row[j] <- length(ln)
rowElements[[j]] <- as.list(ln)
cat(sprintf("Random-Row %d represents Raw-Data Row ID %d, which has %d features (columns)\n", j, i, columns_per_row[j]))
}
## Random-Row 1 represents Raw-Data Row ID 237392, which has 23960 features (columns)
## Random-Row 2 represents Raw-Data Row ID 106390, which has 23960 features (columns)
## Random-Row 3 represents Raw-Data Row ID 304108, which has 23960 features (columns)
## Random-Row 4 represents Raw-Data Row ID 408457, which has 23960 features (columns)
## Random-Row 5 represents Raw-Data Row ID 295846, which has 23960 features (columns)
## Random-Row 6 represents Raw-Data Row ID 494468, which has 23960 features (columns)
## Random-Row 7 represents Raw-Data Row ID 126055, which has 23960 features (columns)
## Random-Row 8 represents Raw-Data Row ID 382554, which has 23960 features (columns)
## Random-Row 9 represents Raw-Data Row ID 345167, which has 23960 features (columns)
## Random-Row 10 represents Raw-Data Row ID 342900, which has 23960 features (columns)
## Random-Row 11 represents Raw-Data Row ID 347518, which has 23960 features (columns)
## Random-Row 12 represents Raw-Data Row ID 461374, which has 23960 features (columns)
## Random-Row 13 represents Raw-Data Row ID 249732, which has 23960 features (columns)
## Random-Row 14 represents Raw-Data Row ID 322088, which has 23960 features (columns)
## Random-Row 15 represents Raw-Data Row ID 199608, which has 23960 features (columns)
## Random-Row 16 represents Raw-Data Row ID 148452, which has 23960 features (columns)
## Random-Row 17 represents Raw-Data Row ID 426463, which has 23960 features (columns)
## Random-Row 18 represents Raw-Data Row ID 311930, which has 23960 features (columns)
## Random-Row 19 represents Raw-Data Row ID 279106, which has 23960 features (columns)
## Random-Row 20 represents Raw-Data Row ID 380591, which has 23960 features (columns)
## Random-Row 21 represents Raw-Data Row ID 501115, which has 23960 features (columns)
## Random-Row 22 represents Raw-Data Row ID 237360, which has 23960 features (columns)
## Random-Row 23 represents Raw-Data Row ID 229507, which has 23960 features (columns)
## Random-Row 24 represents Raw-Data Row ID 425001, which has 23960 features (columns)
## Random-Row 25 represents Raw-Data Row ID 181238, which has 23960 features (columns)
## Random-Row 26 represents Raw-Data Row ID 216904, which has 23960 features (columns)
## Random-Row 27 represents Raw-Data Row ID 491818, which has 23960 features (columns)
## Random-Row 28 represents Raw-Data Row ID 164098, which has 23960 features (columns)
## Random-Row 29 represents Raw-Data Row ID 311375, which has 23960 features (columns)
## Random-Row 30 represents Raw-Data Row ID 383281, which has 23960 features (columns)
## Random-Row 31 represents Raw-Data Row ID 330424, which has 23960 features (columns)
## Random-Row 32 represents Raw-Data Row ID 131974, which has 23960 features (columns)
## Random-Row 33 represents Raw-Data Row ID 46301, which has 23960 features (columns)
## Random-Row 34 represents Raw-Data Row ID 492081, which has 23960 features (columns)
## Random-Row 35 represents Raw-Data Row ID 37000, which has 23960 features (columns)
## Random-Row 36 represents Raw-Data Row ID 124049, which has 23960 features (columns)
## Random-Row 37 represents Raw-Data Row ID 444539, which has 23960 features (columns)
## Random-Row 38 represents Raw-Data Row ID 492778, which has 23960 features (columns)
## Random-Row 39 represents Raw-Data Row ID 463343, which has 23960 features (columns)
## Random-Row 40 represents Raw-Data Row ID 371491, which has 23960 features (columns)
## Random-Row 41 represents Raw-Data Row ID 30928, which has 23960 features (columns)
## Random-Row 42 represents Raw-Data Row ID 166742, which has 23960 features (columns)
## Random-Row 43 represents Raw-Data Row ID 58890, which has 23960 features (columns)
## Random-Row 44 represents Raw-Data Row ID 413590, which has 23960 features (columns)
## Random-Row 45 represents Raw-Data Row ID 59011, which has 23960 features (columns)
## Random-Row 46 represents Raw-Data Row ID 139698, which has 23960 features (columns)
## Random-Row 47 represents Raw-Data Row ID 295672, which has 23960 features (columns)
## Random-Row 48 represents Raw-Data Row ID 67287, which has 23960 features (columns)
## Random-Row 49 represents Raw-Data Row ID 376316, which has 23960 features (columns)
## Random-Row 50 represents Raw-Data Row ID 233641, which has 23960 features (columns)
on.exit(close(fileCon))
# 5. Convert the vector of lists, rowElements[[]], into a data frame
# As rows may have different numbers of columns, some lines may have more or fewer columns than others,
# We determine the maximum number of columns and pad shorter rows with NA
# Find the maximum number of columns among all rows
max_col <- max(sapply(rowElements, length))
# Pad each row's list of fields to max_col length with NA if needed
padded <- lapply(rowElements, function(x) {
if (length(x) < max_col) {
c(x, rep(NA, max_col - length(x)))
} else { x }
})
# Now all have the same length, convert to data frame
df <- data.frame(do.call(rbind, lapply(padded, unlist)), stringsAsFactors = FALSE)
colnames(df) <- headerLN
str(df)
## 'data.frame': 50 obs. of 23960 variables:
## $ eid : chr "3375674" "2064726" "4043400" "5087698" ...
## $ 3-0.0 : chr "356" "441" "441" "310" ...
## $ 3-1.0 : chr "" "" "" "" ...
## $ 3-2.0 : chr "" "" "" "" ...
## $ 3-3.0 : chr "" "" "" "" ...
## $ 4-0.0 : chr "589" "516" "315" "510" ...
## $ 4-1.0 : chr "" "" "" "" ...
## $ 4-2.0 : chr "" "" "" "" ...
## $ 4-3.0 : chr "" "" "" "" ...
## $ 5-0.0 : chr "160" "180" "627" "289" ...
## $ 5-1.0 : chr "" "" "" "" ...
## $ 5-2.0 : chr "" "" "" "" ...
## $ 5-3.0 : chr "" "" "" "" ...
## $ 6-0.0 : chr "6" "10" "29" "16" ...
## $ 6-1.0 : chr "" "" "" "" ...
## $ 6-2.0 : chr "" "" "" "" ...
## $ 6-3.0 : chr "" "" "" "" ...
## $ 19-0.0 : chr "" "" "" "" ...
## $ 21-0.0 : chr "1" "1" "1" "1" ...
## $ 21-1.0 : chr "" "" "" "" ...
## $ 21-2.0 : chr "" "" "" "" ...
## $ 21-3.0 : chr "" "" "" "" ...
## $ 23-0.0 : chr "0" "0" "" "0" ...
## $ 23-1.0 : chr "" "" "" "" ...
## $ 23-2.0 : chr "" "" "" "" ...
## $ 23-3.0 : chr "" "" "" "" ...
## $ 31-0.0 : chr "1" "0" "1" "1" ...
## $ 34-0.0 : chr "1943" "1945" "1957" "1964" ...
## $ 35-0.0 : chr "1" "1" "1" "1" ...
## $ 35-1.0 : chr "" "" "" "" ...
## $ 35-2.0 : chr "" "" "" "" ...
## $ 35-3.0 : chr "" "" "" "" ...
## $ 36-0.0 : chr "004589" "001041" "001039" "003158" ...
## $ 36-1.0 : chr "" "" "" "" ...
## $ 36-2.0 : chr "" "" "" "" ...
## $ 36-3.0 : chr "" "" "" "" ...
## $ 37-0.0 : chr "" "" "" "" ...
## $ 37-1.0 : chr "" "" "" "" ...
## $ 37-2.0 : chr "" "" "" "" ...
## $ 37-3.0 : chr "" "" "" "" ...
## $ 38-0.0 : chr "006987" "006604" "007009" "007012" ...
## $ 38-1.0 : chr "" "" "" "" ...
## $ 38-2.0 : chr "" "" "" "" ...
## $ 38-3.0 : chr "" "" "" "" ...
## $ 39-0.0 : chr "002875" "002874" "002873" "002888" ...
## $ 39-1.0 : chr "" "" "" "" ...
## $ 39-2.0 : chr "" "" "" "" ...
## $ 39-3.0 : chr "" "" "" "" ...
## $ 40-0.0 : chr "001173" "001175" "001178" "001189" ...
## $ 40-1.0 : chr "" "" "" "" ...
## $ 40-2.0 : chr "" "" "" "" ...
## $ 40-3.0 : chr "" "" "" "" ...
## $ 41-0.0 : chr "001141" "001146" "001156" "001159" ...
## $ 41-1.0 : chr "" "" "" "" ...
## $ 41-2.0 : chr "" "" "" "" ...
## $ 41-3.0 : chr "" "" "" "" ...
## $ 42-0.0 : chr "03120" "04907" "03124" "04724" ...
## $ 42-1.0 : chr "" "" "" "" ...
## $ 42-2.0 : chr "" "" "" "" ...
## $ 42-3.0 : chr "" "" "" "" ...
## $ 43-0.0 : chr "001486" "001481" "001490" "002765" ...
## $ 43-1.0 : chr "" "" "" "" ...
## $ 43-2.0 : chr "" "" "" "" ...
## $ 43-3.0 : chr "" "" "" "" ...
## $ 44-0.0 : chr "004807" "006023" "006935" "006980" ...
## $ 44-1.0 : chr "" "" "" "" ...
## $ 44-2.0 : chr "" "" "" "" ...
## $ 44-3.0 : chr "" "" "" "" ...
## $ 45-0.0 : chr "004921" "002113" "006642" "002846" ...
## $ 45-1.0 : chr "" "" "" "" ...
## $ 45-2.0 : chr "" "" "" "" ...
## $ 45-3.0 : chr "" "" "" "" ...
## $ 46-0.0 : chr "30" "26" "30" "48" ...
## $ 46-1.0 : chr "" "" "" "" ...
## $ 46-2.0 : chr "" "" "" "" ...
## $ 46-3.0 : chr "" "" "" "" ...
## $ 47-0.0 : chr "35" "28" "31" "54" ...
## $ 47-1.0 : chr "" "" "" "" ...
## $ 47-2.0 : chr "" "" "" "" ...
## $ 47-3.0 : chr "" "" "" "" ...
## $ 48-0.0 : chr "94" "95" "79" "94" ...
## $ 48-1.0 : chr "" "" "" "" ...
## $ 48-2.0 : chr "" "" "" "" ...
## $ 48-3.0 : chr "" "" "" "" ...
## $ 49-0.0 : chr "110" "104" "90" "102" ...
## $ 49-1.0 : chr "" "" "" "" ...
## $ 49-2.0 : chr "" "" "" "" ...
## $ 49-3.0 : chr "" "" "" "" ...
## $ 50-0.0 : chr "198" "166" "173" "185" ...
## $ 50-1.0 : chr "" "" "" "" ...
## $ 50-2.0 : chr "" "" "" "" ...
## $ 50-3.0 : chr "" "" "" "" ...
## $ 51-0.0 : chr "147" "141" "142" "145" ...
## $ 51-1.0 : chr "" "" "" "" ...
## $ 51-2.0 : chr "" "" "" "" ...
## $ 51-3.0 : chr "" "" "" "" ...
## $ 52-0.0 : chr "1" "7" "9" "9" ...
## $ 53-0.0 : chr "2009-07-24" "2009-07-29" "2010-02-16" "2009-09-22" ...
## $ 53-1.0 : chr "" "" "" "" ...
## [list output truncated]
library(DT)
# Use the DT package to render a small portion of the randomly sampled data from the large CSV file
datatable(df[1:30, 1:30], options = list(pageLength = 25, autoWidth = TRUE))
Notes: We use Rcpp::sourceCpp()
to compile on the fly
the C++
code directly in R
at runtime.
C++
and exported to R
to scan the file once,
counts newlines, and returns the line count as a double. You can use
as.integer()
in R
if needed.C++
function
taking a CSV line and a set of column indices and returning just those
fields.This approach avoids reading the entire file and avoids loading massive chunks of unneeded data. With some preprocessing and careful coding (potentially with C++ acceleration), you can efficiently sample arbitrary sets of rows and columns from a huge CSV file.