R - Trimming big data


I am working on an issue similar to the one stated in this other posting, and I tried adapting its code to select the columns I am interested in and to fit my data file.

My issue, however, is that the resulting file has become larger than the original one, and I'm not sure the code is working the way I intended.

When I open it in SPSS, the dataset seems to have taken in the header line and then made millions of copies of the second line without end (I had to force stop the process).

I noticed there's no counter in the while loop specifying the line; might that be the cause? My background in programming in R is very limited. The file is a .csv of 4.8 GB with 329 variables and millions of rows. I only need to keep around 30 of the variables.

This is the code I used:

    ## Open separate connections to hold the cursor position
    file.in <- file('npidata_20050523-20130707.csv', 'rt')
    file.out <- file('mainoutnpidata.txt', 'wt')
    line <- readLines(file.in, n=1)
    line.split <- strsplit(line, ',')

    ## Column picking, column 1
    cat(line.split[[1]][1:11], line.split[[1]][23:25], line.split[[1]][31:33], line.split[[1]][308:311], sep = ",", file = file.out, fill = TRUE)

    ## Use a loop to read in the rest of the lines
    line <- readLines(file.in, n=1)
    while (length(line)) {
        line.split <- strsplit(line, ',')
        if (length(line.split[[1]]) > 1) {
            cat(line.split[[1]][1:11], line.split[[1]][23:25], line.split[[1]][31:33], line.split[[1]][308:311], sep = ",", file = file.out, fill = TRUE)
        }
    }
    close(file.in)
    close(file.out)

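One thing that is wrong and jumps out is that you are missing a line <- readLines(file.in, n=1) inside your while loop. Without it, line never changes, so you are stuck in an infinite loop that keeps writing the same row. Also, reading only one line at a time is going to be terribly slow.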
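As a rough base R sketch of that fix, the loop below reads inside the loop body and takes many lines per call instead of one. The function name trim_columns and the chunk_size default are hypothetical, chosen only for illustration, and the sketch assumes that no field contains an embedded comma, since strsplit() knows nothing about CSV quoting.

```r
# Hypothetical helper: copy selected columns of a CSV to a new file, chunk by chunk.
# Assumption: fields never contain embedded commas (no CSV quoting is handled).
trim_columns <- function(infile, outfile, cols, chunk_size = 10000L) {
  file.in  <- file(infile, "rt")
  file.out <- file(outfile, "wt")
  on.exit({ close(file.in); close(file.out) })
  repeat {
    # Reading inside the loop advances the connection on every pass
    lines <- readLines(file.in, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached
    lines <- lines[nzchar(lines)]          # skip blank lines
    fields <- strsplit(lines, ",", fixed = TRUE)
    kept <- vapply(fields, function(f) {
      f <- f[cols]
      f[is.na(f)] <- ""                    # strsplit drops trailing empty fields
      paste(f, collapse = ",")
    }, character(1))
    writeLines(kept, file.out)
  }
  invisible(NULL)
}
```

With the file from the question, a call might look like trim_columns('npidata_20050523-20130707.csv', 'mainoutnpidata.txt', c(1:11, 23:25, 31:33, 308:311)). The header line is treated like any other row, so it is trimmed to the same columns.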

If in your file (unlike the one in the example you linked to) every row contains the same number of columns, you could use the LaF package. This should result in something along the lines of:

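    library(LaF)
    # Detect the column layout from the start of the file
    m <- detect_dm_csv("npidata_20050523-20130707.csv", header=TRUE)
    laf <- laf_open(m)
    begin(laf)
    con <- file("mainoutnpidata.txt", 'wt')
    while (TRUE) {
      # Read the next block of rows, keeping only the selected columns
      d <- next_block(laf, columns = c(1:11, 23:25, 31:33, 308:311))
      if (nrow(d) == 0) break
      # write.csv() has no header argument; write.table() can omit column names
      write.table(d, file=con, sep=",", row.names=FALSE, col.names=FALSE)
    }
    close(con)
    close(laf)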

If the 30 columns fit into memory, you could also do:

    library(LaF)
    m <- detect_dm_csv("npidata_20050523-20130707.csv", header=TRUE)
    laf <- laf_open(m)
    d <- laf[, c(1:11, 23:25, 31:33, 308:311)]
    close(laf)

I couldn't test the code above on your file, so I can't guarantee there are no errors (let me know if there are).
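If installing a package is not an option, base R offers a different route to the in-memory variant: read.csv() skips any column whose colClasses entry is "NULL". The sketch below is only an illustration of that technique, not the LaF approach; keep_columns is a hypothetical name, and on a 4.8 GB file this is likely to be much slower than LaF.

```r
# Hypothetical helper: read only the selected columns of a CSV into a data frame.
keep_columns <- function(infile, cols) {
  # Read a single row just to count the columns
  n_cols <- ncol(read.csv(infile, nrows = 1, check.names = FALSE))
  classes <- rep("NULL", n_cols)   # "NULL" tells read.csv to skip a column
  classes[cols] <- NA              # NA lets read.csv guess the column type
  read.csv(infile, colClasses = classes, check.names = FALSE)
}
```

For the file in the question this would be called as d <- keep_columns("npidata_20050523-20130707.csv", c(1:11, 23:25, 31:33, 308:311)).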

