R - Efficiently counting numbers falling within each range of numbers


I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide code to simulate large data, since that's the point of this question. The actual problem size is a list of length 1 million.

Say I have 2 lists, as shown below:

    x <- list(c(82, 18), c(35, 50, 15))
    y <- list(c(1,2,3,55,90), c(37,38,95))

Properties of x and y:

  • Each element of list x sums to 100.
  • Each element of y is sorted and lies between 1 and 100.

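The problem:

Now, what I'd like is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the number between 83 and 100 is c(90). Similarly, for x[[2]] and y[[2]] the boundaries are 35, 85 and 100, giving c(0, 2, 1). That is, the answer should be: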

    [[1]]
    [1] 4 1

    [[2]]
    [1] 0 2 1

Let me know if this is still unclear.
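For reference, the interval rule described above can be written down directly in base R. This is only a small baseline sketch for checking results on the toy lists, not one of the approaches timed in this question; `count_in_ranges` is a hypothetical helper name, and it assumes an R version whose `findInterval()` supports the `left.open` argument.

```r
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1, 2, 3, 55, 90), c(37, 38, 95))

# Hypothetical helper: count how many values of `a` fall into each
# interval (0, u1], (u1, u2], ... where u = cumsum(b).
count_in_ranges <- function(a, b) {
  upper <- cumsum(b)
  # With left.open = TRUE, findInterval() returns i such that
  # upper[i] < a <= upper[i + 1], so i + 1 is the interval index.
  bin <- findInterval(a, upper, left.open = TRUE) + 1L
  # tabulate() keeps empty intervals as explicit zeros.
  tabulate(bin, nbins = length(upper))
}

res <- Map(count_in_ranges, y, x)
print(res)
```

A value equal to a boundary (for example 82 in the first pair) is counted in the lower interval, which matches the `<=` convention used in the question.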


Simulated data with 1 million entries

    set.seed(1)
    N <- 100   # values range from 1 to N
    n <- 1e6   # number of list elements
    len <- sample(2:3, n, TRUE)

    x <- lapply(seq_len(n), function(ix) {
        probs <- sample(100:1000, len[ix])
        probs <- probs/sum(probs)

        oo <- round(N * probs)
        # rounding can leave the total off by a little; fix it on the first entry
        if (sum(oo) != 100) {
            oo[1] <- oo[1] + (100 - sum(oo))
        }
        oo
    })

    require(data.table)
    ss <- sample(1:10, n, TRUE)
    dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
    setkey(dt, grp, val)
    y <- dt[, list(list(val)),by=grp]$V1

What I've done so far:

Using mapply (slow):

I first thought of using rank with ties.method="first" and mapply (the obvious choice with 2 lists), and tried this:

    tt1 <- mapply(y, x, FUN=function(a,b) {
        tt <- rank(c(a, cumsum(b)), ties="first")[-(1:length(a))]
        c(tt[1]-1, diff(tt)-1)
    })

Although this works fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds up. It takes 241 seconds!
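To make the rank trick easier to follow, here is the same computation traced on the two small pairs from the example. It is just an illustrative walk-through of the idea, wrapped in a hypothetical helper named `rank_counts`; it makes no claim about speed.

```r
# Hypothetical helper wrapping the rank-based computation for one pair.
rank_counts <- function(a, b) {
  # Rank the y values together with the cumulative boundaries. With
  # ties.method = "first", a y value equal to a boundary ranks before
  # the boundary, so it is counted as "<= boundary".
  r <- rank(c(a, cumsum(b)), ties.method = "first")
  # Keep only the ranks of the boundaries.
  tt <- r[-seq_along(a)]
  # The first boundary has (rank - 1) y values below it; each later
  # boundary has (gap to the previous boundary - 1) y values in between.
  c(tt[1] - 1, diff(tt) - 1)
}

first  <- rank_counts(c(1, 2, 3, 55, 90), c(82, 18))    # boundaries 82, 100
second <- rank_counts(c(37, 38, 95), c(35, 50, 15))     # boundaries 35, 85, 100
print(first)
print(second)
```

For the first pair the boundaries 82 and 100 get ranks 5 and 7 in the combined vector, which yields 5 - 1 = 4 and 7 - 5 - 1 = 1.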

Therefore, I decided to avoid calling rank and diff per element by using data.table and sorting with a "group" column. I came up with a longer but much faster solution, shown below:

Using data.table (faster):

    xl <- sapply(x, length)
    yl <- sapply(y, length)
    xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
    xdt[, cumval := cumsum(val), by=grp]
    ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
    # stack y values and x boundaries, then sort them together within each group
    tt2 <-rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
    setkey(tt2, grp, val)
    # row positions of the boundaries in the sorted table
    xdt.pos <- which(tt2$type == "x")
    tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
    # subtract each group's starting offset to get within-group positions
    tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
    tt2[, val := xdt$val]

    # time consuming step
    tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
    tt2 <- tt2[, list(list(V1)),by=grp]$V1

This takes 26 seconds, so it's about 9 times faster. I'm wondering if it's possible to get much more speedup, as I'll have to repeat this computation on 5-10 such lists of 1 million elements. Thank you.

Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.

    # compile the data.table's, set appropriate keys
    xl <- sapply(x, length)
    yl <- sapply(y, length)
    xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
    xdt[, cumval := cumsum(val), by=grp]
    ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))

    # hack #0, set the key but prevent sorting, since we know the data is already sorted
    setattr(ydt, 'sorted', c('grp', 'val'))

    # by setting the key in y to val and in x to cumval we can
    # leverage rolling joins
    setattr(xdt, 'sorted', c('grp', 'cumval'))  # hack #1 set the key, but prevent sorting
    vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]

    # hack #2, same deal as above
    # we know that the order of cumval and cumval.copy is the same
    # so let's convince data.table of that
    setattr(vals, 'sorted', c('grp', 'cumval.copy'))

    # compute the counts and fill in the missing 0's
    # for when there is no y in the appropriate x interval
    tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]

    # convert to a list
    tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
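The rolling-join idea can be tried on the small example from the question. The sketch below is an assumption-laden variant rather than the code above: it replaces the `setattr(..., 'sorted', ...)` key hacks with an explicit `on =` join, which is simpler to read but gives up the sort-avoidance those hacks are for, so it says nothing about timing.

```r
library(data.table)

x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1, 2, 3, 55, 90), c(37, 38, 95))

xdt <- data.table(grp = rep(seq_along(x), lengths(x)), val = unlist(x))
xdt[, cumval := cumsum(val), by = grp]
# Keep a copy of the boundary: after the join, the join column `cumval`
# holds the y value that was looked up, not the matched boundary.
xdt[, cumval.copy := cumval]

ydt <- data.table(grp = rep(seq_along(y), lengths(y)), val = unlist(y))

# roll = -Inf rolls the next boundary backwards, so each y value is
# matched to the smallest boundary that is >= that value in its group.
vals <- xdt[ydt, on = .(grp, cumval = val), roll = -Inf]

# Count y values per matched boundary, then join back to xdt so that
# intervals containing no y values appear with a count of 0.
counts <- vals[, .N, by = .(grp, cumval.copy)]
filled <- counts[xdt, on = .(grp, cumval.copy)][is.na(N), N := 0L]

res <- unname(split(filled$N, filled$grp))
print(res)
```

The join back to `xdt` is what produces the leading 0 in the second group, where no value of y[[2]] is <= 35.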
