I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide code to simulate large data, since that's the point of this question. My actual problem size is a list of length 1 million.
Say I have 2 lists as shown below:

    x <- list(c(82, 18), c(35, 50, 15))
    y <- list(c(1,2,3,55,90), c(37,38,95))

Properties of x and y:
- Each element of list x sums to 100.
- Each element of y is sorted, with values between 1 and 100.
The problem:
Now, what I'd like to do is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the number between 83 and 100 is c(90). Similarly, for x[[2]] and y[[2]] it would be c(0, 2, 1). That is, the answer should be:

    [[1]]
    [1] 4 1

    [[2]]
    [1] 0 2 1

Let me know if this is still unclear.
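To pin down the expected output, here is a deliberately naive base R sketch that encodes the definition directly: each entry of x[[i]] defines a bin whose upper bound is the running total, and the result is how many values of y[[i]] fall into each bin. The helper name count_bins is hypothetical (it is not part of the question), and this version is only meant as a reference for checking faster approaches, not as a fast solution.

```r
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1, 2, 3, 55, 90), c(37, 38, 95))

# Hypothetical helper: count the values of `a` in each bin (lower, upper],
# where the upper bounds are the running totals of `b`.
count_bins <- function(a, b) {
  upper <- cumsum(b)
  lower <- c(0, head(upper, -1))
  vapply(seq_along(upper),
         function(j) sum(a > lower[j] & a <= upper[j]),
         integer(1))
}

res <- Map(count_bins, y, x)
res[[1]]  # 4 1
res[[2]]  # 0 2 1
```

Because it scans y[[i]] once per bin and calls an R function per list element, it should only be used on small inputs.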
Simulated data with 1 million entries

    set.seed(1)
    N <- 100
    n <- 1e6
    len <- sample(2:3, n, TRUE)
    x <- lapply(seq_len(n), function(ix) {
        probs <- sample(100:1000, len[ix])
        probs <- probs/sum(probs)
        oo <- round(N * probs)
        # rounding can leave the total off by a little; push the difference into the first entry
        if (sum(oo) != 100) {
            oo[1] <- oo[1] + (100 - sum(oo))
        }
        oo
    })

    require(data.table)
    ss <- sample(1:10, n, TRUE)
    dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
    setkey(dt, grp, val)
    y <- dt[, list(list(val)), by=grp]$V1

What I've done so far:
Using mapply (slow):
I first thought of using rank with ties.method="first" and mapply (the obvious choice for 2 lists), and tried this:

    tt1 <- mapply(y, x, FUN=function(a,b) {
        tt <- rank(c(a, cumsum(b)), ties="first")[-(1:length(a))]; c(tt[1]-1, diff(tt)-1)
    })

Although this works fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds to it. This takes 241 seconds!
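To make the rank trick easier to follow, here is the same computation unrolled for the first pair only. This is just a walk-through of the code above on the small example, with intermediate names (combined, r) that are mine rather than the question's.

```r
a <- c(1, 2, 3, 55, 90)   # y[[1]]
b <- c(82, 18)            # x[[1]]

# Put the y values and the bin upper bounds into one vector and rank it.
combined <- c(a, cumsum(b))                 # 1 2 3 55 90 82 100
r <- rank(combined, ties.method = "first")  # 1 2 3 4 6 5 7

# Keep only the ranks of the bounds (the last length(b) entries).
tt <- r[-seq_along(a)]                      # 5 7

# Rank of the first bound minus 1 = number of y values at or below it;
# the gap between consecutive bounds minus 1 = number of y values in between.
counts <- c(tt[1] - 1, diff(tt) - 1)        # 4 1
```

With ties.method = "first", a y value equal to a bound ranks before the bound because it appears earlier in the combined vector, so it is counted on the <= side.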
Therefore, I decided to try to avoid rank and diff by using data.table and sorting with a "group" column. I came up with a longer but much faster solution, shown below:
Using data.table (faster):

    xl <- sapply(x, length)
    yl <- sapply(y, length)
    xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
    xdt[, cumval := cumsum(val), by=grp]
    ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
    tt2 <- rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
    setkey(tt2, grp, val)
    xdt.pos <- which(tt2$type == "x")
    tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
    tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
    tt2[, val := xdt$val]

    # time consuming step
    tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
    tt2 <- tt2[, list(list(V1)), by=grp]$V1

This takes 26 seconds, so it's about 9 times faster. I'm wondering if it's possible to get much more speedup, as I'll have to recursively compute this on 5-10 such 1 million elements. Thank you.
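The idea behind the data.table version (stack both lists into one long table, sort once by group and value, then read off the positions of the x rows) can be sketched in base R on the small example. This is a simplified illustration under my own variable names, not a translation of the code above line by line.

```r
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1, 2, 3, 55, 90), c(37, 38, 95))

xg <- rep(seq_along(x), lengths(x))        # group id of every x entry
yg <- rep(seq_along(y), lengths(y))        # group id of every y entry
xcum <- ave(unlist(x), xg, FUN = cumsum)   # running total within each group

# One long table: y rows first, then the x bounds.
val  <- c(unlist(y), xcum)
grp  <- c(yg, xg)
is_x <- rep(c(FALSE, TRUE), c(length(yg), length(xg)))

# Sort once; on ties FALSE < TRUE, so a y value sorts before an equal bound.
o <- order(grp, val, is_x)
pos <- which(is_x[o])                      # positions of the bounds: 5 7 8 11 13

# y rows at or before each bound = its position minus the bounds seen so far.
le <- pos - seq_along(pos)                 # 4 5 5 7 8
counts <- diff(c(0L, le))                  # 4 1 0 2 1
res <- unname(split(counts, xg))
```

The single diff across group boundaries works here only because the last bound of every group is 100 and no y value exceeds 100, so each group's final bound has already counted all of that group's y values.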
Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.

    # compile the data.table's and set appropriate keys
    xl <- sapply(x, length)
    yl <- sapply(y, length)
    xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
    xdt[, cumval := cumsum(val), by=grp]
    ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))

    # hack #0, set the key but prevent sorting, since we know the data is already sorted
    setattr(ydt, 'sorted', c('grp', 'val'))

    # by setting the key in y to val and in x to cumval we can
    # leverage the rolling joins
    setattr(xdt, 'sorted', c('grp', 'cumval'))  # hack #1 set the key, but prevent sorting
    vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]

    # hack #2, same deal as above
    # we know that the order of cumval and cumval.copy is the same
    # so let's convince data.table of that
    setattr(vals, 'sorted', c('grp', 'cumval.copy'))

    # compute the counts and fill in the missing 0's
    # for when there is no y in the appropriate x interval
    tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]

    # convert to a list
    tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
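As I understand it, the rolling join with roll = -Inf attaches each y value to the next cumulative bound at or above it within its group, and the later steps count rows per bound and fill in zeros. The same bin assignment can be sketched in base R with findInterval on composite (group, value) keys. This is an alternative illustration rather than the answer's method; the key base of 101 and all variable names are my assumptions, and the left.open argument needs R 3.3.0 or later.

```r
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1, 2, 3, 55, 90), c(37, 38, 95))

xg <- rep(seq_along(x), lengths(x))
yg <- rep(seq_along(y), lengths(y))

# Per-group running totals; valid only because every x[[i]] sums to exactly 100.
xcum <- cumsum(unlist(x)) - 100 * (xg - 1)

# Composite keys: values lie in 1..100, so a base of 101 keeps groups from overlapping
# and both key vectors are already sorted.
base <- 101
xkey <- xg * base + xcum
ykey <- yg * base + unlist(y)

# For each y, index of the first bound that is >= it (the bin it rolls forward to).
bin <- findInterval(ykey, xkey, left.open = TRUE) + 1L

# Count rows per bin; tabulate() reports 0 for bins that received no y values.
counts <- tabulate(bin, nbins = length(xkey))   # 4 1 0 2 1
res <- unname(split(counts, xg))
```

This touches each vector only once and needs no keyed table, but it relies on the stated properties (x sums to 100, y sorted and within 1 to 100); I have not timed it against the data.table versions on the 1 million entry data.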