I am working on a small data analysis tool, and practicing/learning Scala in the process. However, I got stuck at a small problem.
Assume data of the type:

    x gr1 x_11 ... x_1n
    x gr2 x_21 ... x_2n
    ..
    x grk x_k1 ... x_kn
    y gr1 y_11 ... y_1n
    y gr3 y_31 ... y_3n
    ..
    y gr(k-1) ...

Here I have entries (x, y, ...) that may or may not exist in up to k groups, with a series of values for each group. What I want to do is pretty simple (in theory): consolidate the rows that belong to the same "entity" in different groups. So instead of multiple lines that start with x, I want to have one row with the values from x_11 to x_kn in columns.
What makes things complicated is that not all entities exist in all groups. Wherever there is "missing data" I would like to pad with, for instance, zeroes, or some string that denotes a missing value. So if I have (x, y, z) in 3 groups, the type of table I want to have is as follows:

    x x_11 x_12 x_21 x_22 x_31 x_32
    y y_11 y_12 n/a  n/a  y_31 y_32
    z n/a  n/a  z_21 z_22 n/a  n/a

I have been stuck trying to figure this out. Is there some smart way to use List functions to solve this?
I wrote this simple loop:

    for {
      (id, hitlist) <- hits.groupBy(_.acc)
      h <- hitlist
    } println(id + "\t" + h.sampleid + "\t" + h.ratios.mkString("\t"))

to be able to generate tables like the example above. Note that my original data is of a different format and layout, but that has little to do with the problem at hand, so I have skipped all steps regarding parsing. I should be able to use groupBy in a better way that actually solves this for me, but I can't seem to get there.
Then I modified my loop, mapping the hits to their ratios and appending them to one another:

    for ((id, hitlist) <- hits.groupBy(_.acc)) {
      val l = hitlist.map(_.ratios).foldRight(List[Double]()) {
        (l1: List[Double], l2: List[Double]) => l1 ::: l2
      }
      println(id + "\t" + l.mkString("\t"))
      //println(id + "\t" + h.sampleid + "\t" + h.ratios.mkString("\t"))
    }

That gets me one step closer, but still no cigar! Instead of a fully padded "matrix" I get a jagged table. Taking the example above:
    x x_11 x_12 x_21 x_22 x_31 x_32
    y y_11 y_12 y_31 y_32
    z z_21 z_22

Any ideas as to how I can pad the table so that values from the respective groups are aligned with one another? I should be able to use _.sampleid, which holds the "group membership" for each "hit", but I am not sure how exactly. hits is a List of type Hit, which is practically a wrapper for each row, giving convenience methods for getting individual values, so essentially a tuple that has "named indices" (such as .acc, .sampleid, ...).

(I would like to solve this problem without hardcoding the number of groups, since it might change from case to case.)

Thanks!
This is a bit of a contrived example, but I think you can see where this is going:

    case class Hit(acc: String, subacc: String, value: Int)

    val hits = List(Hit("x", "x_11", 1), Hit("x", "x_21", 2), Hit("x", "x_31", 3))
    val kmax = 4
    val nmax = 2

    for {
      (id, hitlist) <- hits.groupBy(_.acc)
      k <- 1 to kmax
      n <- 1 to nmax
    } yield {
      // Build the expected sub-id and fall back to a zero-valued Hit when it is missing
      val subid = "x_%s%s".format(k, n)
      val row = hitlist.find(h => h.subacc == subid).getOrElse(Hit(id, subid, 0))
      println(row)
    }

    //Prints
    Hit(x,x_11,1)
    Hit(x,x_12,0)
    Hit(x,x_21,2)
    Hit(x,x_22,0)
    Hit(x,x_31,3)
    Hit(x,x_32,0)
    Hit(x,x_41,0)
    Hit(x,x_42,0)

If you provide more information on how your hits lists come in, I could be a little more accurate.
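To tie this back to the sampleid and ratios fields from the question, and to avoid hardcoding the number of groups, something along these lines might work. Treat it as a sketch under assumptions rather than a definitive solution: the Hit shape used here (sampleid as a String, ratios as a List[Double]), the PadTable object, the padded function and the sample values are all made up for illustration, and it assumes there is at most one hit per entity and group.

```scala
// Hypothetical stand-in for the asker's Hit wrapper; the field names follow
// the question, the field types are assumptions.
case class Hit(acc: String, sampleid: String, ratios: List[Double])

object PadTable {
  // Returns one row per entity: the id followed by one block of cells per group.
  // Groups and block widths are derived from the data, not hardcoded.
  def padded(hits: List[Hit], missing: String = "n/a"): List[List[String]] = {
    // Every group that occurs anywhere, in a stable (sorted) order.
    val groups = hits.map(_.sampleid).distinct.sorted
    // Longest series seen in each group, so all rows pad to the same width.
    val width: Map[String, Int] =
      hits.groupBy(_.sampleid).map { case (g, hs) => g -> hs.map(_.ratios.size).max }

    hits.groupBy(_.acc).toList.sortBy(_._1).map { case (id, hitlist) =>
      // Assumption: at most one hit per (entity, group); a later duplicate wins.
      val byGroup = hitlist.map(h => h.sampleid -> h).toMap
      val cells = groups.flatMap { g =>
        val values = byGroup.get(g).map(_.ratios.map(_.toString)).getOrElse(Nil)
        // Fill absent groups (or short series) with the missing-value marker.
        values.padTo(width(g), missing)
      }
      id :: cells
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data: y is absent from gr2, z is absent from gr1.
    val hits = List(
      Hit("x", "gr1", List(1.0, 2.0)),
      Hit("x", "gr2", List(3.0, 4.0)),
      Hit("y", "gr1", List(5.0, 6.0)),
      Hit("z", "gr2", List(7.0, 8.0))
    )
    padded(hits).foreach(row => println(row.mkString("\t")))
  }
}
```

The idea is to collect the full set of groups first, then look each one up per entity and pad whatever is absent, so the columns line up no matter how many groups the data contains. Passing "0" as the second argument would pad with zeroes instead of "n/a".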