I want counts of the character-level n-grams present in a text file. Using R, I wrote a small piece of code for this. The code is neglecting the digits present in the text. Can anyone help me fix this issue?
Here is the code:
    library(tau)
    temp <- read.csv("/home/aravi/documents/sample/csv/ex.csv", header=TRUE, stringsAsFactors=FALSE)
    r <- textcnt(temp, method="ngram", n=4L, decreasing=TRUE)
    a <- data.frame(counts = unclass(r), size = nchar(names(r)))
    b <- split(a, a$size)
    b

Here are the contents of the input file:
    abcd123 appl2345e coun56ry live123 names3423bsdf coun56ryas

This is the output:
    $`1`
      counts size
    _     18    1
    a      3    1
    e      3    1
    n      3    1
    s      3    1
    c      2    1
    l      2    1
    o      2    1
    p      2    1
    r      2    1
    u      2    1
    y      2    1
    b      1    1
    d      1    1
    f      1    1
    i      1    1
    m      1    1
    v      1    1

    $`2`
       counts size
    _c      2    2
    _r      2    2
    co      2    2
    e_      2    2
    n_      2    2
    ou      2    2
    ry      2    2
    s_      2    2
    un      2    2
    _a      1    2
    _b      1    2
    _e      1    2
    _l      1    2
    _n      1    2
    am      1    2
    ap      1    2
    as      1    2
    bs      1    2
    df      1    2
    es      1    2
    f_      1    2
    iv      1    2
    l_      1    2
    li      1    2
    me      1    2
    na      1    2
    pl      1    2
    pp      1    2
    sd      1    2
    ve      1    2
    y_      1    2
    ya      1    2

    $`3`
        counts size
    _co      2    3
    _ry      2    3
    cou      2    3
    oun      2    3
    un_      2    3
    _ap      1    3
    _bs      1    3
    _e_      1    3
    _li      1    3
    _na      1    3
    ame      1    3
    app      1    3
    as_      1    3
    bsd      1    3
    df_      1    3
    es_      1    3
    ive      1    3
    liv      1    3
    mes      1    3
    nam      1    3
    pl_      1    3
    ppl      1    3
    ry_      1    3
    rya      1    3
    sdf      1    3
    ve_      1    3
    yas      1    3

    $`4`
         counts size
    _cou      2    4
    coun      2    4
    oun_      2    4
    _app      1    4
    _bsd      1    4
    _liv      1    4
    _nam      1    4
    _ry_      1    4
    _rya      1    4
    ames      1    4
    appl      1    4
    bsdf      1    4
    ive_      1    4
    live      1    4
    mes_      1    4
    name      1    4
    ppl_      1    4
    ryas      1    4
    sdf_      1    4
    yas_      1    4

Could anyone tell me what I am missing or what went wrong? Thanks in advance.
The default value of the split argument in textcnt includes digits, so numbers are being treated as delimiters. Remove the digits from split, and things work.
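A minimal sketch of that fix, assuming the default of split in tau is "[[:space:][:punct:][:digit:]]+" (check ?textcnt for your installed version). To keep the example self-contained, the strings from the question are passed as a character vector instead of being read from the CSV file; the vector name x is only for illustration.

```r
library(tau)

# Strings copied from the question's input file (stand-in for read.csv)
x <- c("abcd123", "appl2345e", "coun56ry", "live123",
       "names3423bsdf", "coun56ryas")

# Override split so that only whitespace and punctuation delimit tokens;
# leaving out [:digit:] keeps the digits inside the n-grams.
r <- textcnt(x, method = "ngram", n = 4L,
             split = "[[:space:][:punct:]]+", decreasing = TRUE)

# Same post-processing as in the question: one data frame per n-gram size
a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)
b
```

With this split, n-grams that contain digits should now show up in each of the per-size tables.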