Digits being neglected while performing N-gram in R

I want counts of the character-level n-grams present in a text file. Using R, I wrote a small piece of code for this, but the code is neglecting the digits present in the text. Can anyone help me fix this issue?

Here is the code:

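    library(tau)

    # Read the input file; the first row is treated as a header
    temp <- read.csv("/home/aravi/documents/sample/csv/ex.csv",
                     header = TRUE, stringsAsFactors = FALSE)

    # Count character n-grams of length 1 to 4, most frequent first
    r <- textcnt(temp, method = "ngram", n = 4L, decreasing = TRUE)

    # Group the counts by n-gram length
    a <- data.frame(counts = unclass(r), size = nchar(names(r)))
    b <- split(a, a$size)
    b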

Here are the contents of the input file:

abcd123 appl2345e coun56ry live123 names3423bsdf coun56ryas

This is the output:

    $`1`
      counts size
    _     18    1
    a      3    1
    e      3    1
    n      3    1
    s      3    1
    c      2    1
    l      2    1
    o      2    1
    p      2    1
    r      2    1
    u      2    1
    y      2    1
    b      1    1
    d      1    1
    f      1    1
    i      1    1
    m      1    1
    v      1    1

    $`2`
       counts size
    _c      2    2
    _r      2    2
    co      2    2
    e_      2    2
    n_      2    2
    ou      2    2
    ry      2    2
    s_      2    2
    un      2    2
    _a      1    2
    _b      1    2
    _e      1    2
    _l      1    2
    _n      1    2
    am      1    2
    ap      1    2
    as      1    2
    bs      1    2
    df      1    2
    es      1    2
    f_      1    2
    iv      1    2
    l_      1    2
    li      1    2
    me      1    2
    na      1    2
    pl      1    2
    pp      1    2
    sd      1    2
    ve      1    2
    y_      1    2
    ya      1    2

    $`3`
        counts size
    _co      2    3
    _ry      2    3
    cou      2    3
    oun      2    3
    un_      2    3
    _ap      1    3
    _bs      1    3
    _e_      1    3
    _li      1    3
    _na      1    3
    ame      1    3
    app      1    3
    as_      1    3
    bsd      1    3
    df_      1    3
    es_      1    3
    ive      1    3
    liv      1    3
    mes      1    3
    nam      1    3
    pl_      1    3
    ppl      1    3
    ry_      1    3
    rya      1    3
    sdf      1    3
    ve_      1    3
    yas      1    3

    $`4`
         counts size
    _cou      2    4
    coun      2    4
    oun_      2    4
    _app      1    4
    _bsd      1    4
    _liv      1    4
    _nam      1    4
    _ry_      1    4
    _rya      1    4
    ames      1    4
    appl      1    4
    bsdf      1    4
    ive_      1    4
    live      1    4
    mes_      1    4
    name      1    4
    ppl_      1    4
    ryas      1    4
    sdf_      1    4
    yas_      1    4

Could anyone tell me what I am missing or where I went wrong? Thanks in advance.

The default value of the split argument in textcnt includes "[:digit:]", so numbers are being treated as delimiters. Remove that from the pattern, and things should work.
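As a minimal sketch of that suggestion: the snippet below passes textcnt a split pattern without the digit class and compares it with the default behaviour. The pattern "[[:space:][:punct:]]+" is an assumption (the documented default with only [:digit:] removed), and the words vector and the names r_default, r_fixed, has_default and has_fixed are illustrative stand-ins for the question's CSV input.

```r
library(tau)

# Sample words taken from the question's input file (stand-in for the CSV)
words <- c("appl2345e", "coun56ry", "live123", "names3423bsdf", "coun56ryas")

# Default split treats whitespace, punctuation AND digits as delimiters
r_default <- textcnt(words, method = "ngram", n = 4L, decreasing = TRUE)

# Assumed fix: the same pattern with [:digit:] dropped,
# so digits stay inside the tokens and show up in the n-grams
r_fixed <- textcnt(words, method = "ngram", n = 4L, decreasing = TRUE,
                   split = "[[:space:][:punct:]]+")

# Check whether a digit-containing 4-gram from "live123" was counted
has_default <- "e123" %in% names(r_default)
has_fixed   <- "e123" %in% names(r_fixed)
```

With the question's own code, the same split argument can be added to the existing textcnt(temp, ...) call; the rest of the script stays unchanged.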

Brazier
