Java - WS4J returns infinity for similarity measures that should return 1

I have simple code, taken from this example, that uses the Lin, Path, and Wu-Palmer similarity measures to compute the similarity between two words. The code is as follows:

    import edu.cmu.lti.lexical_db.ILexicalDatabase;
    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.Lin;
    import edu.cmu.lti.ws4j.impl.Path;
    import edu.cmu.lti.ws4j.impl.WuPalmer;

    public class Test {
        // One shared WordNet database, one calculator per similarity measure
        private static ILexicalDatabase db = new NictWordNet();
        private static RelatednessCalculator lin = new Lin(db);
        private static RelatednessCalculator wup = new WuPalmer(db);
        private static RelatednessCalculator path = new Path(db);

        public static void main(String[] args) {
            String w1 = "walk";
            String w2 = "trot";
            // Print the Lin, Wu-Palmer, and Path scores for the word pair
            System.out.println(lin.calcRelatednessOfWords(w1, w2));
            System.out.println(wup.calcRelatednessOfWords(w1, w2));
            System.out.println(path.calcRelatednessOfWords(w1, w2));
        }
    }

The scores are as expected except when both words are identical. If both words are the same (e.g. w1 = "walk"; w2 = "walk";), the three measures should each return 1.0. Instead, they return 1.7976931348623157E308.
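For what it's worth, the printed value is not actually infinity: 1.7976931348623157E308 is exactly Double.MAX_VALUE, the largest finite double. The check below is a standalone sketch that does not touch WS4J. The helper name `looksLikeSentinel` is made up for illustration, and reading the value as a sentinel rather than a computed score is only an assumption.

```java
// Standalone check (no WS4J needed): the odd score is Double.MAX_VALUE, not infinity.
class SentinelCheck {
    // Hypothetical helper: flags a score equal to the largest finite double.
    static boolean looksLikeSentinel(double score) {
        return score == Double.MAX_VALUE;
    }

    public static void main(String[] args) {
        double observed = 1.7976931348623157E308; // value printed for "walk" vs "walk"
        System.out.println(looksLikeSentinel(observed)); // true
        System.out.println(Double.isInfinite(observed)); // false
    }
}
```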

I have used WS4J before (the same version, in fact), but I have never seen this behavior. Searching online has not yielded any clues. What could possibly be going wrong here?

P.S. The fact that the Lin, Wu-Palmer, and Path measures should return 1 can be verified with the online demo provided by WS4J.

I had a similar problem, and here's what's going on. I hope other people who run into this problem find this response helpful.

If you have noticed, the online demo allows you to choose a word sense by specifying the word in the following format: word#pos_tag#word_sense. For example, the noun "gender" with the first word sense would be gender#n#1.
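To make the notation concrete, here is a minimal sketch that splits such a string into its three parts. `SenseKey` and its field names are illustrative only and are not part of WS4J; the sketch assumes exactly two `#` separators and a numeric sense.

```java
// Minimal parser for the demo's "word#pos_tag#word_sense" notation.
// SenseKey and its fields are illustrative names, not part of WS4J.
final class SenseKey {
    final String word;
    final String pos;  // part-of-speech tag, e.g. "n" for noun
    final int sense;   // 1-based word sense number

    SenseKey(String word, String pos, int sense) {
        this.word = word;
        this.pos = pos;
        this.sense = sense;
    }

    // Splits on '#' and rejects anything that is not exactly three parts.
    static SenseKey parse(String text) {
        String[] parts = text.split("#");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected word#pos_tag#word_sense: " + text);
        }
        return new SenseKey(parts[0], parts[1], Integer.parseInt(parts[2]));
    }

    public static void main(String[] args) {
        SenseKey key = parse("gender#n#1");
        System.out.println(key.word + " / " + key.pos + " / " + key.sense); // gender / n / 1
    }
}
```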

Your code snippet uses the first word sense by default. When I calculate the WuPalmer similarity between "gender" and "sex", it returns 0.26. If I use the online demo, it returns 1.0. But if I use "gender#n#1" and "sex#n#1", the online demo returns 0.26, so there is no discrepancy. The online demo calculates the max over all POS tag / word sense pairs. Here's the corresponding snippet of code that should do the trick:

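    import java.util.List;
    import edu.cmu.lti.jawjaw.pobj.POS;
    import edu.cmu.lti.lexical_db.ILexicalDatabase;
    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.lexical_db.data.Concept;
    import edu.cmu.lti.ws4j.Relatedness;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.Lin;
    import edu.cmu.lti.ws4j.util.WS4JConfiguration;

    // The statements below go inside a method
    ILexicalDatabase db = new NictWordNet();
    WS4JConfiguration.getInstance().setMFS(true);
    RelatednessCalculator rc = new Lin(db);
    String word1 = "gender";
    String word2 = "sex";
    List<POS[]> posPairs = rc.getPOSPairs();
    double maxScore = -1D; // -1 marks "no synset pair scored yet"

    // Try every POS pair the measure supports, and every synset pair within it
    for (POS[] posPair : posPairs) {
        List<Concept> synsets1 = (List<Concept>) db.getAllConcepts(word1, posPair[0].toString());
        List<Concept> synsets2 = (List<Concept>) db.getAllConcepts(word2, posPair[1].toString());

        for (Concept synset1 : synsets1) {
            for (Concept synset2 : synsets2) {
                Relatedness relatedness = rc.calcRelatednessOfSynset(synset1, synset2);
                double score = relatedness.getScore();
                if (score > maxScore) {
                    maxScore = score;
                }
            }
        }
    }

    // No synset pair found at all: report 0.0 instead of the -1 marker
    if (maxScore == -1D) {
        maxScore = 0.0;
    }

    System.out.println("sim('" + word1 + "', '" + word2 + "') =  " + maxScore);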
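To see why the first-sense score and the max-over-senses score can differ for the same word pair, here is a toy illustration that does not use WS4J. The score table is made up; only 0.26 (first senses) and 1.0 (the maximum) are taken from the numbers above, and the class and method names are hypothetical.

```java
// Toy illustration with a made-up score table; only 0.26 and 1.0 come from the answer text.
class FirstSenseVsMax {
    // SCORES[i][j]: hypothetical similarity of sense i+1 of word1 and sense j+1 of word2.
    static final double[][] SCORES = {
        {0.26, 0.40},
        {0.33, 1.00},
    };

    // What you get when only the first sense of each word is compared.
    static double firstSense(double[][] scores) {
        return scores[0][0];
    }

    // What you get when every sense pair is compared and the best score is kept.
    static double maxOverSenses(double[][] scores) {
        double max = 0.0;
        for (double[] row : scores) {
            for (double s : row) {
                max = Math.max(max, s);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(firstSense(SCORES));    // 0.26
        System.out.println(maxOverSenses(SCORES)); // 1.0
    }
}
```

Both numbers are legitimate answers to slightly different questions, which is why specifying the senses explicitly in the demo removes the apparent mismatch.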

Also, the WS4J code above will give 0.0 similarity on non-stemmed word forms, e.g. 'genders' and 'sex'. You can use the Porter stemmer included in WS4J to make sure you stem words beforehand if needed.

Hope this helps!
