I have some simple code, taken from this example, that uses the Lin, Path, and Wu-Palmer similarity measures to compute the similarity between two words. The code is as follows:
    import edu.cmu.lti.lexical_db.ILexicalDatabase;
    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.Lin;
    import edu.cmu.lti.ws4j.impl.Path;
    import edu.cmu.lti.ws4j.impl.WuPalmer;

    public class Test {
        // One shared WordNet database, and one calculator per measure.
        private static ILexicalDatabase db = new NictWordNet();
        private static RelatednessCalculator lin = new Lin(db);
        private static RelatednessCalculator wup = new WuPalmer(db);
        private static RelatednessCalculator path = new Path(db);

        public static void main(String[] args) {
            String w1 = "walk";
            String w2 = "trot";
            // Print the word-level score for each measure.
            System.out.println(lin.calcRelatednessOfWords(w1, w2));
            System.out.println(wup.calcRelatednessOfWords(w1, w2));
            System.out.println(path.calcRelatednessOfWords(w1, w2));
        }
    }

The scores are as expected, except when both words are identical. If both words are the same (e.g. w1 = "walk"; w2 = "walk";), the three measures should each return 1.0. Instead, they are returning 1.7976931348623157E308.
I have used WS4J before (the same version, in fact), but I have never seen this behavior. Searching online has not yielded any clues. What could possibly be going wrong here?
P.S. The fact that the Lin, Wu-Palmer, and Path measures should return 1 can be verified with the online demo provided by WS4J.
I had a similar problem, and here's what's going on. I hope other people who run into this problem find this response helpful.
If you have noticed, the online demo allows you to choose the word sense by specifying the word in the following format: word#pos_tag#word_sense. For example, the noun gender with the first word sense is gender#n#1.
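To make the format concrete, here is a minimal sketch of splitting such a query into its three parts. `SenseKey` and its methods are hypothetical names used only for illustration; they are not part of WS4J. Treating a bare word, or a word with only a POS tag, as "unspecified" is an assumption on my side, not something the demo documents.

```java
/** Hypothetical holder for a query of the form word#pos_tag#word_sense (not a WS4J class). */
final class SenseKey {
    final String word;
    final String pos;  // part-of-speech tag such as "n" or "v"; null if omitted
    final int sense;   // 1-based sense number; 0 if omitted

    private SenseKey(String word, String pos, int sense) {
        this.word = word;
        this.pos = pos;
        this.sense = sense;
    }

    /** Splits "gender#n#1" into its parts; a bare "gender" leaves pos and sense unset. */
    static SenseKey parse(String query) {
        String[] parts = query.trim().split("#");
        if (parts.length == 0 || parts.length > 3 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("Expected word[#pos[#sense]]: " + query);
        }
        String pos = parts.length >= 2 ? parts[1] : null;
        if (pos != null && pos.isEmpty()) {
            throw new IllegalArgumentException("Empty POS tag in: " + query);
        }
        int sense = parts.length == 3 ? Integer.parseInt(parts[2]) : 0;
        if (parts.length == 3 && sense < 1) {
            throw new IllegalArgumentException("Sense numbers start at 1: " + query);
        }
        return new SenseKey(parts[0], pos, sense);
    }

    /** True only when both the POS tag and the sense number were given. */
    boolean isFullySpecified() {
        return pos != null && sense > 0;
    }

    public static void main(String[] args) {
        SenseKey key = SenseKey.parse("gender#n#1");
        System.out.println(key.word + " / " + key.pos + " / " + key.sense); // prints "gender / n / 1"
    }
}
```

A fully specified key pins one synset per word, while a bare word leaves the choice of sense open, which is where the difference in scores comes from.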
Your code snippet uses the first word sense by default. When I calculate the WuPalmer similarity between "gender" and "sex", it returns 0.26. If I use the online demo, it returns 1.0. But if I use "gender#n#1" and "sex#n#1", the online demo returns 0.26 as well, so there is no discrepancy. The online demo calculates the max over all POS tag / word sense pairs. Here's the corresponding snippet of code that should do the trick:
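    import java.util.List;
    import edu.cmu.lti.jawjaw.pobj.POS;
    import edu.cmu.lti.lexical_db.ILexicalDatabase;
    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.lexical_db.data.Concept;
    import edu.cmu.lti.ws4j.Relatedness;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.Lin;
    import edu.cmu.lti.ws4j.util.WS4JConfiguration;

    // The statements below go inside a method, e.g. main().
    ILexicalDatabase db = new NictWordNet();
    WS4JConfiguration.getInstance().setMFS(true);
    RelatednessCalculator rc = new Lin(db);
    String word1 = "gender";
    String word2 = "sex";
    List<POS[]> posPairs = rc.getPOSPairs();
    double maxScore = -1D;

    // Try every POS pair the measure supports.
    for (POS[] posPair : posPairs) {
        List<Concept> synsets1 = (List<Concept>) db.getAllConcepts(word1, posPair[0].toString());
        List<Concept> synsets2 = (List<Concept>) db.getAllConcepts(word2, posPair[1].toString());

        // Score every synset pair and keep the best score seen so far.
        for (Concept synset1 : synsets1) {
            for (Concept synset2 : synsets2) {
                Relatedness relatedness = rc.calcRelatednessOfSynset(synset1, synset2);
                double score = relatedness.getScore();
                if (score > maxScore) {
                    maxScore = score;
                }
            }
        }
    }

    // No synset pair was found at all: report 0.0 instead of the -1 placeholder.
    if (maxScore == -1D) {
        maxScore = 0.0;
    }

    System.out.println("sim('" + word1 + "', '" + word2 + "') = " + maxScore);

Also, this will give you 0.0 similarity on non-stemmed word forms, e.g. 'genders' and 'sex'. You can use the Porter stemmer included in WS4J to make sure you stem the words beforehand if needed.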
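One more note on the value reported in the question: 1.7976931348623157E308 is exactly Java's `Double.MAX_VALUE`, so it looks like a sentinel rather than a real similarity score. If you keep calling the word-level method, a small guard like the one below could map it back to the expected 1.0. This is only a sketch: `ScoreGuard` and `normalize` are hypothetical names, not part of WS4J, and reading `Double.MAX_VALUE` as "identical words" is an assumption based on the behavior described above, not on the library's documentation.

```java
/** Hypothetical post-processing helper (not a WS4J class). */
final class ScoreGuard {
    /** Score that Lin, Wu-Palmer and Path are expected to give for identical words. */
    static final double IDENTICAL_SCORE = 1.0;

    /**
     * Maps the Double.MAX_VALUE sentinel to 1.0 and leaves every other score untouched.
     * Assumption: the sentinel only appears when both words are the same.
     */
    static double normalize(double rawScore) {
        return rawScore == Double.MAX_VALUE ? IDENTICAL_SCORE : rawScore;
    }

    public static void main(String[] args) {
        System.out.println(normalize(Double.MAX_VALUE)); // prints 1.0
        System.out.println(normalize(0.26));             // prints 0.26
    }
}
```

You would wrap each call, e.g. `ScoreGuard.normalize(lin.calcRelatednessOfWords(w1, w2))`, using the calculators from the question.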
Hope this helps!