There are many methods of determining the similarity and difference between terms. None is simpler to implement than the Levenshtein edit distance, but in many ways that algorithm is grossly insufficient, because it doesn't take a word's meaning or sense into consideration at all. For accuracy, Wu-Palmer is the all-around best.
6 Algorithms
Wu-Palmer
– returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their least common subsumer. It weights the edges based on distance in the hierarchy: jumping from inanimate to animate, for example, is a larger distance than jumping from, say, Felid to Canid. In some sense we can think of it as a sort of edit distance, assigning type-changing operations a higher cost the higher they are in the hierarchy.
Levenshtein
– measures the minimum number of single-character edits required to change one word into another. It simply counts the string transformations needed to get from string a to string b, and takes no account of meaning.
Path similarity
– a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.
Jiang-Conrath similarity
– based on Resnik's similarity; considers the information content of the lowest common subsumer (LCS) and of the two compared concepts. Note that information content can only be computed for nouns and verbs in WordNet, since these are the only parts of speech where concepts are organized in hierarchies.
Leacock-Chodorow similarity
– uses path similarity to compute the shortest number of edges from one word sense to another word sense, assuming a hierarchical structure.
Lin similarity
– also based on Resnik's similarity; it considers the same ingredients as Jiang-Conrath but combines them with a different formula (see the formulas after this list).
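Since Jiang-Conrath and Lin work from the same ingredients, the distinction is easiest to see in the formulas. For reference, these are the standard definitions the WordNet-based measures use, where len(s1, s2) is the shortest path length between the senses, depth(s) is the depth of a sense in the taxonomy, D is the maximum taxonomy depth, lcs is the least common subsumer, and IC is information content:

```latex
\begin{aligned}
\mathrm{path}(s_1, s_2) &= \frac{1}{1 + \mathrm{len}(s_1, s_2)} \\
\mathrm{lch}(s_1, s_2)  &= -\log\left(\frac{\mathrm{len}(s_1, s_2)}{2D}\right) \\
\mathrm{wup}(s_1, s_2)  &= \frac{2 \cdot \mathrm{depth}(\mathrm{lcs})}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)} \\
\mathrm{jcn}(s_1, s_2)  &= \frac{1}{\mathrm{IC}(s_1) + \mathrm{IC}(s_2) - 2 \cdot \mathrm{IC}(\mathrm{lcs})} \\
\mathrm{lin}(s_1, s_2)  &= \frac{2 \cdot \mathrm{IC}(\mathrm{lcs})}{\mathrm{IC}(s_1) + \mathrm{IC}(s_2)}
\end{aligned}
```

This also explains the 1e+300 Jiang-Conrath scores in the results below: when the two senses are identical, the denominator is zero, and NLTK substitutes a very large number for infinity.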
Words similar to “yell”
Using all 6 algorithms (on verbs), you can see that Wu-Palmer performs the best: scream is closest to yell, with a similarity score of 1.0, while whisper is the furthest, with a similarity score of 0.22. Note, however, that Wu-Palmer will not work with adjectives (a quick demonstration follows the table).
Computational difference with "yell" (x = could not be computed):

| Word | Wup | Lev | Path | LCH | Jng | Lin |
|---|---|---|---|---|---|---|
| scream | 1.0 | 5 | 1.0 | x | 1e+300 | 1.0 |
| cry | 1.0 | 4 | 1.0 | x | 1e+300 | 1.0 |
| wail | 0.8 | 3 | 0.5 | x | 0.37 | 0.85 |
| groan | 0.33 | 5 | 0.33 | x | 0.11 | 0.57 |
| speak | 0.25 | 4 | 0.14 | x | 0.08 | 0.0 |
| whisper | 0.22 | 7 | 0.12 | x | 0.05 | 0.0 |
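As for the adjective caveat: adjectives in WordNet are not organized into a hypernym taxonomy, so there is no least common subsumer to measure against. A minimal sketch (the senses happy.a.01 and glad.a.01 are just illustrative picks; in the NLTK versions I've seen, the call returns None rather than a score):

```python
from nltk.corpus import wordnet

# Adjectives have no is-a hierarchy in WordNet, so Wu-Palmer
# has no least common subsumer to work with.
happy = wordnet.synset('happy.a.01')
glad = wordnet.synset('glad.a.01')
print(happy.wup_similarity(glad))  # expected: None
```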
The NLTK Python code
If you'd care to play around with this, here's the Python code.
```python
import nltk
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic
from pprint import pprint

# Load an information content corpus for the IC-based measures (Jiang-Conrath, Lin)
brown_ic = wordnet_ic.ic('ic-brown.dat')

# Choose the word, its part of speech, and the synset
word = 'yell'
pos = 'v'  # WordNet part-of-speech tag for verbs
word_synset = wordnet.synset('yell.v.01')

# Select some comparison words
synonyms = [
    'scream',
    'speak',
    'whisper',
    'groan',
    'cry',
    'wail'
]

# Set up a list to hold the similarity measurements
comparisons = []

# Iterate over the compared words
for c in synonyms:
    # The first verb sense of each comparison word
    c_synset = wordnet.synset('%s.%s.01' % (c, pos))

    # 1. Levenshtein distance is by far the most straightforward (though not very accurate)
    lev = nltk.edit_distance(word, c)

    # 2. Wu-Palmer similarity
    wup = round(word_synset.wup_similarity(c_synset), 2)

    # 3. Path similarity
    pth = round(word_synset.path_similarity(c_synset), 2)

    # 4. Jiang-Conrath similarity
    try:
        jng = round(word_synset.jcn_similarity(c_synset, brown_ic), 2)
    except Exception:
        jng = None

    # 5. Leacock-Chodorow similarity; with simulate_root=False, verb senses
    # that share no common root raise an error, so this can fall back to None
    try:
        lch = round(word_synset.lch_similarity(c_synset, simulate_root=False), 2)
    except Exception:
        lch = None

    # 6. Lin similarity
    try:
        lin = round(word_synset.lin_similarity(c_synset, brown_ic), 2)
    except Exception:
        lin = None

    # Record the results
    comparisons.append({
        "word": c,
        "lev": lev,
        "wup": wup,
        "pth": pth,
        "jng": jng,
        "lch": lch,
        "lin": lin
    })

# Sort the comparisons (descending) by Wu-Palmer
comparisons = sorted(comparisons, key=lambda x: -x['wup'])

# Print the results
pprint(comparisons)
```
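If you haven't already fetched the corpora this script relies on, a one-time download is needed first (these are NLTK's resource names for WordNet and the information content files):

```python
import nltk
nltk.download('wordnet')     # the WordNet corpus itself
nltk.download('wordnet_ic')  # information content files, including ic-brown.dat
```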
Here's what you wind up with...
```python
[{'jng': 1e+300,
  'lch': None,
  'lev': 5,
  'lin': 1.0,
  'pth': 1.0,
  'word': 'scream',
  'wup': 1.0},
 {'jng': 1e+300,
  'lch': None,
  'lev': 4,
  'lin': 1.0,
  'pth': 1.0,
  'word': 'cry',
  'wup': 1.0},
 {'jng': 0.37,
  'lch': None,
  'lev': 3,
  'lin': 0.85,
  'pth': 0.5,
  'word': 'wail',
  'wup': 0.8},
 {'jng': 0.11,
  'lch': None,
  'lev': 5,
  'lin': 0.57,
  'pth': 0.33,
  'word': 'groan',
  'wup': 0.33},
 {'jng': 0.08,
  'lch': None,
  'lev': 4,
  'lin': 0.0,
  'pth': 0.14,
  'word': 'speak',
  'wup': 0.25},
 {'jng': 0.05,
  'lch': None,
  'lev': 7,
  'lin': 0.0,
  'pth': 0.12,
  'word': 'whisper',
  'wup': 0.22}]
```
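One caveat worth noting: the script compares only the first verb sense of each word ('%s.v.01'). To be robust to sense choice, you could take the maximum score over all sense pairs instead. A minimal sketch (best_wup is a hypothetical helper, not part of NLTK):

```python
from nltk.corpus import wordnet

def best_wup(word_a, word_b, pos=wordnet.VERB):
    # Compare every verb sense of word_a against every verb sense of word_b
    scores = [
        s1.wup_similarity(s2)
        for s1 in wordnet.synsets(word_a, pos=pos)
        for s2 in wordnet.synsets(word_b, pos=pos)
    ]
    # wup_similarity returns None when no path exists; drop those
    scores = [s for s in scores if s is not None]
    return max(scores, default=None)

print(best_wup('yell', 'whisper'))
```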
Other direct and derivational methods of gauging similarity and difference
Concordance maps
Entailments
Keywords in Context (aka: kwic)
Hypernyms
Holonyms
Markov chains
Meronyms
n-grams
Pertainyms
Word collocations
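Several of these (entailments, hypernyms, holonyms, meronyms, pertainyms) are direct WordNet relations you can query the same way as the similarity measures above. A quick illustrative sketch (the exact synsets returned depend on your WordNet version):

```python
from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
print(dog.hypernyms())        # more general concepts, e.g. canine.n.02
print(dog.member_holonyms())  # wholes a dog belongs to, e.g. pack.n.06
print(dog.part_meronyms())    # parts of a dog

snore = wordnet.synset('snore.v.01')
print(snore.entailments())    # what snoring entails, e.g. a sense of sleep
```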