Idris Raja

NPR Puzzle: Finding Synonyms with Python and WordNet

In github, hacking, NPR, python, word_puzzle on February 1, 2011 at 8:45 am

This week’s puzzle asks:

From Alan Meyer of Newberg, Ore.: Think of a common word that’s six letters long and includes a Q. Change the Q to an N, and rearrange the result to form a new word that’s a synonym of the first one. What are the words?

This puzzle is a good opportunity to play with some very cool computational language tools available through the Natural Language Toolkit (NLTK). NLTK is a collection of libraries and functions that provide powerful tools for symbolic and statistical natural language processing (NLP). Check out the free NLTK book and go through the first chapter; the examples and tutorials will blow your mind. Be forewarned and/or foredelighted that the book contains heavy amounts of linguistic jargon, stats, and programming.

Before we get to using the NLTK, let’s break down this puzzle into multiple steps.

1. Think of a common word that’s six letters long and includes a Q.
2. Change the Q to an N, and rearrange the result to form a new word
3. that’s a synonym of the first one

1. This is straightforward. We simply go through the English dictionary and grab every 6-letter word that contains a ‘q’.

2. This is also straightforward. We swap the ‘q’ with an ‘n’, then create all possible anagrams and check if any of those anagrams are in the English dictionary.

3. This step is the fun one. It requires the computer to understand the meaning of the words from steps 1 and 2 and see if they are synonyms. How can a computer possibly do this? This is where the NLTK comes in. Included as part of the NLTK is WordNet, a lexical database that, among many other things, groups synonyms into sets called synsets. But it isn’t just a dumb list of words with similar meanings, because the same word can be used to mean very different things, and two synonyms of a word might not be synonyms of each other.

This is best demonstrated through example and the use of one of our language’s most versatile words – shit.

idris@idris-laptop:~$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import wordnet as wn
>>> syn_sets = wn.synsets('shit')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('crap.n.01') synonyms:['crap', 'dirt', 'shit', 'shite', 'poop', 'turd']
Synset('bullshit.n.01') synonyms:['bullshit', 'bull', 'Irish_bull', 'horseshit', 'shit', 'crap', 'dogshit']
Synset('jack.n.01') synonyms:['jack', 'doodly-squat', 'diddly-squat', 'diddlysquat', 'diddly-shit', 'diddlyshit', 'diddly', 'diddley', 'squat', 'shit']
Synset('shit.n.04') synonyms:['shit', 'dump']
Synset('asshole.n.01') synonyms:['asshole', 'bastard', 'cocksucker', 'dickhead', 'shit', 'mother_fucker', 'motherfucker', 'prick', 'whoreson', 'son_of_a_bitch', 'SOB']
Synset('damn.n.01') synonyms:['damn', 'darn', 'hoot', 'red_cent', 'shit', 'shucks', "tinker's_damn", "tinker's_dam"]
Synset('denounce.v.04') synonyms:['denounce', 'tell_on', 'betray', 'give_away', 'rat', 'grass', 'shit', 'shop', 'snitch', 'stag']
Synset('stool.v.04') synonyms:['stool', 'defecate', 'shit', 'take_a_shit', 'take_a_crap', 'ca-ca', 'crap', 'make']

So what I’ve done here is started Python in the terminal, and then imported the wordnet module from nltk.corpus. To show off WordNet, I use it to list all the synonym sets that have ‘shit’ as a member. The member words of a particular synonym set are all synonyms of each other. For example, the synset named ‘stool.v.04’ contains all the words that are synonyms of the word stool in one of its verb senses, including stool, defecate, shit, take a shit, take a crap, ca-ca, crap and make. Similarly, we see that shit is a member of synset ‘jack.n.01’, as in you ain’t got shit. Notice how the words in these two synsets are not synonyms of each other: defecate from ‘stool.v.04’ is not equivalent in meaning to diddlysquat in ‘jack.n.01’. But those words are both synonyms for at least one meaning of shit.
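To make that last point concrete, here is a minimal sketch in Python 3 that uses a plain dictionary as a stand-in for WordNet. The names TOY_SYNSETS and shared_synsets are made up for this illustration and are not part of NLTK; the two member lists are copied from the session output above.

```python
# Toy stand-in for WordNet: synset name -> member words.
# Both entries are copied from the session output; this is not the NLTK API.
TOY_SYNSETS = {
    "stool.v.04": ["stool", "defecate", "shit", "take_a_shit",
                   "take_a_crap", "ca-ca", "crap", "make"],
    "jack.n.01": ["jack", "doodly-squat", "diddly-squat", "diddlysquat",
                  "diddly-shit", "diddlyshit", "diddly", "diddley",
                  "squat", "shit"],
}


def shared_synsets(word_a, word_b, synsets=TOY_SYNSETS):
    """Return the names of the synsets that contain both words."""
    return [name for name, members in synsets.items()
            if word_a in members and word_b in members]


print(shared_synsets("shit", "defecate"))         # ['stool.v.04']
print(shared_synsets("shit", "diddlysquat"))      # ['jack.n.01']
print(shared_synsets("defecate", "diddlysquat"))  # []
```

Two words count as synonyms only if they share at least one synset, which is why the last lookup comes back empty even though both words are synonyms of the same third word.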

Back to the puzzle at hand, we need to see if any of the words from step 1 and step 2 are synonyms. I wrote the simple Python function check_synonym to take two words and see if the two are in any synonym sets together.

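from nltk.corpus import wordnet as wn

def check_synonym(word, word2):
    """Return a (word, word2) tuple for each synset of word that also contains word2."""
    l_syns = list()
    # Each synset is one sense of `word`, so look through every one of them.
    synsets = wn.synsets(word)
    for synset in synsets:
        # lemma_names is an attribute in the NLTK release used in this post;
        # NLTK 3 and later expose it as a method: synset.lemma_names().
        if word2 in synset.lemma_names:
            l_syns.append( (word, word2) )
    return l_syns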
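Steps 1 and 2 produce the word pairs that get fed to check_synonym. The following is a rough sketch of how those pairs could be generated, written for Python 3 and not taken from the linked code. The names candidate_pairs and SAMPLE_WORDS are hypothetical, the tiny word list stands in for a full English dictionary, and replacing every ‘q’ (rather than just one) is an assumption.

```python
from itertools import permutations


def candidate_pairs(words):
    """Yield (q_word, anagram) pairs for steps 1 and 2 of the puzzle."""
    dictionary = set(w.lower() for w in words)
    for word in sorted(dictionary):
        # Step 1: keep only six-letter words that contain a 'q'.
        if len(word) != 6 or "q" not in word:
            continue
        # Step 2: swap 'q' for 'n' (assumption: every 'q' is replaced),
        # then try every ordering of the six letters.
        swapped = word.replace("q", "n")
        anagrams = {"".join(p) for p in permutations(swapped)}
        # Keep only the orderings that are real words.
        for anagram in sorted(anagrams & dictionary):
            yield word, anagram


# Hypothetical stand-in for a real dictionary file.
SAMPLE_WORDS = ["queasy", "uneasy", "quartz", "banana", "square"]
print(list(candidate_pairs(SAMPLE_WORDS)))  # [('queasy', 'uneasy')]
```

A six-letter word has at most 720 orderings, so brute-force permutations are cheap here. Each pair that comes out would then be passed to check_synonym for step 3.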

Going through all the word pairs generated in steps 1 and 2, I find there is only one pair of words that satisfies all the constraints – uneasy and queasy. Below are the synonym sets that contain queasy:

>>> syn_sets = wn.synsets('queasy')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('nauseating.s.01') synonyms: ['nauseating', 'nauseous', 'noisome', 'queasy', 'loathsome', 'offensive', 'sickening', 'vile']
Synset('nauseated.s.01') synonyms: ['nauseated', 'nauseous', 'queasy', 'sick', 'sickish']
Synset('anxious.s.02') synonyms: ['anxious', 'nervous', 'queasy', 'uneasy', 'unquiet']

As we can see, queasy and uneasy are both members of the ‘anxious.s.02’ synonym set. This is just the tip of the iceberg of what NLTK and WordNet can do. Both are used in cutting-edge research and in applications that use computers to read and understand text data. IBM’s Watson is one bleeding-edge example of what is possible with computers and natural language processing. Cool, huh?

Full code is available on GitHub here.