Idris Raja

Archive for the ‘NPR’ Category

NPR Puzzle: Finding Synonyms with Python and WordNet

In github, hacking, NPR, python, word_puzzle on February 1, 2011 at 8:45 am

This week’s puzzle asks:

From Alan Meyer of Newberg, Ore.: Think of a common word that’s six letters long and includes a Q. Change the Q to an N, and rearrange the result to form a new word that’s a synonym of the first one. What are the words?

This puzzle is a good opportunity to play with some very cool computational language tools available through the Natural Langauge Toolik (NLTK). NLTK is a group of libraries and functions that contain powerful tools for symbolic and statistical natural language processing (NLP). Check out the free NLTK book and go through the first chapter and the examples and tutorials will blow your mind. Be forewarned and/or foredelighted that the book contains heavy amounts of linguistic jargon, stats, and programming.

Before we get to using the NLTK, let’s break down this puzzle into multiple steps.

1. Think of a common word that’s six letters long and includes a Q.
2. Change the Q to an N, and rearrange the result to form a new word
3. that’s a synonym of the first one

Approach:
1. This is straight forward. We simply go through the English dictionary and grab any words with a ‘q’ that are 6 letters long.

2. This is also straight forward. We swap the ‘q’ with an ‘n’, then create all possible anagrams and check if any of those anagrams are in the English dictionary.

3. This step is the fun one. It requires the computer to understand the meaning of the words from step 1 and 2 and see if they are synonyms. How can a computer possibly do this? Here is when to use the NLTK. Included as part of the NLTK is WordNet, a lexical database, that among many other things, groups synonyms into groups called synsets. But it isn’t just a dumb list of words with similar meanings, because the same word can be used to mean very different things, and two synonyms of a word might not be synonyms for each other.

This is best demonstrated through example and the use of one our language’s most versatile words – shit.

idris@idris-laptop:~$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import wordnet as wn
>>> syn_sets = wn.synsets('shit')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('crap.n.01') synonyms:['crap', 'dirt', 'shit', 'shite', 'poop', 'turd']
Synset('bullshit.n.01') synonym:['bullshit', 'bull', 'Irish_bull', 'horseshit', 'shit', 'crap', 'dogshit']
Synset('jack.n.01') synonyms:['jack', 'doodly-squat', 'diddly-squat', 'diddlysquat', 'diddly-shit', 'diddlyshit', 'diddly', 'diddley', 'squat', 'shit']
Synset('shit.n.04') synonyms:['shit', 'dump']
Synset('asshole.n.01') synonyms:['asshole', 'bastard', 'cocksucker', 'dickhead', 'shit', 'mother_fucker', 'motherfucker', 'prick', 'whoreson', 'son_of_a_bitch', 'SOB']
Synset('damn.n.01') synonyms:['damn', 'darn', 'hoot', 'red_cent', 'shit', 'shucks', "tinker's_damn", "tinker's_dam"]
Synset('denounce.v.04') synonyms:['denounce', 'tell_on', 'betray', 'give_away', 'rat', 'grass', 'shit', 'shop', 'snitch', 'stag']
Synset('stool.v.04') synonyms:['stool', 'defecate', 'shit', 'take_a_shit', 'take_a_crap', 'ca-ca', 'crap', 'make']

So what I’ve done here is started Python in the terminal, and then installed the wordnet module from nltk.corpus. To show off WordNet, I use it to show all synonym sets which has ‘shit’ as a member. The member words of a particular synonym set are all synonyms of each other. For example, the synset named ‘stool.v.04’ contains all the words that are synonyms of the word stool in one of its verb forms, including stool, defecate, shit, take a shit, take a crap, ca ca, crap and make. Similarly, we see that shit is a member of synset ‘jack.n.01’, as in you aint got shit. Notice how the words in these two synsets are not synonyms of each other, for example, defecate from ‘stool.v.04’ is not equivalent in meaning to diddlysquat in ‘jack.n.01’. But those words are both synonyms for at least one meaning of shit.

Back to the puzzle at hand, we need to see if any of the words from step 1 and step 2 are synonyms. I wrote the simple Python function check_synonym to take two words and see if the two are in any synonym sets together.

def check_synonym(word, word2):
    """checks to see if word and word2 are synonyms"""
    l_syns = list()
    synsets = wn.synsets(word)
    for synset in synsets:
        if word2 in synset.lemma_names:
            l_syns.append( (word, word2) )
    return l_syns

Going through all the word pairs generated in step 1 and 2, I find there is only one pair of words that satisfy all constraints – uneasy and queasy. Below are the synonym sets that contain queasy:

>>> syn_sets = wn.synsets('queasy')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('nauseating.s.01') synonyms: ['nauseating', 'nauseous', 'noisome', 'queasy', 'loathsome', 'offensive', 'sickening', 'vile']
Synset('nauseated.s.01') synonyms: ['nauseated', 'nauseous', 'queasy', 'sick', 'sickish']
Synset('anxious.s.02') synonyms: ['anxious', 'nervous', 'queasy', 'uneasy', 'unquiet']

As we can see, queasy and uneasy are both members of the ‘anxious.s.02’ synonym set. This is just the tip of the iceberg of what NLTK and WordNet can do. Both are used for cutting edge research and applications that use computers to read and understand text data. IBM’s Watson is one bleeding edge example of what is possible with computers and natural language processing. Cool, huh?

Full code available here at github here.

Advertisements

NPR Will Shortz word puzzle 1/16/2011, Solution on GitHub!

In github, hacking, NPR, python, word_puzzle on January 22, 2011 at 12:12 am

This week’s puzzle:

From listener Mike Shteyman of Reisterstown, Md.: Take the first seven letters of the alphabet, A through G, change one of these letters to another letter that is also either A, B, C, D, E, F or G. Rearrange the result to spell a familiar seven-letter word. What word is it?

This puzzle was an easy one to solve with Python and I don’t think I could have solved it the old-fashioned way of ‘only’ using my brain.

I’ve been attemting to solve these puzzles for a few months now, and I’ve had to write certain functions over and over, so I took the time this week to consolidate some of the more common functions I’ve used into a utility file, util.py.

I also went through the awesome Git tutorial Git Immersion, with much thanks to EdgeCase Software Artisans and Jim Weirich. With a decent grasp of the basics of Git, I decided to put all the code up on GitHub here. From now on I’ll use git for any coding I do, and I’ll probably end up hosting a bunch of it as publicly available code on GitHub.

Back to this week’s puzzle – the first step is to create all possible strings that start with ‘abcdefg’ and then swap one of those letters with a letter from the same string. We replace each character (7 total in ‘adcdefg’) with 6 possible replacements (any character expect the original one), and have a total of 42 (6*7) strings to test for anagrams.

Each of the 42 strings has one letter repeated twice. The number of arrangements/anagrams for a string of 7 characters with one repeat is 7!/2! = 2,520.

idris@idris-laptop:~/work/npr_puzzles$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.factorial(7) / math.factorial(2)
2520

So there are 2,520 * 42 = 105,840 anagrams to look up in the dictionary for existence, which takes less than a second to execute.

The only answer that showed up was “feedbag”, which you can get by replacing the “c” in “abcdefg” with an “e” which gets you ‘abedefg’ which can be rearranged to “feedbag.”

The puzzle said the answer should be “a familiar seven-letter word.” I suppose “feedbag” qualifies if you, like puzzle submitter Mike Shteyman, hail from the unincorporated Maryland town of Reisterstown.

Looking forward to see if the answer is correct, next week’s puzzle, and sharing more code on GitHub!

NPR Will Shortz Puzzle: 12/26/2010

In hacking, NPR, word_puzzle on January 5, 2011 at 11:26 pm

This week’s puzzle was a fun one.

Name a famous American from the past who has seven letters in his or her last name. Take the last two letters, plus the first four letters, in that order, and you’ll name that person’s profession. Who is it?

There are several parts to this puzzle and it’s easiest to tackle each one individually.

1) Think of a famous American whose last name is seven letters long.     This part is very broad, and there are possibly thousands or tens of thousands of names that can fit this description. My brain certainly can’t easily categorize people into groups based on length of last name.

2) Take his / her last name, and move the last two letters in front of the first four letters and you have the person’s profession.

Again, moving the letters is difficult for my brain, and I couldn’t think of an easy way of doing this part in reverse: thinking of a profession, and then forming the last name to see if it matches a known famous American.

So, of course and as usual, I’ll use Python, command-line fu, Wikipedia, and an English word list to solve this puzzle.

Step 1 – We’re looking for a famous American, and therefore someone whose name is in Wikipedia. I assume that if you don’t have an English Wikipedia entry, you aren’t a famous American. In our Wikipedia list, underscores replaced spaces, into the form firstname_lastname. I ignore all entries that don’t have at least one underscore, as the entry is unlikely to be a name.

def has_underscore_middle(s):
    """returns true if string 's' has underscore not including end or beginning"""
    s = s[1:-1]
    return s.find('_') > -1

Next, I test the last word in the entry to see if it has seven letters. It’s possible that a name like John F. Kennedy, Jr. would be incorrectly eliminated because of the suffix Jr., but I choose to ignore such edge cases for now. If our method doesn’t yield a solution, we can go back and possibly make this refinement.

def has_right_length(s, l = 7):
    """return True if string 's' is correct length 'l'=7"""
    return len(s) == l

Next, I test to see that all characters in the last name are ASCII characters. Wikipedia entries can have grammar marks, numbers, foreign characters, Unicode characters, etc., all of which are unlikely to be contained in the last name of a famous American. Again, this assumption can be modified if this approach doesn’t yield a solution.

def has_ascii_letters_only(s):
    """return True if string 's' is made up only of ascii letters"""
    for char in s:
        if char not in string.ascii_letters:
            return False
    return True

Step 2 – Once we have a list of Wikipedia entries that look like names and have the correct last name length, we are ready to see if moving the letters around in the last name return a profession. I don’t have a list of professions, but I do have a list of all words in the English dictionary. This list from WordNet is specifically called the crossword dictionary as it is all valid entries in English crosswords. There are about 110,000 words in this list and I assume that the list of professions is a subset of this dictionary.

With that assumption, I can move the letters around in the last name and see if it is in the dictionary. If yes, I output the word from the dictionary, the last name, and the original Wikipedia entry onto one line in the output file.

def get_from_underscore(s):
    """accepts string 's' that is expected to have "_"
    return from "_" on, exclusive"""
    reverse = s[::-1]
    loc_ = reverse.find('_')
    return s[-loc_:]

def mix_last_name(s):
    """Takes a seven letter string 's' and creates new 6 letter string
    with last 2 letters of 's' + first four letters of 's'
    Ex: "abcdefg" --> "fgabcd" """
    return s[-2:] + s[:4]

Running Step 1 and Step 2 takes approximately 40 seconds. Now I need to manually scan the list of 1,507 entries and see if I can find a profession.

The file is unsorted, which I could fix in Python but instead I’ll quickly run the command line function sort and do

sort -o output_sort.txt output.txt

which sorts output.txt by the first letter in the line and puts the result in output_sort.txt.

Scanning through the list, we quickly get lucky and see several lines which found ‘author’ as the word in the dictionary. There are several Wikipedia entries that are variations on the name Henry David Thoreau, certainly a famous American and a likely answer for an NPR contest.

This solution required a few assumptions, and some a priori knowledge such as a general knowledge of professions and famous Americans, but otherwise everything else was outsourced to the computer and Wikipedia.

You can see the full code here at github.

Looking forward to next week’s puzzle which looks like a lot of fun, as we’ll have to find a way to find a two-word synonym for a single word, which will require a thesaurus and possibly more.