Idris Raja

Archive for the ‘github’ Category

NPR Puzzle: Finding Synonyms with Python and WordNet

In github, hacking, NPR, python, word_puzzle on February 1, 2011 at 8:45 am

This week’s puzzle asks:

From Alan Meyer of Newberg, Ore.: Think of a common word that’s six letters long and includes a Q. Change the Q to an N, and rearrange the result to form a new word that’s a synonym of the first one. What are the words?

This puzzle is a good opportunity to play with some very cool computational language tools available through the Natural Language Toolkit (NLTK). NLTK is a collection of libraries and functions for symbolic and statistical natural language processing (NLP). Check out the free NLTK book: the examples and tutorials in the first chapter alone will blow your mind. Be forewarned and/or foredelighted that the book contains heavy amounts of linguistic jargon, stats, and programming.

Before we get to using the NLTK, let’s break down this puzzle into multiple steps.

1. Think of a common word that’s six letters long and includes a Q.
2. Change the Q to an N, and rearrange the result to form a new word
3. that’s a synonym of the first one

Approach:
1. This is straightforward. We simply go through the English dictionary and grab any words with a ‘q’ that are 6 letters long.

2. This is also straightforward. We swap the ‘q’ with an ‘n’, then create all possible anagrams and check if any of those anagrams are in the English dictionary.

3. This step is the fun one. It requires the computer to understand the meaning of the words from steps 1 and 2 and see if they are synonyms. How can a computer possibly do this? This is where NLTK comes in. NLTK includes an interface to WordNet, a lexical database that, among many other things, organizes synonyms into sets called synsets. But it isn’t just a dumb list of words with similar meanings, because the same word can be used to mean very different things, and two synonyms of a word might not be synonyms of each other.

This is best demonstrated through example and the use of one of our language’s most versatile words: shit.

idris@idris-laptop:~$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import wordnet as wn
>>> syn_sets = wn.synsets('shit')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('crap.n.01') synonyms:['crap', 'dirt', 'shit', 'shite', 'poop', 'turd']
Synset('bullshit.n.01') synonyms:['bullshit', 'bull', 'Irish_bull', 'horseshit', 'shit', 'crap', 'dogshit']
Synset('jack.n.01') synonyms:['jack', 'doodly-squat', 'diddly-squat', 'diddlysquat', 'diddly-shit', 'diddlyshit', 'diddly', 'diddley', 'squat', 'shit']
Synset('shit.n.04') synonyms:['shit', 'dump']
Synset('asshole.n.01') synonyms:['asshole', 'bastard', 'cocksucker', 'dickhead', 'shit', 'mother_fucker', 'motherfucker', 'prick', 'whoreson', 'son_of_a_bitch', 'SOB']
Synset('damn.n.01') synonyms:['damn', 'darn', 'hoot', 'red_cent', 'shit', 'shucks', "tinker's_damn", "tinker's_dam"]
Synset('denounce.v.04') synonyms:['denounce', 'tell_on', 'betray', 'give_away', 'rat', 'grass', 'shit', 'shop', 'snitch', 'stag']
Synset('stool.v.04') synonyms:['stool', 'defecate', 'shit', 'take_a_shit', 'take_a_crap', 'ca-ca', 'crap', 'make']

So what I’ve done here is start Python in the terminal and import the wordnet module from nltk.corpus. To show off WordNet, I use it to list every synonym set that has ‘shit’ as a member. The member words of a particular synonym set are all synonyms of each other. For example, the synset named ‘stool.v.04’ contains all the words that are synonyms of the word stool in one of its verb senses, including stool, defecate, shit, take a shit, take a crap, ca-ca, crap and make. Similarly, we see that shit is a member of synset ‘jack.n.01’, as in you ain’t got shit. Notice that the words in these two synsets are not synonyms of each other: defecate from ‘stool.v.04’ is not equivalent in meaning to diddlysquat in ‘jack.n.01’. But both words are synonyms for at least one meaning of shit.

Back to the puzzle at hand, we need to see if any of the words from step 1 and step 2 are synonyms. I wrote the simple Python function check_synonym to take two words and see if the two are in any synonym sets together.

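from nltk.corpus import wordnet as wn

def check_synonym(word, word2):
    """Return a (word, word2) pair for each synset of word that also contains word2."""
    l_syns = list()
    synsets = wn.synsets(word)
    for synset in synsets:
        # lemma_names is an attribute in the NLTK version used in this post;
        # in NLTK 3 and later it is a method: synset.lemma_names()
        if word2 in synset.lemma_names:
            l_syns.append( (word, word2) )
    # An empty list means the two words share no synset.
    return l_syns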
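Steps 1 and 2 of the approach can be sketched as follows. This is a minimal illustration, not the code from the repository: the function name candidate_pairs and the four-word sample list are made up for the example, and instead of generating every anagram it compares sorted letters, which finds the same matches. It assumes the word list is a lowercase Python list.

```python
from collections import defaultdict

def candidate_pairs(words):
    """Return (q_word, n_word) pairs: a six-letter word containing 'q', and a
    word spelled from the same letters once one 'q' has been changed to 'n'."""
    # Index every six-letter word by its sorted letters, so that all
    # anagrams of a word share one dictionary key.
    by_letters = defaultdict(list)
    for word in words:
        if len(word) == 6:
            by_letters["".join(sorted(word))].append(word)

    pairs = []
    for word in words:
        if len(word) == 6 and "q" in word:
            swapped = word.replace("q", "n", 1)  # change the first q to an n
            key = "".join(sorted(swapped))
            for match in by_letters.get(key, []):
                pairs.append((word, match))
    return pairs

# Hypothetical sample; a real run would load a full dictionary file instead.
sample_words = ["queasy", "uneasy", "quartz", "banana"]
pairs = candidate_pairs(sample_words)
```

Each pair that comes back would then be handed to check_synonym for step 3.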

Going through all the word pairs generated in steps 1 and 2, I find there is only one pair of words that satisfies all constraints: uneasy and queasy. Below are the synonym sets that contain queasy:

>>> syn_sets = wn.synsets('queasy')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('nauseating.s.01') synonyms: ['nauseating', 'nauseous', 'noisome', 'queasy', 'loathsome', 'offensive', 'sickening', 'vile']
Synset('nauseated.s.01') synonyms: ['nauseated', 'nauseous', 'queasy', 'sick', 'sickish']
Synset('anxious.s.02') synonyms: ['anxious', 'nervous', 'queasy', 'uneasy', 'unquiet']
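The membership test can also be tried without WordNet installed. The sketch below copies the three synsets printed above into a plain dictionary; the names QUEASY_SYNSETS and shared_synsets are made up for this illustration and are not part of NLTK.

```python
# The three synsets containing 'queasy', copied from the session output above.
QUEASY_SYNSETS = {
    "nauseating.s.01": ["nauseating", "nauseous", "noisome", "queasy",
                        "loathsome", "offensive", "sickening", "vile"],
    "nauseated.s.01": ["nauseated", "nauseous", "queasy", "sick", "sickish"],
    "anxious.s.02": ["anxious", "nervous", "queasy", "uneasy", "unquiet"],
}

def shared_synsets(word1, word2, synsets=QUEASY_SYNSETS):
    """Return the sorted names of the synsets that contain both words."""
    return sorted(
        name for name, lemmas in synsets.items()
        if word1 in lemmas and word2 in lemmas
    )

queasy_uneasy = shared_synsets("queasy", "uneasy")  # one shared synset
sick_uneasy = shared_synsets("sick", "uneasy")      # no shared synset
```

Both sick and uneasy are synonyms of queasy, yet they share no synset with each other, which is the "two synonyms of a word might not be synonyms" point from earlier.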

As we can see, queasy and uneasy are both members of the ‘anxious.s.02’ synonym set. This is just the tip of the iceberg of what NLTK and WordNet can do. Both are used in cutting-edge research and applications that use computers to read and understand text data. IBM’s Watson is one bleeding-edge example of what is possible with computers and natural language processing. Cool, huh?

Full code is available on GitHub here.

NBA Champs Team Age

In github, hacking, python, R on January 31, 2011 at 7:15 am

For a while now I’ve wanted to start mucking with the free open source software (FOSS) stats and graphing program R. I needed a dataset to mess around with, so I scraped the season box scores for every NBA regular season from 1950 to 2010 from basketball-reference.com.

I used Python to calculate an average age for each team by weighting each player’s age by the proportion of minutes played compared to the team’s total minutes for the regular season. For calculation purposes I used the player’s integer age on February 1st of the season.
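A minimal sketch of that calculation is below. It is not the original script: the function names and the two-player example are invented for illustration, and it assumes "February 1st of the season" means February 1 of the calendar year in which the season ends.

```python
from datetime import date

def age_on_feb_1(birth_date, season_end_year):
    """Integer age on February 1 of the season's ending calendar year
    (assumed interpretation of 'February 1st of the season')."""
    cutoff = date(season_end_year, 2, 1)
    had_birthday = (cutoff.month, cutoff.day) >= (birth_date.month, birth_date.day)
    return cutoff.year - birth_date.year - (0 if had_birthday else 1)

def weighted_team_age(players):
    """Minutes-weighted average age for one team-season.

    players is a list of (age, minutes_played) pairs."""
    total_minutes = sum(minutes for _, minutes in players)
    if total_minutes == 0:
        raise ValueError("team has no recorded minutes")
    # Each age counts in proportion to that player's share of team minutes.
    return sum(age * minutes for age, minutes in players) / float(total_minutes)

# Made-up example: a 30-year-old with 3000 minutes, a 24-year-old with 1000.
example_age = weighted_team_age([(30, 3000), (24, 1000)])
```

Because the weights are minutes played, a veteran star who plays heavy minutes pulls the team figure up far more than an older player at the end of the bench.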

The interesting thing to do with this data is to isolate the championship teams and see the age trends for the different dynasties. From the 1950s to the 1990s, each team gets older as it continues to win championships. This is not surprising, as each dynasty has a core of star players who get a year older each season. These core players play the majority of minutes and heavily weight the average age. Both role players and bench players are usually recycled, with older ones being replaced by younger ones. This is why the average age goes up by less than a year for each consecutive season.

The Celtics Dynasty of the 1950s and 1960s runs from 1957 to 1969, a stretch of 13 seasons in which they won 11 championships, beating the Lakers in the Finals seven times. The Celtics’ average age was 27.0 in 1957 and 30.4 in 1969. Their 1969 team was the oldest ever to win a championship until the last two Michael Jordan Bulls teams, in 1997 and 1998.

The Larry Bird Celtics won the championship three times between 1981 and 1986, a period in which they aged from 27.0 to 29.3. The Showtime Lakers won five times between 1980 and 1988, aging from 26.2 to 28.9. The Lakers’ increase wasn’t consistent, most likely due to the decreasing minutes of Kareem Abdul-Jabbar, who was 40 in 1988 and one of the oldest players ever to make a meaningful contribution to his team at that age (18.2 ppg, 10.9 rebs in 1988).

The Bulls teams of the 1990s won three-peats from 1991 to 1993 and again from 1996 to 1998, after Jordan’s first retirement and comeback. The first Bulls championship team of 1991 was young, just 26.9. In 1992 and 1993, they were 27.6 and 28.0 respectively. By the start of their second three-peat in 1996, they were already one of the oldest championship teams ever. I don’t think the Bulls could have won that second three-peat if not for the year and a half Jordan took off from basketball and ‘rested’ his legs playing baseball. It is inconceivable that Jordan could have played nearly a decade of 100-plus-game seasons and still managed to win six championships. In retrospect, given his age and the Bulls’ age during that second three-peat, his first retirement was an ingenious move. The only other player on all six championship teams was Scottie Pippen, who didn’t take any time off in the 1990s and who nearly carried the Bulls sans Jordan to the Eastern Conference Finals in 1994. During the championship runs of the 1990s, Pippen averaged 79 games started a season until the 1998 championship season, when age and fatigue finally caught up with him and he started only 44 games.

The Bulls broke up after the 1998 season, and the entire city of Chicago vilified general manager Jerry Krause for not bringing back the team nucleus of Coach Phil Jackson, Michael Jordan and Scottie Pippen. The Bulls were probably too old and tired to win again in 1999, but no one knew that 1999 would be a lockout-shortened 50-game season. The shortened season and the extra three months of rest would have been exactly what the very old, hypothetical 1999 Bulls needed for a chance at the first four-peat since the 1960s Celtics.

The San Antonio Spurs won four times between 1999 and 2007, and if we exclude the anomalous 1999 lockout season, they show the same pattern of getting older.

The only exception so far is the three-peat Shaq and Kobe Lakers of the early 2000s, who actually got younger. The two-time defending champion Lakers aged from 27.4 in 2009 to 28.4 in 2010, and through 57 games of the 2011 season they are 30.3, more than a year older than in 2010. Seven of the top eight Lakers, as measured by minutes played, are 30 or over. This year looks like the last gasp of the current Lakers team and Kobe Bryant, with Phil Jackson set to retire and the nucleus (Bryant, Gasol, Odom, Fisher, and Artest) all 30 or over.

For the R code used for this graph and the full final season box scores for all teams, click here.

NPR Will Shortz word puzzle 1/16/2011, Solution on GitHub!

In github, hacking, NPR, python, word_puzzle on January 22, 2011 at 12:12 am

This week’s puzzle:

From listener Mike Shteyman of Reisterstown, Md.: Take the first seven letters of the alphabet, A through G, change one of these letters to another letter that is also either A, B, C, D, E, F or G. Rearrange the result to spell a familiar seven-letter word. What word is it?

This puzzle was an easy one to solve with Python and I don’t think I could have solved it the old-fashioned way of ‘only’ using my brain.

I’ve been attempting to solve these puzzles for a few months now, and I’ve had to write certain functions over and over, so I took the time this week to consolidate some of the more common functions I’ve used into a utility file, util.py.

I also went through the awesome Git tutorial Git Immersion, with much thanks to EdgeCase Software Artisans and Jim Weirich. With a decent grasp of the basics of Git, I decided to put all the code up on GitHub here. From now on I’ll use git for any coding I do, and I’ll probably end up hosting a bunch of it as publicly available code on GitHub.

Back to this week’s puzzle: the first step is to create all possible strings that start from ‘abcdefg’ and have one letter replaced by another letter from the same string. We replace each of the 7 characters in ‘abcdefg’ with each of its 6 possible replacements (any character except the original one), for a total of 42 (6*7) strings to test for anagrams.
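Generating those 42 strings takes only a few lines. The sketch below is an illustration written for this post, not the code from util.py, and the function name one_letter_swaps is made up.

```python
def one_letter_swaps(base="abcdefg"):
    """Return every string made by replacing one letter of base with a
    different letter that also appears in base."""
    variants = []
    for i, original in enumerate(base):
        for replacement in base:
            if replacement != original:
                # Rebuild the string with position i changed.
                variants.append(base[:i] + replacement + base[i + 1:])
    return variants

variants = one_letter_swaps()
```

For ‘abcdefg’ the list has 7 * 6 = 42 entries, all different, since each one removes one letter and doubles another.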

Each of the 42 strings has one letter that appears twice. The number of distinct arrangements/anagrams of a 7-character string with one letter doubled is 7!/2! = 2,520.

idris@idris-laptop:~/work/npr_puzzles$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.factorial(7) / math.factorial(2)
2520

So there are 2,520 * 42 = 105,840 anagrams to look up in the dictionary, which takes less than a second to execute.
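The count and the lookup can both be checked by brute force. This is a sketch under assumptions: the function names are invented here, and the two-word set stands in for a real dictionary file.

```python
from itertools import permutations

def distinct_anagrams(letters):
    """Return the set of distinct arrangements of the given letters."""
    # permutations() treats the doubled letter as two separate items, so it
    # yields 7! = 5040 tuples; putting them in a set removes the duplicates.
    return set("".join(p) for p in permutations(letters))

def find_words(letters, dictionary):
    """Return the dictionary words that are anagrams of letters, sorted."""
    return sorted(distinct_anagrams(letters) & set(dictionary))

# Hypothetical stand-in for a full word list.
tiny_dictionary = {"feedbag", "cabbage"}
anagram_count = len(distinct_anagrams("abedefg"))
matches = find_words("abedefg", tiny_dictionary)
```

Here anagram_count comes out to 2,520, agreeing with 7!/2!, and matches contains only feedbag.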

The only answer that showed up was “feedbag”: replacing the “c” in “abcdefg” with an “e” gives ‘abedefg’, which can be rearranged to “feedbag.”

The puzzle said the answer should be “a familiar seven-letter word.” I suppose “feedbag” qualifies if you, like puzzle submitter Mike Shteyman, hail from the unincorporated Maryland town of Reisterstown.

Looking forward to seeing if the answer is correct, next week’s puzzle, and sharing more code on GitHub!