Idris Raja

Archive for the ‘hacking’ Category

NPR Puzzle: Finding Synonyms with Python and WordNet

In github, hacking, NPR, python, word_puzzle on February 1, 2011 at 8:45 am

This week’s puzzle asks:

From Alan Meyer of Newberg, Ore.: Think of a common word that’s six letters long and includes a Q. Change the Q to an N, and rearrange the result to form a new word that’s a synonym of the first one. What are the words?

This puzzle is a good opportunity to play with some very cool computational language tools available through the Natural Language Toolkit (NLTK). NLTK is a collection of libraries and programs with powerful tools for symbolic and statistical natural language processing (NLP). Check out the free NLTK book and go through the first chapter; the examples and tutorials will blow your mind. Be forewarned and/or foredelighted that the book contains heavy amounts of linguistic jargon, stats, and programming.

Before we get to using the NLTK, let’s break down this puzzle into multiple steps.

1. Think of a common word that’s six letters long and includes a Q.
2. Change the Q to an N, and rearrange the result to form a new word
3. that’s a synonym of the first one

Approach:
1. This is straightforward. We simply go through the English dictionary and grab any six-letter words that contain a 'q'.

2. This is also straightforward. We swap the 'q' for an 'n', generate all possible anagrams, and check whether any of those anagrams are in the English dictionary.

3. This step is the fun one. It requires the computer to understand the meanings of the words from steps 1 and 2 and decide whether they are synonyms. How can a computer possibly do this? This is where NLTK comes in. Included as part of NLTK is WordNet, a lexical database that, among many other things, groups words into sets of synonyms called synsets. But it isn't just a dumb list of words with similar meanings, because the same word can be used to mean very different things, and two synonyms of a word might not be synonyms of each other.

This is best demonstrated through example and the use of one of our language's most versatile words – shit.

idris@idris-laptop:~$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import wordnet as wn
>>> syn_sets = wn.synsets('shit')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('crap.n.01') synonyms:['crap', 'dirt', 'shit', 'shite', 'poop', 'turd']
Synset('bullshit.n.01') synonyms:['bullshit', 'bull', 'Irish_bull', 'horseshit', 'shit', 'crap', 'dogshit']
Synset('jack.n.01') synonyms:['jack', 'doodly-squat', 'diddly-squat', 'diddlysquat', 'diddly-shit', 'diddlyshit', 'diddly', 'diddley', 'squat', 'shit']
Synset('shit.n.04') synonyms:['shit', 'dump']
Synset('asshole.n.01') synonyms:['asshole', 'bastard', 'cocksucker', 'dickhead', 'shit', 'mother_fucker', 'motherfucker', 'prick', 'whoreson', 'son_of_a_bitch', 'SOB']
Synset('damn.n.01') synonyms:['damn', 'darn', 'hoot', 'red_cent', 'shit', 'shucks', "tinker's_damn", "tinker's_dam"]
Synset('denounce.v.04') synonyms:['denounce', 'tell_on', 'betray', 'give_away', 'rat', 'grass', 'shit', 'shop', 'snitch', 'stag']
Synset('stool.v.04') synonyms:['stool', 'defecate', 'shit', 'take_a_shit', 'take_a_crap', 'ca-ca', 'crap', 'make']

So what I’ve done here is started Python in the terminal and imported the wordnet module from nltk.corpus. To show off WordNet, I use it to list all the synonym sets that have ‘shit’ as a member. The member words of a particular synonym set are all synonyms of each other. For example, the synset named ‘stool.v.04’ contains all the words that are synonyms of the word stool in one of its verb forms, including stool, defecate, shit, take a shit, take a crap, ca-ca, crap, and make. Similarly, we see that shit is a member of synset ‘jack.n.01’, as in you ain’t got shit. Notice how the words in these two synsets are not synonyms of each other; for example, defecate from ‘stool.v.04’ is not equivalent in meaning to diddlysquat in ‘jack.n.01’. But both words are synonyms for at least one meaning of shit.

Back to the puzzle at hand, we need to see if any of the words from step 1 and step 2 are synonyms. I wrote the simple Python function check_synonym to take two words and see if the two are in any synonym sets together.

from nltk.corpus import wordnet as wn

def check_synonym(word, word2):
    """checks to see if word and word2 are synonyms"""
    l_syns = list()
    synsets = wn.synsets(word)
    for synset in synsets:
        # word2 is a synonym of word if it shows up in any of word's synsets
        if word2 in synset.lemma_names:
            l_syns.append((word, word2))
    return l_syns
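For completeness, here is a minimal sketch of steps 1 and 2, i.e. generating the candidate word pairs that get fed to check_synonym. The word list filename below is hypothetical; any English word list with one word per line will do.

from itertools import permutations

# hypothetical word list, one lowercase word per line
with open('english_words.txt') as f:
    dictionary = set(line.strip().lower() for line in f)

# step 1: six-letter words containing a 'q'
q_words = [w for w in dictionary if len(w) == 6 and 'q' in w]

# step 2: swap the 'q' for an 'n' and keep any anagram that is also a word
pairs = []
for word in q_words:
    swapped = word.replace('q', 'n', 1)
    for perm in set(permutations(swapped)):
        candidate = ''.join(perm)
        if candidate in dictionary and candidate != word:
            pairs.append((word, candidate))

Each (word, candidate) pair then gets run through check_synonym.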

Going through all the word pairs generated in steps 1 and 2, I find there is only one pair of words that satisfies all the constraints – uneasy and queasy. Below are the synonym sets that contain queasy:

>>> syn_sets = wn.synsets('queasy')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('nauseating.s.01') synonyms: ['nauseating', 'nauseous', 'noisome', 'queasy', 'loathsome', 'offensive', 'sickening', 'vile']
Synset('nauseated.s.01') synonyms: ['nauseated', 'nauseous', 'queasy', 'sick', 'sickish']
Synset('anxious.s.02') synonyms: ['anxious', 'nervous', 'queasy', 'uneasy', 'unquiet']

As we can see, queasy and uneasy are both members of the ‘anxious.s.02’ synonym set. This is just the tip of the iceberg of what NLTK and WordNet can do. Both are used for cutting-edge research and applications that use computers to read and understand text data. IBM’s Watson is one bleeding-edge example of what is possible with computers and natural language processing. Cool, huh?

Full code available on GitHub here.

NBA Champs Team Age

In github, hacking, python, R on January 31, 2011 at 7:15 am

For a while now I’ve wanted to start mucking with the free open source software (FOSS) stats and graphing program R. I needed a dataset to mess around with, so I scraped the season box scores for every NBA regular season from 1950 to 2010 from basketball-reference.com.

I used Python to calculate an average age for each team by weighting each player’s age by the proportion of minutes he played relative to the team’s total minutes for the regular season. For calculation purposes I used the player’s integer age on February 1st of the season.
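The weighted average itself is a one-liner. Here’s a minimal sketch, assuming each player record is an (age, minutes_played) pair; that data structure is a hypothetical stand-in, not the actual format of the scraped box scores.

def team_average_age(players):
    """players: list of (age, minutes_played) tuples for one team's season.
    Each player's age is weighted by his share of the team's total minutes."""
    total_minutes = float(sum(minutes for _, minutes in players))
    return sum(age * minutes for age, minutes in players) / total_minutes

# e.g. a three-man roster: 30 yrs / 2000 min, 25 yrs / 1500 min, 22 yrs / 500 min
print(team_average_age([(30, 2000), (25, 1500), (22, 500)]))  # 27.125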

The interesting thing to do with this data is to isolate the championship teams and look at the age trends for the different dynasties. From the 1950s to the 1990s, each championship team gets older as it continues to win titles. This is not surprising, as each dynasty has a core of star players who get a year older each season. These core players play the majority of minutes and heavily weight the average age. Both role players and bench players are usually recycled, with older ones being replaced by younger ones. This is why the average age goes up by less than a year for each consecutive season.

The Celtics dynasty of the 1950s and 1960s starts in 1957 and runs to 1969, a stretch of 13 seasons in which they won 11 championships, beating the Lakers in the Finals seven times. The Celtics averaged 27.0 in 1957 and 30.4 in 1969. Their 1969 team was the oldest ever to win a championship until the last two Michael Jordan Bulls teams of 1997 and 1998.

The Larry Bird Celtics won the championship three times between 1981 and 1986, a period in which they aged from 27.0 to 29.3. The Showtime Lakers won five times between 1980 and 1988, aging from 26.2 to 28.9. The Lakers’ increase wasn’t consistent, most likely due to the decreasing minutes of Kareem Abdul-Jabbar, who was 40 in 1988 and one of the oldest players ever to provide a meaningful contribution to his team (18.2 ppg, 10.9 rebs in 1988).

The Bulls teams of the 1990s won three-peats from 1991 to 1993 and again from 1996 to 1998, after Jordan’s first retirement and comeback. The first Bulls championship team of 1991 was young, just 26.9. In 1992 and 1993, they were 27.6 and 28.0 respectively. By the start of their second three-peat in 1996, they were already one of the oldest championship teams ever. I don’t think the Bulls could have won that second three-peat if not for the year and a half Jordan took off from basketball and ‘rested’ his legs playing baseball. It is inconceivable that Jordan could have played nearly a decade of 100-plus-game seasons and still managed to win six championships. In retrospect, when examining his age and the Bulls’ age during that second three-peat, his first retirement was an ingenious move. The only other player on all six championship teams was Scottie Pippen, who didn’t take any time off in the 1990s, and who nearly carried the Bulls sans Jordan to the Eastern Conference Finals in 1994. During the championship runs of the 1990s, Pippen averaged 79 games started a season until the 1998 championship season, when age and fatigue finally caught up with him and he started only 44 games.

The Bulls broke up after the 1998 season, and the entire city of Chicago vilified general manager Jerry Krause for not bringing back the team nucleus of coach Phil Jackson, Michael Jordan, and Scottie Pippen. The Bulls were probably too old and tired to win again in 1999, but no one knew that 1999 would be a lockout-shortened 50-game season. The shortened season and the extra three months of rest would have been exactly what the very old, hypothetical 1999 Bulls needed for a shot at the first four-peat since the 1960s Celtics.

The San Antonio Spurs won four times between 1999 and 2007, and if we exclude the anomalous 1999 lockout season, they show the same pattern of getting older.

The only exception so far is the three-peat Shaq and Kobe Lakers of the early 2000s, who actually got younger. The two-time defending champion Lakers aged from 27.4 in 2009 to 28.4 in 2010, and sit at 30.3 through 57 games of the 2011 season, more than a year older than in 2010. Of the top eight Lakers as measured by minutes played, seven are 30 or over. This year looks like the last gasp of the current Lakers team and Kobe Bryant, with Phil Jackson set to retire and the nucleus (Bryant, Gasol, Odom, Fisher, and Artest) all 30 or over.

For the R code used for this graph and the full final season box scores for all teams, click here.

NPR Will Shortz word puzzle 1/16/2011, Solution on GitHub!

In github, hacking, NPR, python, word_puzzle on January 22, 2011 at 12:12 am

This week’s puzzle:

From listener Mike Shteyman of Reisterstown, Md.: Take the first seven letters of the alphabet, A through G, change one of these letters to another letter that is also either A, B, C, D, E, F or G. Rearrange the result to spell a familiar seven-letter word. What word is it?

This puzzle was an easy one to solve with Python and I don’t think I could have solved it the old-fashioned way of ‘only’ using my brain.

I’ve been attempting to solve these puzzles for a few months now, and I’ve had to write certain functions over and over, so I took the time this week to consolidate some of the more common functions I’ve used into a utility file, util.py.

I also went through the awesome Git tutorial Git Immersion, with much thanks to EdgeCase Software Artisans and Jim Weirich. With a decent grasp of the basics of Git, I decided to put all the code up on GitHub here. From now on I’ll use git for any coding I do, and I’ll probably end up hosting a bunch of it as publicly available code on GitHub.

Back to this week’s puzzle – the first step is to create all possible strings that start from ‘abcdefg’ and swap one of those letters for another letter from the same set. We replace each character (7 total in ‘abcdefg’) with 6 possible replacements (any character except the original one), for a total of 42 (7 * 6) strings to test for anagrams.

Each of the 42 strings has one letter repeated twice. The number of arrangements/anagrams for a string of 7 characters with one repeat is 7!/2! = 2,520.

idris@idris-laptop:~/work/npr_puzzles$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.factorial(7) / math.factorial(2)
2520

So there are 2,520 * 42 = 105,840 anagrams to look up in the dictionary for existence, which takes less than a second to execute.
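Here’s a minimal sketch of the whole search, assuming the English word list has already been loaded into a set named dictionary (a hypothetical name; the actual code on GitHub uses the util.py helpers):

from itertools import permutations

letters = 'abcdefg'

# the 42 candidate strings: replace each letter with one of the other six
candidates = []
for i, old in enumerate(letters):
    for new in letters:
        if new != old:
            candidates.append(letters[:i] + new + letters[i + 1:])

# look up every unique anagram of every candidate in the dictionary
answers = set()
for cand in candidates:
    for perm in set(permutations(cand)):
        word = ''.join(perm)
        if word in dictionary:
            answers.add(word)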

The only answer that showed up was “feedbag”: replace the “c” in “abcdefg” with an “e” to get ‘abedefg’, which can be rearranged to spell “feedbag.”

The puzzle said the answer should be “a familiar seven-letter word.” I suppose “feedbag” qualifies if you, like puzzle submitter Mike Shteyman, hail from the unincorporated Maryland town of Reisterstown.

Looking forward to seeing if the answer is correct, to next week’s puzzle, and to sharing more code on GitHub!

NPR Will Shortz Puzzle: 12/26/2010

In hacking, NPR, word_puzzle on January 5, 2011 at 11:26 pm

This week’s puzzle was a fun one.

Name a famous American from the past who has seven letters in his or her last name. Take the last two letters, plus the first four letters, in that order, and you’ll name that person’s profession. Who is it?

There are several parts to this puzzle and it’s easiest to tackle each one individually.

1) Think of a famous American whose last name is seven letters long.

This part is very broad, and there are possibly thousands or tens of thousands of names that can fit this description. My brain certainly can’t easily categorize people into groups based on length of last name.

2) Take his / her last name, and move the last two letters in front of the first four letters and you have the person’s profession.

Again, moving the letters is difficult for my brain, and I couldn’t think of an easy way of doing this part in reverse: thinking of a profession, and then forming the last name to see if it matches a known famous American.

So, of course and as usual, I’ll use Python, command-line fu, Wikipedia, and an English word list to solve this puzzle.

Step 1 – We’re looking for a famous American, and therefore someone whose name is in Wikipedia. I assume that if you don’t have an English Wikipedia entry, you aren’t a famous American. In the Wikipedia title list, underscores replace spaces, so names take the form firstname_lastname. I ignore all entries that don’t have at least one underscore, since such entries are unlikely to be names.

def has_underscore_middle(s):
    """returns true if string 's' has underscore not including end or beginning"""
    s = s[1:-1]
    return s.find('_') > -1

Next, I test the last word in the entry to see if it has seven letters. It’s possible that a name like John F. Kennedy, Jr. would be incorrectly eliminated because of the suffix Jr., but I choose to ignore such edge cases for now. If our method doesn’t yield a solution, we can go back and possibly make this refinement.

def has_right_length(s, l = 7):
    """return True if string 's' is correct length 'l'=7"""
    return len(s) == l

Next, I test to see that all characters in the last name are ASCII letters. Wikipedia entries can have punctuation marks, numbers, foreign characters, Unicode characters, etc., all of which are unlikely to be contained in the last name of a famous American. Again, this assumption can be modified if this approach doesn’t yield a solution.

import string

def has_ascii_letters_only(s):
    """return True if string 's' is made up only of ascii letters"""
    for char in s:
        if char not in string.ascii_letters:
            return False
    return True

Step 2 – Once we have a list of Wikipedia entries that look like names and have the correct last-name length, we are ready to see if moving the letters around in the last name returns a profession. I don’t have a list of professions, but I do have a list of all the words in the English dictionary. This list from WordNet is specifically called the crossword dictionary, as it contains all valid entries in English crosswords. There are about 110,000 words in this list, and I assume that the list of professions is a subset of this dictionary.

With that assumption, I can move the letters around in the last name and see if it is in the dictionary. If yes, I output the word from the dictionary, the last name, and the original Wikipedia entry onto one line in the output file.

def get_from_underscore(s):
    """accepts string 's' that is expected to contain "_";
    returns everything after the last "_" """
    reverse = s[::-1]
    loc_ = reverse.find('_')
    return s[-loc_:]

def mix_last_name(s):
    """Takes a seven letter string 's' and creates new 6 letter string
    with last 2 letters of 's' + first four letters of 's'
    Ex: "abcdefg" --> "fgabcd" """
    return s[-2:] + s[:4]
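Putting the helpers together, the driver loop might look something like this sketch. The titles iterable, the crossword_words set, and the output filename are hypothetical stand-ins; the actual script on GitHub differs in the details.

# 'titles' is an iterable of Wikipedia article titles and 'crossword_words'
# is a set built from the WordNet crossword dictionary; both loaded elsewhere
with open('output.txt', 'w') as out:
    for title in titles:
        if not has_underscore_middle(title):
            continue
        last_name = get_from_underscore(title).lower()
        if not has_right_length(last_name) or not has_ascii_letters_only(last_name):
            continue
        candidate = mix_last_name(last_name)
        if candidate in crossword_words:
            out.write('%s\t%s\t%s\n' % (candidate, last_name, title))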

Running Step 1 and Step 2 takes approximately 40 seconds. Now I need to manually scan the list of 1,507 entries and see if I can find a profession.

The file is unsorted, which I could fix in Python, but instead I’ll quickly run the command-line utility sort and do

sort -o output_sort.txt output.txt

which sorts the lines of output.txt alphabetically and puts the result in output_sort.txt.

Scanning through the list, we quickly get lucky and see several lines which found ‘author’ as the word in the dictionary. There are several Wikipedia entries that are variations on the name Henry David Thoreau, certainly a famous American and a likely answer for an NPR contest.

This solution required a few assumptions, and some a priori knowledge such as a general knowledge of professions and famous Americans, but otherwise everything else was outsourced to the computer and Wikipedia.

You can see the full code here at github.

Looking forward to next week’s puzzle, which looks like a lot of fun, as we’ll have to find a two-word synonym for a single word, which will require a thesaurus and possibly more.

Hacking another NPR word problem

In hacking, word_puzzle on December 28, 2010 at 1:44 am

Last week’s puzzle: Name a city in the United States that ends in the letter S. The city is one of the largest cities in its state. Change the S to a different letter and rearrange the result to get the state the city is in. What are the city and state?

Now this one is admittedly pretty easy. We know that we need a state with 7 letters, and after finding this very convenient list on Wikipedia, I was able to scan the table and see that the answer is Yonkers, New York in less than a minute. But what if we had a list of hundreds of states? How could we ‘hack’ this puzzle?

The first step is to grab that Wikipedia page and save it as a file locally. There are many ways to do this; my favorite from the command line:

wget "http://en.wikipedia.org/wiki/List_of_U.S._states'_largest_cities_by_population"

That will save the file in the local directory. I renamed the file to city.html.

The next step is to extract the information in the table. Load the page in Google Chrome and then right click anywhere on the table and choose ‘Inspect Element’. We can see that the state and city information is contained in a table with class = wikitable, and the individual rows are in tr tags.

After opening the page in Python and parsing it with etree.parse and etree.HTMLParser from the lxml module, we can use two XPath expressions to get at this information.

xp1 = '//table[@class="wikitable"]//tr' and

xp2 = './/td//a/text()'

The first captures all the tr tags in the table, and the second allows us to capture all the state and city information in the rows; a short sketch of this parsing step is below. From here it’s trivial, as we can simply see which states are seven letters long and have largest cities that are seven letters long – 13 in total. If that were too many, we would need to test which of those pairs share 6 of their 7 letters.
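Here is a rough sketch of that scraping step, assuming the page was saved as city.html; the cell layout assumed below (state first, then its largest city) may differ from the actual Wikipedia table.

from lxml import etree

tree = etree.parse('city.html', etree.HTMLParser())

xp1 = '//table[@class="wikitable"]//tr'
xp2 = './/td//a/text()'

pairs = []
for row in tree.xpath(xp1):
    cells = row.xpath(xp2)
    if len(cells) < 2:
        continue
    state, city = cells[0], cells[1]   # assumes state comes first in each row
    # the puzzle wants a 7-letter state and a 7-letter city ending in 's'
    if len(state.replace(' ', '')) == 7 and len(city) == 7 and city.lower().endswith('s'):
        pairs.append((city, state))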

I’ll leave that last check for now, as next week’s puzzle seems a bit more challenging.

Solving an NPR Will Shortz Word Puzzle with Python and Wikipedia

In hacking, word_puzzle on December 21, 2010 at 12:05 am

Word problems are fun. So is being lazy and using a computer and Wikipedia to solve said problem.

The problem at hand: From Ward Hartenstein of Rochester, N.Y.: Rearrange the letters of “Wayne Manor” to name two well-known American corporations, past or present. What corporations are they?

It is fairly easy to use a brute-force approach to calculate all possible anagrams of a relatively short string. For example, using a recursive permutations function (sketched below), we can create all possible arrangements of a string. The maximum number of anagrams for a string is n!, where n is the number of characters. If there are any character duplicates, or restrictions such as not being able to have a space at the beginning or end of the word, we have fewer anagram possibilities.
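A minimal recursive permutations function might look like this; it is a sketch, not necessarily the exact function used for the puzzle. Wrapping its output in set() removes the duplicates caused by repeated letters.

def permutations(s):
    """Recursively generate all arrangements of the string s (duplicates included)."""
    if len(s) <= 1:
        return [s]
    perms = []
    for i, char in enumerate(s):
        # fix one character and permute the remaining characters
        for rest in permutations(s[:i] + s[i + 1:]):
            perms.append(char + rest)
    return perms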

For “Wayne Manor”, ignoring the space for now, we have 10 characters, which give 10! = 3,628,800 possible arrangements. By using the Python module cProfile, we can measure how long this brute-force approach takes to find all anagrams of a string of length n. The time for a 10-character string is less than a minute; 12 characters would be more like 40 minutes. Our string is 10 characters, so we’re in luck.

We can see that the time to calculate all anagrams starts to get large, but for “Wayne Manor” it is acceptable to use this inelegant brute-force approach. Additionally, because there are character repeats in our string, two a’s and two n’s, the number of unique anagrams is reduced from 10! to 10!/(2!2!) = 907,200.

Now that we have the full 907,200 possible arrangements, we need to think about how many ways we can split the string into two-word combinations. Since we need two company names, I will assume that each is at least 3 characters long. That means we can place the space anywhere between the 3rd and 7th positions, giving 5 spots for the space and 907,200 * 5 = 4,536,000 different two-word combinations to check.

Now the obvious question – how do we possibly check 4.5 million possible answers? This is where a Wikipedia article title dump comes in handy. You can download the article title dump here; it contains over 7,000,000 article titles. Since we are dealing with American corporation names that are ostensibly well known, I’ll assume that the correct answers have their own Wikipedia entries and can be found among the 7,000,000 entries.

We can then load all of the Wikipedia titles into a Python dictionary and check whether any of the two-word combinations show up in Wikipedia; a sketch of this check is below. There are 4,250 anagrams of “Wayne Manor” that are made up of two words with both words present as Wikipedia article titles. At this point, we have the more manageable, if not enviable, task of manually scanning the 4,250 entries for two company names.
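Here’s a rough sketch of that check, assuming the 907,200 unique arrangements are already in a set named anagrams and the lowercased article titles in a set named titles (both hypothetical names):

# 'anagrams' holds the unique arrangements of 'waynemanor';
# 'titles' is a set of lowercased Wikipedia article titles, loaded elsewhere
matches = []
for anagram in anagrams:
    for split in range(3, 8):               # space after the 3rd through 7th character
        first, second = anagram[:split], anagram[split:]
        if first in titles and second in titles:
            matches.append((first, second))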

Again, I will assume that the company names will be easily apparent to me. After scrolling through for a few seconds, I see “Enron” as a clear possibility, and it is paired with the words amway, mwaya, ymawa, manwa, mawan, namaw, and waman. Amway looks the most promising, and after a quick search we confirm that it is indeed an American corporation.

While certainly not elegant, efficient, entirely reproducible without a priori knowledge, or devoid of significant human intervention, the solution works. Good enuf ;>