Idris Raja

Change of Address

In Uncategorized on April 13, 2012 at 5:42 am

This blog is no longer active and will be closed shortly.
Please visit my new address at idris.heroku.com

5+1 Quick Blurbs

In Uncategorized on April 19, 2011 at 5:51 am

Big data: Global good or zero-sum arms race?

Will the ability to do massive amounts of data analysis on relatively cheap commodity hardware cause further income inequality, with the inflows moving towards those able to harness this power and away from those who can’t, or will big data be a rising tide that lifts all boats?
Right now I think the former is true, but as big data starts to move away from social media and web ad optimizations and more towards agriculture, smart water, and smart energy, I hope the latter will prevail.

What Lucky People Do Different?
Be open to new experiences, embrace uncertainty, and if all else fails, just show up. You make your own luck.

Data hand tools: grep, awk, sed, rinse repeat
You’ve mastered SQL and use Excel with your eyes closed. But do you know the super-fast, ultra-powerful basics? Being on a Windows machine is no excuse: configure Cygwin and start to feel the power of tools built 30 years ago that you want in your toolbox: grep, awk, sed (and regex).

This is what cheap electronics look like.

My sister recently told me that my five year old nephew loves coming home and going straight to his math workbook to do his ‘work.’ So I thought I’d get him something to really stoke his love of math – a calculator. I know we’re at a pivot point for super-cheap computation, but this image brought that point home like nothing else I’ve seen.

Something for me to work on

Something for the robot to work on

NPR Puzzle: Finding Synonyms with Python and WordNet

In github, hacking, NPR, python, word_puzzle on February 1, 2011 at 8:45 am

This week’s puzzle asks:

From Alan Meyer of Newberg, Ore.: Think of a common word that’s six letters long and includes a Q. Change the Q to an N, and rearrange the result to form a new word that’s a synonym of the first one. What are the words?

This puzzle is a good opportunity to play with some very cool computational language tools available through the Natural Language Toolkit (NLTK). NLTK is a group of libraries and functions that contain powerful tools for symbolic and statistical natural language processing (NLP). Check out the free NLTK book and go through the first chapter; the examples and tutorials will blow your mind. Be forewarned and/or foredelighted that the book contains heavy amounts of linguistic jargon, stats, and programming.

Before we get to using the NLTK, let’s break down this puzzle into multiple steps.

1. Think of a common word that’s six letters long and includes a Q.
2. Change the Q to an N, and rearrange the result to form a new word
3. that’s a synonym of the first one

Approach:
1. This is straightforward. We simply go through the English dictionary and grab any words with a ‘q’ that are 6 letters long.

2. This is also straightforward. We swap the ‘q’ with an ‘n’, then create all possible anagrams and check if any of those anagrams are in the English dictionary.

3. This step is the fun one. It requires the computer to understand the meaning of the words from steps 1 and 2 and see if they are synonyms. How can a computer possibly do this? This is where NLTK comes in. Included as part of NLTK is WordNet, a lexical database that, among many other things, groups synonyms into sets called synsets. But it isn’t just a dumb list of words with similar meanings, because the same word can be used to mean very different things, and two synonyms of a word might not be synonyms of each other.

This is best demonstrated through example and the use of one of our language’s most versatile words – shit.

idris@idris-laptop:~$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import wordnet as wn
>>> syn_sets = wn.synsets('shit')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('crap.n.01') synonyms:['crap', 'dirt', 'shit', 'shite', 'poop', 'turd']
Synset('bullshit.n.01') synonyms:['bullshit', 'bull', 'Irish_bull', 'horseshit', 'shit', 'crap', 'dogshit']
Synset('jack.n.01') synonyms:['jack', 'doodly-squat', 'diddly-squat', 'diddlysquat', 'diddly-shit', 'diddlyshit', 'diddly', 'diddley', 'squat', 'shit']
Synset('shit.n.04') synonyms:['shit', 'dump']
Synset('asshole.n.01') synonyms:['asshole', 'bastard', 'cocksucker', 'dickhead', 'shit', 'mother_fucker', 'motherfucker', 'prick', 'whoreson', 'son_of_a_bitch', 'SOB']
Synset('damn.n.01') synonyms:['damn', 'darn', 'hoot', 'red_cent', 'shit', 'shucks', "tinker's_damn", "tinker's_dam"]
Synset('denounce.v.04') synonyms:['denounce', 'tell_on', 'betray', 'give_away', 'rat', 'grass', 'shit', 'shop', 'snitch', 'stag']
Synset('stool.v.04') synonyms:['stool', 'defecate', 'shit', 'take_a_shit', 'take_a_crap', 'ca-ca', 'crap', 'make']

So what I’ve done here is start Python in the terminal and then import the wordnet module from nltk.corpus. To show off WordNet, I use it to list all the synonym sets that have ‘shit’ as a member. The member words of a particular synonym set are all synonyms of each other. For example, the synset named ‘stool.v.04’ contains all the words that are synonyms of the word stool in one of its verb senses, including stool, defecate, shit, take a shit, take a crap, ca-ca, crap and make. Similarly, we see that shit is a member of synset ‘jack.n.01’, as in you ain’t got shit. Notice how the words in these two synsets are not synonyms of each other: for example, defecate from ‘stool.v.04’ is not equivalent in meaning to diddlysquat in ‘jack.n.01’. But those words are both synonyms for at least one meaning of shit.

Back to the puzzle at hand, we need to see if any of the words from step 1 and step 2 are synonyms. I wrote the simple Python function check_synonym to take two words and see if the two are in any synonym sets together.

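from nltk.corpus import wordnet as wn

def check_synonym(word, word2):
    """checks to see if word and word2 are synonyms"""
    l_syns = list()
    # Each synset of word is one sense of that word; the pair counts as
    # synonyms if word2 appears among that sense's lemma names.
    synsets = wn.synsets(word)
    for synset in synsets:
        # lemma_names is an attribute in the NLTK version used here;
        # NLTK 3 and later expose it as a method, synset.lemma_names().
        if word2 in synset.lemma_names:
            l_syns.append( (word, word2) )
    return l_syns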
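
For completeness, here is a rough sketch of how steps 1 and 2 might be done. It is not the code from the repository: the names letter_key, find_candidate_pairs and SAMPLE_WORDS are made up for illustration, and the tiny word list stands in for a real dictionary file. Instead of generating every anagram, it compares sorted letters, which finds the same pairs with less work.

```python
from collections import defaultdict


def letter_key(word):
    """Return the word's letters in sorted order, so anagrams share a key."""
    return "".join(sorted(word))


def find_candidate_pairs(words):
    """Return (q_word, n_word) pairs that satisfy steps 1 and 2 of the puzzle."""
    six_letter = sorted(set(w.lower() for w in words
                            if len(w) == 6 and w.isalpha()))

    # Index every six-letter word by its sorted letters.
    by_key = defaultdict(list)
    for w in six_letter:
        by_key[letter_key(w)].append(w)

    pairs = []
    for w in six_letter:
        if "q" not in w:
            continue
        # Step 2: change one 'q' to an 'n', then look up anagrams of the result.
        swapped = w.replace("q", "n", 1)
        for candidate in by_key.get(letter_key(swapped), []):
            pairs.append((w, candidate))
    return pairs


# Hypothetical mini word list standing in for a real dictionary file.
SAMPLE_WORDS = ["queasy", "uneasy", "quartz", "banana", "quiet", "squash"]
print(find_candidate_pairs(SAMPLE_WORDS))
```

Each pair this returns would then be handed to check_synonym for step 3.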

Going through all the word pairs generated in steps 1 and 2, I find there is only one pair of words that satisfies all constraints – uneasy and queasy. Below are the synonym sets that contain queasy:

>>> syn_sets = wn.synsets('queasy')
>>> for syn_set in syn_sets:
...     print '%s synonyms:\t%s' % (syn_set, syn_set.lemma_names)
Synset('nauseating.s.01') synonyms: ['nauseating', 'nauseous', 'noisome', 'queasy', 'loathsome', 'offensive', 'sickening', 'vile']
Synset('nauseated.s.01') synonyms: ['nauseated', 'nauseous', 'queasy', 'sick', 'sickish']
Synset('anxious.s.02') synonyms: ['anxious', 'nervous', 'queasy', 'uneasy', 'unquiet']

As we can see, queasy and uneasy are both members of the ‘anxious.s.02’ synonym set. This is just the tip of the iceberg of what NLTK and WordNet can do. Both are used for cutting-edge research and applications that use computers to read and understand text data. IBM’s Watson is one bleeding-edge example of what is possible with computers and natural language processing. Cool, huh?
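
If you don’t have WordNet installed, the shared-synset test can still be replayed by hand. The sketch below copies the three queasy synsets printed above into a plain dictionary; QUEASY_SYNSETS and shared_synsets are illustrative names, not part of NLTK, and the results only say what is true within this small table.

```python
# Synsets copied from the queasy output above; a stand-in for a live WordNet lookup.
QUEASY_SYNSETS = {
    "nauseating.s.01": ["nauseating", "nauseous", "noisome", "queasy",
                        "loathsome", "offensive", "sickening", "vile"],
    "nauseated.s.01": ["nauseated", "nauseous", "queasy", "sick", "sickish"],
    "anxious.s.02": ["anxious", "nervous", "queasy", "uneasy", "unquiet"],
}


def shared_synsets(word, word2, synsets=QUEASY_SYNSETS):
    """Return the names of the synsets in the table that contain both words."""
    return sorted(name for name, lemmas in synsets.items()
                  if word in lemmas and word2 in lemmas)


print(shared_synsets("queasy", "uneasy"))  # ['anxious.s.02']
print(shared_synsets("queasy", "sick"))    # ['nauseated.s.01']
print(shared_synsets("uneasy", "sick"))    # []
```

The last line is the same point made earlier: uneasy and sick are each a synonym of queasy, but within these synsets they are not synonyms of each other.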

Full code is available on GitHub.