Idris Raja

Archive for December, 2010|Monthly archive page

Hacking another NPR word problem

In hacking, word_puzzle on December 28, 2010 at 1:44 am

Last week’s puzzle:¬†Name a city in the United States that ends in the letter S. The city is one of the largest cities in its state. Change the S to a different letter and rearrange the result to get the state the city is in. What are the city and state?

Now this one is admittedly pretty easy. We know that we need a state with 7 letters, and after finding this very convenient list on Wikipedia, I was able to scan the table and see that the answer is Yonkers, New York in less than a minute. But what if we had a list of hundreds of states? How could we ‘hack’ this puzzle?

First step is to grab that wikipedia page and save it as a file locally. Many ways to do this, my favorite way from the command line:

wget “’_largest_cities_by_population”

That will save the file in the local directory. I remaned the file to city.html.

The next step is to extract the information in the table. Load the page in Google Chrome and then right click anywhere on the table and choose ‘Inspect Element’. We can see that the state and city information is contained in a table with class = wikitable, and the individual rows are in tr tags.

After opening the page in a Python file and parsing it with the functions etree.parse and the etree.HTMLParser from the lxml module, We can use two XPath’s to get at this information.

xps1 = “//table[@class=”wikitable”]//tr’ and

xp2s = ‘.//td//a/text()’

The first captures all the tr tags in the table, and the second allows us to capture all the state and city information in the rows. From here it’s trivial, as we can simply see which states are seven letters long and have cities that are seven letters long – 13 total. If that was too many, we would need test to see which of those pairs share 6 of 7 letters.

I’ll leave that for now as next week’s puzzle seems a bit more challenging.

Solving an NPR Will Shortz Word Puzzle with Python and Wikipedia

In hacking, word_puzzle on December 21, 2010 at 12:05 am

Word problems are fun. So is being lazy and using a computer and Wikipedia to solve said problem.

The problem at hand:¬†From Ward Hartenstein of Rochester, N.Y.: Rearrange the letters of “Wayne Manor” to name two well-known American corporations, past or present. What corporations are they?

It is fairly easy to use a brute force approach to calculate all possible anagrams of a relatively short string. For example, using the function permutations, we can use recursion to create all possible variations of a string. The maximum number of anagrams for a string is n!, where n is the number of characters. If there are any character duplicates, or restrictions, such as not being able to have a space at the beginning or end of the word, we have fewer anagram possibilities.

For “Wayne Manor”, ignoring the space for now, we have 10 characters which give 10! = 3,628,800 possible combinations. By using the python module cProfile, we can measure how long it takes to calculate a brute force approach to finding all anagrams of a string of length n. The time to calculate all anagram possibilities for a 10 character string is less than a minute; 12 characters would be more like 40 minutes. Our string is 10 characters, so we’re in luck.

We can see that the time to calculate all anagrams starts to get large, but for “Wayne Manor” it is acceptable to use this inelegant brute-force approach. Additionally, because there are character repeats in our string, two a’s and two n’s, the number of unique anagrams is reduced from 10! to 10!/(2!2!) = 907,200.

Now that we have the full 907,200 possible arrangements, we need to think about how many ways we can split the string into different two-word combinations. Since we need two company names, I will make the assumption that each will be at least 3 characters long. Therefore we can split the string with a space anywhere between the 3rd and 7th position. Therefore we have 5 spots to put the space and 907,200 * 5 = 4,536,000 different two word combinations to check.

Now the obvious question – how do we possibly check 4.5 million possible answers? This is where a wikipedia article title dump comes in handy. You can download the article title dump here, which contains over 7,000,000 article titles. Since we are dealing with American corporation names which are ostensibly well-known, I’ll assume that the correct answers have their own Wikipedia entries and can be found amongst the 7,000,000 entries.

We can then load all of the Wikipedia titles into a python dictionary, and then check to see if any of the two-word combinations show up in Wikipedia. There are 4,250 anagrams of “Wayne Manor” that are made up of two words with both words present as Wikipedia article titles. At this point, we have a more manageable, if not enviable, task of manually scanning the 4,250 entries for two company names.

Again, I will assume that the company names will be easily apparent to me. After scrolling through for a few seconds, I see “Enron” as a clear possibility, and it is paired with the words amway, mwaya, ymawa, manwa, mawan, namaw, and waman. Amway looks the most promising, and after a quick search we confirm that it is indeed an American corporation.

While certainly not elegant, efficient, entirely reproducible without a priori knowledge, or void of significant human intervention, the solution works. Good enuf ;>