CrashCodes.com

Recently I've been collecting data from Slashdot. Using a combination of C# (written in SharpDevelop 2.0) and MySQL, I check their RSS feed every four hours, update a couple of tables, and publish a summary of the top 100 words used. I still haven't mastered parsing words from the article descriptions. One thing in particular that I haven't figured out is how to determine whether a single quote character is being used as a quotation mark or as an apostrophe. As a result of this shortcoming, the letters 's' and 't' (which commonly follow the apostrophe in contractions) are on the top 100 list.
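One heuristic that might help with the apostrophe problem: treat a single quote as part of a word only when it has letters on both sides, and as a quotation mark otherwise. Here is a rough sketch of that idea, in Python rather than the C# my actual code uses. The names `WORD_RE`, `tokenize`, and `top_words` are made up for illustration, and the lowercase, ASCII-only matching is an assumption, not how the real parser works.

```python
import re
from collections import Counter

# A word is a run of letters, optionally followed by one or more
# "apostrophe + letters" groups (e.g. "haven't", "it's").
# A quote with no letter on one side is left out of the match,
# so it is treated as a quotation mark and dropped.
# Assumption: text is lowercased first and only a-z counts as a letter.
WORD_RE = re.compile(r"[a-z]+(?:['\u2019][a-z]+)*")

def tokenize(text):
    """Split text into lowercase words, keeping in-word apostrophes."""
    words = WORD_RE.findall(text.lower())
    # Normalize the curly apostrophe so "it’s" and "it's" count together.
    return [w.replace("\u2019", "'") for w in words]

def top_words(descriptions, n=100):
    """Count words across all descriptions and return the n most common."""
    counts = Counter()
    for description in descriptions:
        counts.update(tokenize(description))
    return counts.most_common(n)
```

With this rule, "isn't" and "user's" stay whole, so a lone 's' or 't' should no longer be counted. It is only a heuristic: a plural possessive such as "users'" loses its trailing apostrophe, and a quotation that opens right after a letter would be misread.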

Initially I wanted to use this data to come up with a way of detecting those infamous duplicate articles. Well, I haven't got that far yet. Maybe someday I'll get around to it.