Recently I've been collecting data from
Slashdot.
With a combination of C# (using
SharpDevelop 2.0) and
MySQL, every four hours
I check their RSS feed, update a couple of tables, and
publish a
summary of the
top 100 words used. I still haven't mastered parsing
words from the article descriptions. One thing in particular that
I haven't figured out is how to determine whether a single quote
character is being used as a quotation mark or as an
apostrophe. As a result of this shortcoming, the letters 's'
and 't' (which are commonly used in contractions) are on
the top 100 list.
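
One possible way around the apostrophe problem, sketched here in Python rather than the C# the collector is written in: treat a single quote as part of a word only when it has letters on both sides. The names `WORD_RE`, `tokenize`, and `top_words` are made up for this illustration and are not taken from the actual program. The rule is a heuristic, not a complete solution.

```python
import re
from collections import Counter

# A word is a run of letters, optionally followed by one or more
# "apostrophe + letters" groups (e.g. "don't", "slashdot's").
# A single quote with no letter on one side is treated as a
# quotation mark and is left out of the word.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Split text into lowercase words, keeping contractions whole."""
    # Normalize the typographic apostrophe to the ASCII one first.
    text = text.lower().replace("\u2019", "'")
    return WORD_RE.findall(text)


def top_words(descriptions, n=100):
    """Return the n most frequent words across all descriptions."""
    counts = Counter()
    for description in descriptions:
        counts.update(tokenize(description))
    return counts.most_common(n)


if __name__ == "__main__":
    print(tokenize("It's 'open source', isn't it?"))
    # prints ["it's", 'open', 'source', "isn't", 'it']
```

With this rule, "isn't" is counted as one word instead of "isn" and "t", so the stray 's' and 't' entries should drop off the list. It still gets some cases wrong: a leading apostrophe ('tis) or a trailing plural possessive (dogs') loses its apostrophe, although neither produces a stray single letter.
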
Initially I wanted to use this data to come up with a way
of detecting those infamous duplicate articles. Well, I
haven't got that far yet. Maybe someday I'll get around to
it.