I started off the afternoon familiarizing myself with jQuery in preparation to code Textsmith. After that, I started to ponder what rules should be used to clean up English text (to start off with). I thought I was smart to steer clear of the more complicated regions of language that I get myself tangled in at times, and to start with something seemingly simple: automatic capitalization of text.
Little did I realize that this in itself is no mean feat. Once we get past the easily programmable rules such as:
- The word I
- The first letter of each sentence
- One idea for mining proper nouns from Wikipedia: for each article, take its title.
- If the title is a single word, look at every instance of that word in the article that does not begin a sentence. If those instances are capitalized, it is a proper noun.
- If the title is a phrase, it is a proper noun if the phrase appears with the exact same capitalization throughout the article. An example of a multi-word proper noun on Wikipedia is New York University, while a non-proper noun would be data mining.
- We could also analyze related articles, though I haven't thought about what method of identifying related documents to use.
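A minimal sketch of the heuristic above, assuming crude sentence splitting and punctuation stripping (a real implementation would need proper sentence segmentation and tokenization; the function name is my own):

```python
import re

def is_proper_noun(title: str, article_text: str) -> bool:
    """Guess whether an article title is a proper noun.

    Multi-word title: proper noun if every occurrence in the article
    keeps the exact same capitalization as the title.
    Single-word title: proper noun if every non-sentence-initial
    occurrence is capitalized.
    """
    if " " in title:
        occurrences = re.findall(re.escape(title), article_text,
                                 flags=re.IGNORECASE)
        return bool(occurrences) and all(o == title for o in occurrences)

    # Crude sentence split on ., !, ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', article_text)
    matches = []
    for sentence in sentences:
        words = sentence.split()
        for word in words[1:]:  # skip the sentence-initial word
            cleaned = word.strip('.,;:()"')
            if cleaned.lower() == title.lower():
                matches.append(cleaned)
    return bool(matches) and all(w[0].isupper() for w in matches)
```

Running it on the two examples from the list: `is_proper_noun("New York University", ...)` comes back true for an article that always writes the phrase that way, while "data mining" fails as soon as it appears both lowercased mid-sentence and capitalized at a sentence start.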
However, there are still problems to be addressed:
- What about odd words like jQuery?
- While Wikipedia would provide fairly good coverage, it is definitely not an exhaustive representation of the text available online. Nor would the algorithm above be able to handle arbitrary unstructured text.
Assuming we establish a sufficiently comprehensive database of proper nouns, including names of people, products, etc., the next question is how to efficiently and correctly identify proper nouns in a body of text. We start to run into problems such as:
- Context awareness. The word dell is both a common noun and a proper noun, depending on the context. There is some nice (unpublished?) work done by Google, as demonstrated in this Google Wave video.
- Scale. There are approximately 3,892,495 articles on Wikipedia as of this writing. How many proper nouns are in there, and how many more are we missing?
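To make the "dell" ambiguity concrete, here is a naive, entirely hypothetical sketch: capitalize the word only when nearby cue words suggest the company rather than the common noun. The cue lists are invented for illustration; a real system would learn context from data rather than hard-code it.

```python
# Hypothetical cue words -- a real system would learn these from a corpus.
BRAND_CUES = {"laptop", "computer", "inspiron", "stock"}
NATURE_CUES = {"wooded", "valley", "hillside", "farmer"}

def should_capitalize_dell(sentence: str) -> bool:
    """Naive context check for the word 'dell': capitalize it when
    more brand cues than nature cues appear in the same sentence."""
    words = {w.strip('.,').lower() for w in sentence.split()}
    return len(words & BRAND_CUES) > len(words & NATURE_CUES)
```

So "i bought a dell laptop" would be capitalized, while "the farmer walked through the dell" would not. Even this toy version hints at why the problem is hard: the cues themselves are ambiguous, and every ambiguous word needs its own context model.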
Thankfully, there is a project called OpenCalais that works on semantically tagging text. It is free to use via an API but not completely open, which is a bit of a bummer from a philosophical point of view.
I didn’t realize that just thinking about capitalization could make my brain hurt so much. I also read up on some other nifty stuff on the topics of text analysis and how people are deriving information from unstructured text. There’s an interesting project by the AP called Overview that looks like it’s doing a pretty decent job of giving the user a high-level overview of a large corpus (hundreds of thousands) of documents. It actually feels tempting to deviate from the original goal of Textsmith.
So where do we go from here?
There is an overwhelming number of hard problems in the world. I’m not saying that to put undue pressure on myself, or to use it as an excuse to give up. I think problems matter differently to different people, and that is a contributing factor to the diversity that we get to enjoy. If everyone were an electrical engineer, or everyone were a writer, the world would be a lot poorer.