Wednesday, May 14, 2008

Ruby Tag Library Word::Tagger

I needed a very simple tagger to extract words of interest from a corpus of medical documents we have on revolutionhealth.com.

For this I wrote Word::Tagger, which is included in the rbtagger gem on RubyForge.

sudo gem install rbtagger


Word::Tagger expects a master list of tags and a window size. When executed, it stems the words in the document and slides a window over them, comparing the stemmed terms in the tag list against the stemmed words inside the window. Visually, the initial matching algorithm works like this:

[Figure: the initial matching algorithm]

A maximum number of matches can also be given, in which case the tagger reduces the number of tags by frequency of occurrence, keeping the most frequent matches.
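To make the matching and the frequency cap concrete, here is a minimal sketch in plain Ruby. It is not the gem's source: `naive_stem`, `sketch_tags`, and the `window:` and `max:` keywords are hypothetical names made up for this illustration, and the "stemmer" only lowercases and drops a trailing "s", where a real stemmer would do much more.

```ruby
# Stand-in for a real stemmer (assumption): lowercase and drop one trailing "s".
def naive_stem(word)
  word.downcase.sub(/s\z/, '')
end

# Hypothetical helper, not part of rbtagger.
# Slides a window of up to `window` words over the stemmed text and
# counts how often each (stemmed) tag phrase appears.
def sketch_tags(tag_list, text, window: 4, max: nil)
  # Map each stemmed tag phrase back to the tag as originally written.
  lookup = {}
  tag_list.each do |tag|
    key = tag.split.map { |w| naive_stem(w) }.join(' ')
    lookup[key] = tag
  end

  stems  = text.scan(/[[:alnum:]]+/).map { |w| naive_stem(w) }
  counts = Hash.new(0)

  stems.each_index do |start|
    1.upto(window) do |len|
      break if start + len > stems.length
      tag = lookup[stems[start, len].join(' ')]
      counts[tag] += 1 if tag
    end
  end

  # Most frequent first; ties keep the order of the master list.
  ranked = counts.sort_by { |tag, n| [-n, tag_list.index(tag)] }.map(&:first)
  max ? ranked.first(max) : ranked
end

p sketch_tags(['Cat', 'hat'], 'the cAt and the hat')
p sketch_tags(['Cat', 'hat'], 'cats and hats and more hats', max: 1)
```

Because every phrase of one to `window` words is checked at each position, multi-word tags such as "heart attack" can match as long as they fit inside the window.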

Using the tagger is easy:

tagger = Word::Tagger.new( ['Cat','hat'], :words => 4 )
tags = tagger.execute( 'the cAt and the hat' )
# tags == ["Cat", "hat"]: matching ignores case, and tags come back as written in the master list


I also include a part-of-speech tagger based on Eric Brill's tagger and the Perl module Lingua::BrillTagger, written by Ken Williams.

This tagger may eventually be used to further improve the word tagger. A few ideas come to mind, such as selecting only the words included in noun phrases, or using the part-of-speech tags to reduce the number of matched terms for larger documents.
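As a rough illustration of the first idea, the sketch below keeps only the words tagged as nouns. It makes several assumptions: the `[word, tag]` pairs are written by hand rather than produced by the tagger, the tags follow the Penn Treebank convention where noun tags start with "NN", and `noun_words` is a hypothetical helper. It also filters single nouns, which is simpler than finding whole noun phrases.

```ruby
# Hypothetical helper: keep only words whose part-of-speech tag marks a noun.
# Assumes Penn Treebank style tags (NN, NNS, NNP, NNPS).
def noun_words(tagged_words)
  tagged_words.select { |_word, pos| pos.start_with?('NN') }.map(&:first)
end

# Hand-written pairs standing in for real tagger output (assumption).
tagged = [['the', 'DT'], ['cat', 'NN'], ['wore', 'VBD'], ['hats', 'NNS']]
p noun_words(tagged)
```

The surviving words could then be passed to the word tagger in place of the full document text.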
