Saturday, January 26, 2008

Guessing the text's language (raspell)

For work I have been importing a fairly large collection of documents and building a search index over them with Sphinx to make them easier to find. Today I realized that a fairly large number of the documents are in Spanish. I want to maintain two search indexes in this case, one for English and one for Spanish, but many of the documents don't specify their language.

I decided that a pretty simple way to classify a body of text as one language or the other would be to look its words up in a dictionary for each language. At first I looked into using /usr/share/dict/words, but which language that file contains depends on the user's OS. It also meant that for each language I would have to obtain a copy of that dictionary... Probably not that difficult to write, but then I found raspell, a nice little wrapper around aspell that makes it super easy to check for a word in nearly any language.

Below is my first pass at a simple language classifier. It has one obvious issue: if two dictionaries match the same number of words, there's no way to disambiguate. For now I'm happy with this solution...

require 'rubygems'
require 'raspell'

module Language
  class Classifier
    # languages: aspell dictionary names, e.g. "en_US", "es"
    def initialize( *languages )
      @dics = []
      languages.each do |lang|
        speller = Aspell.new(lang)
        speller.suggestion_mode = Aspell::ULTRA
        @dics << { :check => speller, :lang => lang }
      end
    end

    def likely_language( text )
      # sample the first 100 words and, for longer texts, the last 100 words
      words = text.split(' ')
      return "unknown" if words.empty?
      lang1 = simple_classify( words.first(100) )
      # short texts have no separate tail sample, so trust the single result
      # (classifying an empty tail would give an arbitrary language)
      return lang1 if words.size < 100
      lang2 = simple_classify( words.last(100) )
      # only report a language when both samples agree
      lang1 == lang2 ? lang1 : "unknown"
    end

    def simple_classify( words )
      # count how many of the words each dictionary recognizes
      rankings = @dics.collect do |dic|
        matching = 0
        words.each do |word|
          matching += 1 if dic[:check].check(word)
        end
        [matching, dic[:lang]]
      end
      # highest match count wins; ties are not disambiguated
      rankings.max_by { |score| score.first }.last
    end

  end
end

=begin
classifier = Language::Classifier.new("en_US", "es")
t = Time.now
10.times do
  puts classifier.likely_language(File.read("spanish.txt")).inspect
  puts classifier.likely_language(File.read("english.txt")).inspect
end
puts(Time.now - t)
=end
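
One way to deal with the tie issue mentioned earlier would be to report "unknown" when the two best dictionaries match the same number of words, rather than picking one arbitrarily. This is only a rough sketch under some assumptions: it uses small in-memory word sets instead of aspell so that it runs without raspell, and the names TieAwareClassifier, rank and classify, along with the tiny word lists, are made up for illustration and are not part of raspell.

```ruby
require 'set'

# Hypothetical sketch: each dictionary is a plain Set of lowercase words,
# standing in for the Aspell spellers used in the classifier above.
class TieAwareClassifier
  # dictionaries: a Hash of language name => Set of known words
  def initialize(dictionaries)
    @dictionaries = dictionaries
  end

  # Returns [language, matches] pairs, best score first.
  def rank(words)
    scores = @dictionaries.map do |lang, vocab|
      [lang, words.count { |word| vocab.include?(word.downcase) }]
    end
    scores.sort_by { |_lang, matches| -matches }
  end

  # Returns the top language, or "unknown" when nothing matches or the
  # two best languages match the same number of words.
  def classify(words)
    ranked = rank(words)
    return "unknown" if ranked.empty?
    best_lang, best = ranked[0]
    return "unknown" if best.zero?
    return "unknown" if ranked.size > 1 && ranked[1][1] == best
    best_lang
  end
end

# Made-up miniature dictionaries, just enough to show the three cases.
dictionaries = {
  "en" => Set["the", "house", "is", "red", "no"],
  "es" => Set["la", "casa", "es", "roja", "no"]
}
classifier = TieAwareClassifier.new(dictionaries)
puts classifier.classify(%w[the house is red])
puts classifier.classify(%w[la casa es roja])
puts classifier.classify(%w[no])
```

The word "no" is in both sets, so the last call is a tie and comes back as "unknown". With real aspell dictionaries the same check could sit on top of the rankings that simple_classify already builds.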
