Saturday, January 26, 2008

Guessing a text's language (raspell)

For work I have been importing a fairly large collection of documents and building a search index over them with sphinx to make them easier to find. Today I realized that the collection contains a fairly large number of Spanish documents. I want to maintain two search indexes, one for English and one for Spanish, but many of the documents don't specify their language. I decided a pretty simple way to classify a body of text as one language or the other would be to look its words up in a dictionary for each language. At first I looked into using /usr/share/dict/words, but that file can be in any language depending on the user's OS, and I would have to obtain a copy of the dictionary for each language. A lookup tool probably would not be that difficult to write, but then I found raspell, a nice little wrapper around aspell that makes it super easy to check for a word in nearly any language. Below is my first pass at a simple language classifier. It has one obvious issue: if two dictionaries match the same number of words, there's no way to disambiguate. For now I'm happy with this solution...

require 'rubygems'
require 'raspell'

module Language
  class Classifier
    # languages are aspell dictionary names, e.g. "en_US", "es"
    def initialize( *languages )
      @dics = []
      languages.each do |lang|
        speller = Aspell.new(lang)
        speller.suggestion_mode = Aspell::ULTRA
        @dics << { :check => speller, :lang => lang }
      end
    end

    def likely_language( text )
      words = text.split(' ')
      return "unknown" if words.empty?
      # short texts are classified in a single pass
      return simple_classify( words ) if words.size < 100
      # grab the first 100 words and the last 100 words
      lang1 = simple_classify( words.first(100) )
      lang2 = simple_classify( words.last(100) )
      # only trust the answer when both samples agree
      lang1 == lang2 ? lang1 : "unknown"
    end

    # returns the language whose dictionary recognizes the most words
    def simple_classify( words )
      rankings = @dics.collect do |dic|
        matching = 0
        words.each do |word|
          matching += 1 if dic[:check].check(word)
        end
        [matching, dic[:lang]]
      end
      sorted = rankings.sort do |score1, score2|
        score1.first <=> score2.first
      end
      sorted.last.last
    end

  end
end

=begin
classifier = Language::Classifier.new("en_US","es")
t= Time.now
10.times do
puts classifier.likely_language(File.read("spanish.txt")) .inspect
puts classifier.likely_language(File.read("english.txt")) .inspect
end
puts (Time.now - t)
=end
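
The tie problem mentioned above could be handled by refusing to answer when the best score does not clearly beat the runner-up. What follows is only a sketch under assumptions: WordList is a hypothetical stand-in for an Aspell speller (anything that responds to check), and classify_with_margin and its margin parameter are names made up for this example, not part of raspell.

```ruby
require 'set'

# Hypothetical stand-in for an Aspell speller: anything that responds to check(word).
class WordList
  def initialize(words)
    @words = Set.new(words)
  end

  def check(word)
    @words.include?(word.downcase)
  end
end

# Returns the language whose dictionary matches the most words, or "unknown"
# when the best score does not beat the runner-up by at least `margin` words.
def classify_with_margin(words, dictionaries, margin = 1)
  scores = dictionaries.map do |lang, dic|
    [words.count { |w| dic.check(w) }, lang]
  end
  ranked = scores.sort_by { |score, _lang| -score }
  best, runner_up = ranked[0], ranked[1]
  return "unknown" if best.nil? || best[0] == 0
  return best[1] if runner_up.nil?
  (best[0] - runner_up[0]) >= margin ? best[1] : "unknown"
end

# Tiny made-up vocabularies; "no" exists in both on purpose.
dictionaries = {
  "en" => WordList.new(%w[the house is red no]),
  "es" => WordList.new(%w[la casa es roja no])
}

puts classify_with_margin(%w[the house is red], dictionaries)
puts classify_with_margin(%w[la casa es roja], dictionaries)
puts classify_with_margin(%w[no], dictionaries)
```

A larger margin makes the classifier more conservative, which is probably what you want when the two indexes should never be mixed.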

Sunday, January 06, 2008

mongrel-esi released on rubyforge!

I published the first gem of mongrel-esi tonight.

gem install mongrel_esi


Check out the samples folder for examples of how to configure and set up the server.

ruby 1.9 and valgrind support

This evening I decided to stay in and take a look at valgrind with ruby 1.9. It turns out there is now a build option that makes ruby 1.9 valgrind friendly by using the macros defined in valgrind/memcheck.h:

./configure --with-valgrind --prefix=/home/taf2/project/mongrel-esi/trunk/ruby19-test/ && make


I think I might work on patching ruby 1.8.6 with these macros later to get a better sense for the memory usage with the mongrel-esi parser.

Update:

Evan has the patch here, as well as a great tutorial on how to use it.

Friday, January 04, 2008

[Mongrel ESI] Ragel Parser, and more!


I've been busy this New Year's. I was bitten by a bug to improve mongrel-esi. First, I sat down to finally master ragel. I initially implemented the ragel parser in ruby, then discovered after some performance tests that while it had made performance more stable, it had actually reduced average performance. Once I had the parser written and working in ruby, it really wasn't too difficult to convert it into C, and today I can finally say that the port is complete and all tests are passing again. With the new C ragel implementation I am seeing about a 2x improvement in raw performance. My measurements have been largely based on ab (ApacheBench).


Today I spent time really understanding how mongrel_rails works, and in doing so was able to rework the server's configuration so that it can take a simple ruby script or yaml file. By default, though, all configuration options are passed via the command line. Here's how the configuration works now:



ESI::Config.define(listeners) do |config|

  # define the caching rules globally for all routes, defaults to ruby
  config.cache do |c|
    #c.memcached do |mc|
    #  mc.servers = ['localhost:11211']
    #  mc.debug = false
    #  mc.namespace = 'mesi'
    #  mc.readonly = false
    #end
    c.ttl = 600
  end

  # define rules for when to enable esi processing globally for all routes
  config.esi do |c|
    c.allowed_content_types = ['text/plain', 'text/html']
    #c.enable_for_surrogate_only = true # default is false
  end

  # define request path routing rules
  config.routes do |s|
    #s.match( /content/ ) do |r|
    #  r.servers = ['127.0.0.1:4000']
    #end
    s.default do |r|
      r.servers = ['127.0.0.1:3000']
    end
  end

end
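
For anyone curious how a block-style configuration like this can be built, here is a minimal sketch. It is not mongrel-esi's code: MiniConfig, CacheSettings, RouteSettings and the default TTL of 300 are all hypothetical names and values made up for illustration. The idea is just that each method yields a small settings object for the block to fill in.

```ruby
# Hypothetical sketch of a block-based configuration object; not mongrel-esi's code.
class MiniConfig
  CacheSettings = Struct.new(:ttl)
  RouteSettings = Struct.new(:servers)

  attr_reader :cache_settings, :default_route

  # Build a config, hand it to the block, and return the filled-in result.
  def self.define
    config = new
    yield config
    config
  end

  def initialize
    @cache_settings = CacheSettings.new(300)   # assumed default TTL
    @default_route  = RouteSettings.new([])
  end

  # Each section method yields the object that holds that section's settings.
  def cache
    yield @cache_settings
  end

  def routes
    yield self
  end

  def default
    yield @default_route
  end
end

config = MiniConfig.define do |c|
  c.cache { |cache| cache.ttl = 600 }
  c.routes do |s|
    s.default { |r| r.servers = ['127.0.0.1:3000'] }
  end
end

puts config.cache_settings.ttl
puts config.default_route.servers.inspect
```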


I've been posting new gems to http://mongrel-esi.googlecode.com/files/mongrel_esi-0.4.0.gem

Wednesday, January 02, 2008

valgrind and ruby: developing a ruby c extension

I did a fair bit of work over the holiday on mongrel-esi. As part of that work I reworked the parser in C using ragel. I always try to run my code through valgrind to help catch memory leaks and errors in my pointer arithmetic early.

The ragel parser is callback driven and can accept variable-sized segments of the document. Being able to read in variable-sized chunks was very important, because it means the server can be implemented using asynchronous or multiplexed I/O. The advantage of asynchronous or multiplexed I/O in this case is that while the kernel is waiting on the network, the user app can be busy processing markup and even queuing up more requests. This is really nice, because it means the server is doing multiple tasks concurrently without creating full threads or processes. Getting the parser built to support this variable-sized input was tricky, so I first focused on just the parser component. Ragel really saved me a lot of time once I started to understand how to use it.
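
To show why variable-sized input is tricky, here is a small hand-written sketch in ruby. It is not the ragel parser: ChunkedEsiStripper is a hypothetical class that only strips tags starting with "<esi:", does not report tag names or attributes, and would be confused by a ">" inside an attribute value. The point is the state that has to survive between calls: a chunk can end in the middle of a tag, or even in the middle of the "<esi:" marker itself.

```ruby
# Hypothetical sketch: strips <esi:...> tags from input delivered in arbitrary chunks.
class ChunkedEsiStripper
  MARKER = "<esi:"

  def initialize(&output)
    @output = output          # called with each piece of text that is safe to emit
    @pending = String.new     # data held back because it may belong to an unfinished tag
  end

  # Accept the next chunk; it may end in the middle of a tag.
  def process(chunk)
    @pending << chunk
    loop do
      start = @pending.index(MARKER)
      if start.nil?
        emit_all_but_possible_marker_prefix
        break
      end
      emit(@pending.slice!(0, start))   # plain text before the tag
      stop = @pending.index(">")
      break if stop.nil?                # tag incomplete: wait for more data
      @pending.slice!(0, stop + 1)      # drop the complete tag
    end
  end

  # No more input: flush whatever is still held back.
  def finish
    emit(@pending.slice!(0, @pending.size))
  end

  private

  def emit(text)
    @output.call(text) unless text.empty?
  end

  # Keep back a trailing "<", "<e", "<es" or "<esi" that could start a marker.
  def emit_all_but_possible_marker_prefix
    keep = (MARKER.size - 1).downto(1).find { |n| @pending.end_with?(MARKER[0, n]) } || 0
    emit(@pending.slice!(0, @pending.size - keep))
  end
end

out = String.new
parser = ChunkedEsiStripper.new { |text| out << text }
["some input<es", "i:include src='he", "llo'/>some more", " input"].each do |chunk|
  parser.process(chunk)
end
parser.finish
puts out
```

Because nothing is emitted until it is known not to be part of a tag, the caller can feed the parser whatever the socket happens to deliver.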

Here's the ESI C Parser API I came up with:

/* create a new Edge Side Include Parser */
ESIParser *esi_parser_new();
void esi_parser_free( ESIParser *parser );
/* initialize the parser */
int esi_parser_init( ESIParser *parser );
/*
* send a chunk of data to the parser, the internal parser state is returned
*/
int esi_parser_execute( ESIParser *parser, const char *data, size_t length );
/*
* let the parser know that it has reached the end and that it should flush any remaining data to the desired output device
*/
int esi_parser_finish( ESIParser *parser );
/*
* setup a callback to execute when a new esi: start tag is encountered
* this will fire for all block tags e.g. <esi:try>, <esi:attempt> and also
* inline tags <esi:inline src='cache-key'/> <esi:include src='dest'/>
*/
void esi_parser_start_tag_handler( ESIParser *parser, start_tag_cb callback );
void esi_parser_end_tag_handler( ESIParser *parser, end_tag_cb callback );
/* setup a callback to receive data ready for output */
void esi_parser_output_handler( ESIParser *parser, output_cb output_handler );


I developed a fairly simple set of tests to verify the accuracy of the implementation. Using valgrind with the --leak-check=full option I was able to measure the number of memory allocations and verify that no memory was lost.


valgrind --leak-check=full ./testit


Once I was satisfied that the parser core was working, I started to implement the Ruby binding. I started by following this tutorial as well as referring to many other documents and sources.
One of the first things I decided to verify with my ruby binding was whether I was leaking any memory in gluing my C implementation to the Ruby runtime. As with the pure C implementation, I decided to run my extension through valgrind.

valgrind --leak-check=full ruby test1.rb

My initial test was this:

require 'esi'

output = ""
p = ESI::CParser.new

p.start_tag_handler do|tag_name, attrs|
puts "Start: #{tag_name} #{attrs.inspect}"
end

p.end_tag_handler do|tag_name|
puts "End: #{tag_name}"
end

p.output_handler do|data|
output << data
end

p.process "<html><head><body><esi:include timeout='1' max-age='600+600' src=\"hello\"/>some more input"
p.process "some input<esi:include \nsrc='hello'/>some more input\nsome input<esi:include src=\"hello\"/>some more input"
p.process "some input<esi:inline src='hello'/>some more input\nsome input<esi:comment text='hello'/>some more input"
p.process "<p>some input</p><esi:include src='hello'/>some more input\nsome input<esi:include src='hello'/>some more input"
p.process "</body></html>"
p.finish

expected = %Q(<html><head><body>some more inputsome inputsome more input
some inputsome more inputsome inputsome more input
some inputsome more input<p>some input</p>some more input
some inputsome more input</body></html>)

if( expected != output )
puts "Failed output was different from the expected"
puts "Expected: #{expected}"
puts "\n"
puts "Actual: #{output}"
exit(1)
end
GC.start


This is really a pretty simple test that just ensures the callbacks are all working and that the data emitted by the parser excludes any esi tags.

The results I got from running this through valgrind, however, were very disturbing. Not only does valgrind report leaked memory at the end, it also reports nearly 4211 errors along the way.
The majority of these errors are "Use of uninitialised value of size 4" and "Conditional jump or move depends on uninitialised value(s)".

I finally decided to figure out what was causing this. The first step was to get ruby built with debugging symbols enabled. I downloaded the latest stable CVS snapshot, feeling optimistic that I might spot something and be able to send in a patch.


CFLAGS=-g ./configure --prefix=$HOME/project/ruby-stable && make && make install



Rerunning my ruby script through valgrind:

valgrind --leak-check=full --num-callers=24 ~/project/ruby-stable/bin/ruby test1.rb


Now the first error I see reported from valgrind looks like this:

==9911== Conditional jump or move depends on uninitialised value(s)
==9911== at 0x807305F: is_pointer_to_heap (gc.c:609)
==9911== by 0x8073023: mark_locations_array (gc.c:629)
==9911== by 0x80743B7: garbage_collect (gc.c:1367)
==9911== by 0x8074467: rb_gc (gc.c:1423)
==9911== by 0x8074479: rb_gc_start (gc.c:1440)
==9911== by 0x805F95B: call_cfunc (eval.c:5704)
==9911== by 0x805EEAF: rb_call0 (eval.c:5857)
==9911== by 0x8060461: rb_call (eval.c:6104)
==9911== by 0x8059278: rb_eval (eval.c:3482)
==9911== by 0x805467F: eval_node (eval.c:1434)
==9911== by 0x8054C61: ruby_exec_internal (eval.c:1640)
==9911== by 0x8054CA5: ruby_exec (eval.c:1660)
==9911== by 0x8054CC7: ruby_run (eval.c:1670)
==9911== by 0x8052B2D: main (main.c:48)


This takes me to the function is_pointer_to_heap in gc.c.

static inline int
is_pointer_to_heap(ptr)
    void *ptr;
{
    register RVALUE *p = RANY(ptr);
    register RVALUE *heap_org;
    register long i;

    if (p < lomem || p > himem) return Qfalse;
    if ((VALUE)p % sizeof(RVALUE) != 0) return Qfalse;

    /* check if p looks like a pointer */
    for (i=0; i < heaps_used; i++) {
        heap_org = heaps[i].slot;
        if (heap_org <= p && p < heap_org + heaps[i].limit)
            return Qtrue;
    }
    return Qfalse;
}

It should be pretty obvious from looking at that code why valgrind would report "Conditional jump or move depends on uninitialised value(s)". The first condition, p < lomem || p > himem, tests whether the address is really within the heap allocated by ruby, by comparing p to the lower and upper heap addresses. I am not certain, but fairly sure that lomem and himem are the lower and upper bounds of the memory ruby has allocated for its heaps. This would mean it's safe to test p in this context. I still have the question and concern of why p would be uninitialized in the first place...
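
To make the check easier to reason about, here is a toy model of it in ruby. Everything in it is an assumption for illustration: addresses are plain integers, and SLOT_SIZE, the Heap struct and the sample heap layout are made up rather than taken from gc.c. One plausible explanation for the uninitialised values, which I have not confirmed against this build, is that mark_locations_array walks a range of raw memory words (such as the C stack) and hands each one to this check, so some of those words may simply never have been written.

```ruby
# Illustrative model only: SLOT_SIZE and the heap layout are made up for this example.
SLOT_SIZE = 20
Heap = Struct.new(:start, :slots)

# Mirrors the three steps of the C function: range check, alignment check,
# then a search for a heap that actually contains the address.
def pointer_to_heap?(addr, heaps)
  lomem = heaps.map { |h| h.start }.min
  himem = heaps.map { |h| h.start + h.slots * SLOT_SIZE }.max
  return false if addr < lomem || addr > himem   # cheap range rejection
  return false unless addr % SLOT_SIZE == 0      # must be slot aligned
  heaps.any? { |h| h.start <= addr && addr < h.start + h.slots * SLOT_SIZE }
end

heaps = [Heap.new(1000, 10), Heap.new(4000, 10)]

# Arbitrary words, as a scanner over raw memory might see them.
[50, 1040, 1041, 2000, 4200].each do |word|
  puts "#{word}: #{pointer_to_heap?(word, heaps)}"
end
```

In this model any integer can be tested safely, because the value is only compared and never dereferenced, which matches the reasoning above about why the check itself is harmless.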

There are more errors being reported besides this and I hope to follow up with those next.
