Wednesday, May 28, 2008

twitter meets amazon

Min put together this awesome little gem of a site: http://www.twizon.com/

Check it out:

Wednesday, May 14, 2008

Ruby Tag Library Word::Tagger

I needed a very simple tagger to extract words of interest from a corpus of medical documents we have on revolutionhealth.com.

For this I wrote Word::Tagger, which is included in the rbtagger gem on RubyForge.

sudo gem install rbtagger


Word::Tagger expects a master list of tags and a set window size. When executed it stems the words in the document and slides a window over the document comparing the stemmed terms in the tag list against the words within the document. Visually, the initial matching algorithm works like this:







A maximum number of matches can be given, causing the tagger to reduce the number of tags by frequency of occurrence.
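To make the window matching concrete, here is a minimal sketch of the idea. It is not the rbtagger source: the WindowTagger class, its method names, and the naive_stem helper (a crude stand-in for a real stemmer) are all assumptions made for illustration.

```ruby
# Illustrative sketch only; not the rbtagger implementation.
class WindowTagger
  # tags:  master list of tags, e.g. ['Cat', 'heart attack']
  # words: maximum window size, in words
  def initialize(tags, words: 4)
    @words = words
    # Map each stemmed tag phrase back to the tag as originally given.
    @index = {}
    tags.each { |tag| @index[stem_phrase(tag)] = tag }
  end

  # Returns matched tags, most frequent first, optionally capped at max.
  def execute(text, max = nil)
    stems = text.scan(/[[:alnum:]]+/).map { |w| naive_stem(w) }
    counts = Hash.new(0)
    stems.each_index do |start|
      # Try every phrase length from 1 up to the window size at this position.
      1.upto(@words) do |len|
        break if start + len > stems.length
        tag = @index[stems[start, len].join(' ')]
        counts[tag] += 1 if tag
      end
    end
    ranked = counts.sort_by { |tag, n| [-n, tag.downcase] }.map(&:first)
    max ? ranked.first(max) : ranked
  end

  private

  def stem_phrase(phrase)
    phrase.scan(/[[:alnum:]]+/).map { |w| naive_stem(w) }.join(' ')
  end

  # Stand-in for a real stemmer: lowercase, and drop a trailing "s"
  # from words longer than three letters.
  def naive_stem(word)
    w = word.downcase
    w.length > 3 ? w.sub(/s\z/, '') : w
  end
end

tagger = WindowTagger.new(['Cat', 'hat'], words: 4)
p tagger.execute('the cAt and the hat')
```

Passing a second argument, as in tagger.execute(text, 1), keeps only the most frequent tag, which is roughly what the maximum number of matches does in the description above.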

Using the tagger is easy:

tagger = Word::Tagger.new( ['Cat','hat'], :words => 4 )
tags = tagger.execute( 'the cAt and the hat' )
#assert_equal( ["Cat", "hat"], tags )


I also include a part-of-speech tagger based on Eric Brill's tagger and the Perl module Lingua::BrillTagger, written by Ken Williams.

This tagger may eventually be used to further improve the word tagger. A few ideas come to mind, such as only selecting the words included in noun phrases, or using the part-of-speech tags to reduce the number of matched terms for larger documents.
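As a rough sketch of the noun-only idea: assume the part-of-speech tagger hands back [word, tag] pairs with Penn Treebank style tags, where noun tags start with NN. The pairs below are hand-written sample data, not real tagger output, and noun_words is a hypothetical helper.

```ruby
# Hypothetical tagged output: [word, part-of-speech tag] pairs.
# Hand-written for illustration; not produced by a real tagger run.
TAGGED = [
  ['the', 'DT'], ['patient', 'NN'], ['reported', 'VBD'],
  ['severe', 'JJ'], ['chest', 'NN'], ['pain', 'NN']
].freeze

# Keep only the words whose tag marks them as a noun (NN, NNS, NNP, NNPS).
def noun_words(tagged)
  tagged.select { |_word, pos| pos.start_with?('NN') }.map(&:first)
end

p noun_words(TAGGED)
```

Only the surviving words would then be handed to the word tagger, which should cut down on matches in larger documents.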

ruby extension memory leak tracking

Using valgrind and a nice patch for ruby 1.8.6, I now have rb-brill-tagger leak free.

Here's the process I went through. First off, you need a Linux environment to run valgrind. If you don't already have one set up, I recommend Fedora Core. It's super easy to run and has a pretty good track record with hardware. Also, yum makes it super easy to install new software.

First I svn co http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8_6

Then get the patch:



wget http://fauna.rubyforge.org/svn/bleak_house/trunk/ruby/valgrind.patch


Update: the patch is now available here.


Apply the patch:

patch -p0 < valgrind.patch


Build ruby:

autoconf && ./configure --prefix=$HOME/work/ruby-valgrind && make && make install


Set up the new ruby environment:

export PATH=$HOME/work/ruby-valgrind/bin:$PATH


Verify you have the correct ruby:

which ruby


Install rubygems:

wget http://rubyforge.org/frs/download.php/35283/rubygems-1.1.1.tgz
tar -zxf rubygems-1.1.1.tgz
cd rubygems-1.1.1
ruby setup.rb install


Verify the rubygems install:

which gem


Install Rake:

gem install rake


Checking out rb-brill-tagger:

git clone git://github.com/taf2/rb-brill-tagger.git
cd rb-brill-tagger
rake


Running valgrind:

valgrind --leak-check=full ruby test/tagger_test.rb


Valgrind will take much longer to run, but once the process has finished you should get some output similar to this:


rb-brill-tagger> valgrind --leak-check=full ruby test/tagger_test.rb
==17160== Memcheck, a memory error detector.
==17160== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==17160== Using LibVEX rev 1804, a library for dynamic binary translation.
==17160== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==17160== Using valgrind-3.3.0, a dynamic binary instrumentation framework.
==17160== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==17160== For more details, rerun with: -v
==17160==
loading tagger...
tagger loaded!
Loaded suite test/tagger_test
Started
time: 75.484142 sec 0.132478156802789 docs/sec
..
Finished in 76.943314 seconds.

2 tests, 1 assertions, 0 failures, 0 errors
==17160==
==17160== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 25 from 1)
==17160== malloc/free: in use at exit: 9,666,639 bytes in 176,968 blocks.
==17160== malloc/free: 1,897,877 allocs, 1,720,909 frees, 48,816,043 bytes allocated.
==17160== For counts of detected errors, rerun with: -v
==17160== searching for pointers to 176,968 not-freed blocks.
==17160== checked 7,364,900 bytes.
==17160==
==17160== 16 bytes in 1 blocks are definitely lost in loss record 1 of 17
==17160== at 0x4022828: malloc (vg_replace_malloc.c:207)
==17160== by 0x8070851: ruby_xmalloc (gc.c:114)
==17160== by 0x808C95B: local_append (parse.y:5640)
==17160== by 0x808CC8F: special_local_set (parse.y:6228)
==17160== by 0x80A4E58: rb_reg_search (re.c:946)
==17160== by 0x80B8B22: rb_str_split_m (string.c:3559)
==17160== by 0x8055885: call_cfunc (eval.c:5700)
==17160== by 0x805E0D1: rb_call0 (eval.c:5856)
==17160== by 0x805ECC0: rb_call (eval.c:6103)
==17160== by 0x805C733: rb_eval (eval.c:3479)
==17160== by 0x805C65E: rb_eval (eval.c:3473)
==17160== by 0x805AE4C: rb_eval (eval.c:3689)
==17160==
==17160==
==17160== 26 bytes in 9 blocks are definitely lost in loss record 2 of 17
==17160== at 0x4022828: malloc (vg_replace_malloc.c:207)
==17160== by 0x40FA1BF: strdup (in /lib/libc-2.6.so)
==17160== by 0x402BD41: tagger_context_add_to_lexicon (tagger.c:71)
==17160== by 0x402BBC8: BrillTagger_add_to_lexicon (rbtagger.c:26)
==17160== by 0x8055859: call_cfunc (eval.c:5709)
==17160== by 0x805E0D1: rb_call0 (eval.c:5856)
==17160== by 0x805ECC0: rb_call (eval.c:6103)
==17160== by 0x805C733: rb_eval (eval.c:3479)
==17160== by 0x805C8E6: rb_eval (eval.c:3133)
==17160== by 0x805E8BE: rb_call0 (eval.c:6007)
==17160== by 0x805ECC0: rb_call (eval.c:6103)
==17160== by 0x805C733: rb_eval (eval.c:3479)
==17160==
==17160==
==17160== 896 bytes in 30 blocks are possibly lost in loss record 13 of 17
==17160== at 0x4022828: malloc (vg_replace_malloc.c:207)
==17160== by 0x8070851: ruby_xmalloc (gc.c:114)
==17160== by 0x8054725: scope_dup (eval.c:8211)
==17160== by 0x805A07F: rb_yield_0 (eval.c:5078)
==17160== by 0x806353E: proc_invoke (eval.c:8622)
==17160== by 0x805E0D1: rb_call0 (eval.c:5856)
==17160== by 0x805ECC0: rb_call (eval.c:6103)
==17160== by 0x805C733: rb_eval (eval.c:3479)
==17160== by 0x8059E97: rb_yield_0 (eval.c:5027)
==17160== by 0x805A720: rb_yield (eval.c:5111)
==17160== by 0x80C6674: rb_ary_each (array.c:1138)
==17160== by 0x805E0D1: rb_call0 (eval.c:5856)
==17160==
==17160== LEAK SUMMARY:
==17160== definitely lost: 42 bytes in 10 blocks.
==17160== possibly lost: 896 bytes in 30 blocks.
==17160== still reachable: 9,665,701 bytes in 176,928 blocks.
==17160== suppressed: 0 bytes in 0 blocks.
==17160== Reachable blocks (those to which a pointer was found) are not shown.
==17160== To see them, rerun with: --leak-check=full --show-reachable=yes
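
If you save the valgrind output to a file (valgrind's --log-file option does this), a few lines of Ruby can pull the leak summary out so it can be compared between runs. This is only a sketch: the leak_summary helper and the regular expression are my own assumptions, based on the valgrind 3.3.0 output format shown above.

```ruby
# Matches LEAK SUMMARY lines such as:
#   ==17160== definitely lost: 42 bytes in 10 blocks.
LEAK_LINE = /^==\d+==\s+(definitely lost|possibly lost|still reachable|suppressed):\s+([\d,]+) bytes in ([\d,]+) blocks/

# Returns a hash like { 'definitely lost' => { bytes: 42, blocks: 10 }, ... }
def leak_summary(log)
  summary = {}
  log.each_line do |line|
    next unless (m = LEAK_LINE.match(line))
    summary[m[1]] = {
      bytes: m[2].delete(',').to_i,   # strip thousands separators
      blocks: m[3].delete(',').to_i
    }
  end
  summary
end

sample = <<~LOG
  ==17160== LEAK SUMMARY:
  ==17160== definitely lost: 42 bytes in 10 blocks.
  ==17160== possibly lost: 896 bytes in 30 blocks.
  ==17160== still reachable: 9,665,701 bytes in 176,928 blocks.
  ==17160== suppressed: 0 bytes in 0 blocks.
LOG

leak_summary(sample).each do |kind, counts|
  puts "#{kind}: #{counts[:bytes]} bytes in #{counts[:blocks]} blocks"
end
```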

Friday, May 09, 2008

quote of the day


"Corporate efficiency can be best described by having 13 managers in a room trying to figure out how to password protect 3 pages..."



And if that doesn't do it for you try this out for size:


Wednesday, May 07, 2008

I say cap. You say rpm. I cap your rpm!

First things first, we need to understand something about how RPMs work. When installing an RPM you are root. Your RPM is root. Sometimes you'll have a new application that you've set up using capistrano scripts. You try to explain to your hosting company the benefits of capistrano, but they refuse to listen... They leave you no choice. You don't have time to rethink all your installation scripts, nor would it be practical to maintain two different sets of deployment code. Here's a pretty simple workaround, making use of capistrano's recent introduction of a local cache and some simple shell scripts to toggle sshd configurations.

rootkit.spec

Summary: Test rooting
Name: rootkit
Version: 1
Release: 1
Source0: %{name}-%{version}.tar.gz
Group: root
License: MIT
Buildroot: %{_tmppath}/%{name}
ExclusiveArch: i386 x86_64
Requires: bash
AutoReqProv: no
%description
test logging in as root temporarily to install
%prep
%setup -q
%build
mkdir -p $RPM_BUILD_ROOT/tmp/root_test
cp -r ./* $RPM_BUILD_ROOT/tmp/root_test
%install
%clean
%post
sudo su -
whoami

rollback() {
cp /etc/ssh/sshd_config.bak /etc/ssh/sshd_config
cp /etc/ssh/ssh_config.bak /etc/ssh/ssh_config
/sbin/service sshd restart
if [ -e /root/.ssh/authorized_keys2.bak ]; then
cp /root/.ssh/authorized_keys2.bak /root/.ssh/authorized_keys2
else
rm /root/.ssh/authorized_keys2
fi
rm /root/.ssh/tmp_root_keys
rm /root/.ssh/tmp_root_keys.pub
}

# backup all configs
cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
cp /etc/ssh/ssh_config /etc/ssh/ssh_config.bak

# only backup this key file if it exists
if [ -e /root/.ssh/authorized_keys2 ]; then
echo "Backup /root/.ssh/authorized_keys2"
cp /root/.ssh/authorized_keys2 /root/.ssh/authorized_keys2.bak
fi

# setup a rollback hook to reset our ssh mods
trap rollback INT TERM EXIT

# now setup our stricter yet looser ssh configs
cat /tmp/root_test/temp_sshd_config > /etc/ssh/sshd_config
cat /tmp/root_test/temp_ssh_config > /etc/ssh/ssh_config

# restart the sshd
/sbin/service sshd restart

# setup ssh keys to use for connecting
mkdir -p /root/.ssh/
chmod 700 /root/.ssh
ssh-keygen -t dsa -P '' -f /root/.ssh/tmp_root_keys

cat /root/.ssh/tmp_root_keys.pub >> /root/.ssh/authorized_keys2
chmod 600 /root/.ssh/*

# run our test
ssh -o "VerifyHostKeyDNS ask" -i /root/.ssh/tmp_root_keys localhost -C "ruby /tmp/root_test/test.rb"

# trap should execute our rollback here
%files
%defattr(-,root,root)
/tmp/root_test/test.rb
/tmp/root_test/rootkit.spec
/tmp/root_test/README
/tmp/root_test/temp_sshd_config
/tmp/root_test/temp_ssh_config

Now there is some trickery going on here with the ssh keys.

This is handled in the temporary ssh_config:

Host *
StrictHostKeyChecking no
VerifyHostKeyDNS no
GSSAPIAuthentication yes
ForwardX11Trusted no
SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
SendEnv LC_IDENTIFICATION LC_ALL


Next, we swap in our temporary sshd_config to make sure private/public key authentication is enabled just the way we like.

Protocol 2
PubkeyAuthentication yes
ChallengeResponseAuthentication no
GSSAPIAuthentication yes
GSSAPICleanupCredentials yes
UsePAM yes
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
AcceptEnv LC_IDENTIFICATION LC_ALL
X11Forwarding no
Subsystem sftp /usr/libexec/openssh/sftp-server


Finally, we restore the system to its original defaults. So, if you ever need to cap deploy within an RPM, hopefully this solution will help you out... enjoy! And yeah yeah, it's not a rootkit, it's just fun to think it is because it's running as root.
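
The back up, swap, restore dance in the %post script can also be sketched in Ruby, with ensure playing the role of the shell trap. This is a sketch of the pattern only: with_swapped_config is a hypothetical helper, and the demo runs against throwaway files in a temp directory rather than the real /etc/ssh.

```ruby
require 'fileutils'
require 'tmpdir'

# Copy `temporary` over `live` for the duration of the block, then put the
# original back, even if the block raises.
def with_swapped_config(live, temporary)
  backup = "#{live}.bak"
  FileUtils.cp(live, backup)
  begin
    FileUtils.cp(temporary, live)
    yield
  ensure
    # Plays the same role as `trap rollback INT TERM EXIT` in the spec.
    FileUtils.mv(backup, live)
  end
end

# Demo against scratch files, not the real ssh configuration.
Dir.mktmpdir do |dir|
  live = File.join(dir, 'sshd_config')
  temp = File.join(dir, 'temp_sshd_config')
  File.write(live, "PubkeyAuthentication no\n")
  File.write(temp, "PubkeyAuthentication yes\n")

  with_swapped_config(live, temp) { puts "during: #{File.read(live)}" }
  puts "after:  #{File.read(live)}"
end
```

On a real box you would still need to restart sshd after each swap, as the spec does with /sbin/service sshd restart.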

nginx rpms

Well now isn't this nice...

nginx.spec

Summary: nginx 'engine x' is a HTTP server and mail proxy server
Name: nginx
Version: 0.6.30
Release: 1
Source0: %{name}-%{version}.tar.gz
License: MIT
Group: Applications/Internet
Buildroot: %{_tmppath}/%{name}-%{version}-root
Requires: bash
%description
nginx has been running for more than three years on many heavily loaded Russian sites including Rambler (RamblerMedia.com).
In March 2007 about 20% of all Russian virtual hosts were served or proxied by nginx.
According to Google Online Security Blog year ago nginx served or proxied about 4% of all Internet virtual hosts, although Netcraft showed much less percent.
According to Netcraft in March 2008 nginx served or proxied 1 million virtual hosts.
%prep
%setup -q
%build
./configure --prefix=/opt/local/
make
%install
rm -rf $RPM_BUILD_ROOT/opt/local/
make DESTDIR=$RPM_BUILD_ROOT install
mkdir -p $RPM_BUILD_ROOT/opt/local/conf/vhosts
touch $RPM_BUILD_ROOT/opt/local/conf/vhosts/blank.conf
%clean
rm -rf $RPM_BUILD_ROOT
%files
%defattr(-,root,root)
/opt/local/sbin/nginx
/opt/local/logs
%doc /opt/local/html
%doc /opt/local/conf

auto/install


# Copyright (C) Igor Sysoev


if [ $USE_PERL = YES ]; then

cat << END >> $NGX_MAKEFILE

install_perl_modules:
cd $NGX_OBJS/src/http/modules/perl && make install
END

NGX_INSTALL_PERL_MODULES=install_perl_modules

fi


cat << END >> $NGX_MAKEFILE

install: $NGX_OBJS${ngx_dirsep}nginx${ngx_binext} \
$NGX_INSTALL_PERL_MODULES
test -d '\$(DESTDIR)$NGX_PREFIX' || mkdir -p '\$(DESTDIR)$NGX_PREFIX'

test -d '\$(DESTDIR)`dirname "$NGX_SBIN_PATH"`' \
|| mkdir -p '\$(DESTDIR)`dirname "$NGX_SBIN_PATH"`'
test ! -f '\$(DESTDIR)$NGX_SBIN_PATH' || mv '\$(DESTDIR)$NGX_SBIN_PATH' '\$(DESTDIR)$NGX_SBIN_PATH.old'
cp $NGX_OBJS/nginx '\$(DESTDIR)$NGX_SBIN_PATH'

test -d '\$(DESTDIR)$NGX_CONF_PREFIX' || mkdir -p '\$(DESTDIR)$NGX_CONF_PREFIX'

cp conf/koi-win '\$(DESTDIR)$NGX_CONF_PREFIX'
cp conf/koi-utf '\$(DESTDIR)$NGX_CONF_PREFIX'
cp conf/win-utf '\$(DESTDIR)$NGX_CONF_PREFIX'

test -f '\$(DESTDIR)$NGX_CONF_PREFIX/mime.types' \
|| cp conf/mime.types '\$(DESTDIR)$NGX_CONF_PREFIX'
cp conf/mime.types '\$(DESTDIR)$NGX_CONF_PREFIX/mime.types.default'

test -f '\$(DESTDIR)$NGX_CONF_PREFIX/fastcgi_params' \
|| cp conf/fastcgi_params '\$(DESTDIR)$NGX_CONF_PREFIX'
cp conf/fastcgi_params '\$(DESTDIR)$NGX_CONF_PREFIX/fastcgi_params.default'

test -f '\$(DESTDIR)$NGX_CONF_PATH' || cp conf/nginx.conf '\$(DESTDIR)$NGX_CONF_PREFIX'
cp conf/nginx.conf '\$(DESTDIR)$NGX_CONF_PREFIX/nginx.conf.default'

test -d '\$(DESTDIR)`dirname "$NGX_PID_PATH"`' \
|| mkdir -p '\$(DESTDIR)`dirname "$NGX_PID_PATH"`'

test -d '\$(DESTDIR)`dirname "$NGX_HTTP_LOG_PATH"`' || \
mkdir -p '\$(DESTDIR)`dirname "$NGX_HTTP_LOG_PATH"`'

test -d '\$(DESTDIR)$NGX_PREFIX/html' || cp -r html '\$(DESTDIR)$NGX_PREFIX'
END

if test -n "$NGX_ERROR_LOG_PATH"; then
cat << END >> $NGX_MAKEFILE

test -d '\$(DESTDIR)`dirname "$NGX_ERROR_LOG_PATH"`' || \
mkdir -p '\$(DESTDIR)`dirname "$NGX_ERROR_LOG_PATH"`'
END

fi

Sunday, May 04, 2008

evdispatch improved packaging and build

I cleaned up the configuration and build scripts in evdispatch. It now uses the libev embedding macros, which makes the build process much faster and more reliable.
