[plt-scheme] Writing aho-corasick recognizer in mzscheme: newbie profiling help?

From: Danny Yoo (dyoo at hkn.eecs.berkeley.edu)
Date: Tue Aug 2 03:38:21 EDT 2005

Hi everyone,

I've been trying to write a nice Aho-Corasick automaton builder for fun,
in pure mzscheme.  For people who are interested, I'm putting my work here
temporarily:

    http://hkn.eecs.berkeley.edu/~dyoo/tmp/ahocorasick.tar.gz

For the most part, I'm happy with it, if not for the fact that the
construction is much slower than I expected.  Concretely, an automaton
construction that takes about ten seconds in one of my equivalent C
libraries takes about a minute in my mzscheme implementation, so I must be
screwing up badly somewhere.  *grin*

(I'm comparing against a Python/C implementation I had worked on
previously at: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/)

Anyway, I do see that the algorithm scales linearly as I send it more
terms, as expected, so the contruction itself is fine as far as big-O is
concerned.  But there's a honking huge time constant that I'm trying to
hunt.

Is a factor of six versus C code something that other people have seen?
If so, I'll should probably stop hunting, and just live with it.  But if I
have a good chance to cut that margin down, I'd love to, just to get a
better experience in performance tuning a mzscheme program.

I have run errortrace.ss's profiler to figure out what's going on, and I
see the following:

#################################################################
dyoo$ mzscheme -r test-with-profiling.ss

[omitting most of the head of the output...]

========================================================
time = 2138 : no. = 18095 : loop in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/ahocorasick.ss:100:6>
=========================================================
time = 3192 : no. = 97406 : #f in #<syntax>
=========================================================
time = 4372 : no. = 50762 : loop in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/state.ss:111:4>
=========================================================
time = 4727 : no. = 5001 : #f in #<syntax>
=========================================================
time = 5035 : no. = 5002 : loop in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/test-with-profiling.ss:13:6>
=========================================================
time = 8608 : no. = 16779 : loop in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/state.ss:124:4>
=========================================================
time = 8701 : no. = 1 : prepare in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/ahocorasick.ss:79:2>
=========================================================
time = 13776 : no. = 1 : #f in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/test-with-profiling.ss:12:4>
Total samples: 58437
#################################################################


Much of this makes sense to me --- it's mostly graph construction and
access --- but I'm also trying to puzzle out what lines like:

    time = 4727 : no. = 5001 : #f in #<syntax>

mean.


Any suggestions would be greatly welcome.  Thanks!



Posted on the users mailing list.