[plt-scheme] Writing aho-corasick recognizer in mzscheme: newbie profiling help?
Hi everyone,
I've been trying to write a nice Aho-Corasick automaton builder for fun,
in pure mzscheme. For people who are interested, I'm putting my work here
temporarily:
http://hkn.eecs.berkeley.edu/~dyoo/tmp/ahocorasick.tar.gz
For the most part, I'm happy with it, if not for the fact that the
construction is much slower than I expected. Concretely, an automaton
construction that takes about ten seconds in one of my equivalent C
libraries takes about a minute in my mzscheme implementation, so I must be
screwing up badly somewhere. *grin*
(I'm comparing against a Python/C implementation I had worked on
previously at: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/)
Anyway, I do see that the algorithm scales linearly as I send it more
terms, as expected, so the contruction itself is fine as far as big-O is
concerned. But there's a honking huge time constant that I'm trying to
hunt.
Is a factor of six versus C code something that other people have seen?
If so, I'll should probably stop hunting, and just live with it. But if I
have a good chance to cut that margin down, I'd love to, just to get a
better experience in performance tuning a mzscheme program.
I have run errortrace.ss's profiler to figure out what's going on, and I
see the following:
#################################################################
dyoo$ mzscheme -r test-with-profiling.ss
[omitting most of the head of the output...]
========================================================
time = 2138 : no. = 18095 : loop in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/ahocorasick.ss:100:6>
=========================================================
time = 3192 : no. = 97406 : #f in #<syntax>
=========================================================
time = 4372 : no. = 50762 : loop in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/state.ss:111:4>
=========================================================
time = 4727 : no. = 5001 : #f in #<syntax>
=========================================================
time = 5035 : no. = 5002 : loop in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/test-with-profiling.ss:13:6>
=========================================================
time = 8608 : no. = 16779 : loop in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/state.ss:124:4>
=========================================================
time = 8701 : no. = 1 : prepare in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/ahocorasick.ss:79:2>
=========================================================
time = 13776 : no. = 1 : #f in
#<syntax:/Users/dyoo/work/aztec/scratch/plt/ahocorasick/test-with-profiling.ss:12:4>
Total samples: 58437
#################################################################
Much of this makes sense to me --- it's mostly graph construction and
access --- but I'm also trying to puzzle out what lines like:
time = 4727 : no. = 5001 : #f in #<syntax>
mean.
Any suggestions would be greatly welcome. Thanks!