[plt-scheme] Compression dictionary

From: Eli Barzilay (eli at barzilay.org)
Date: Mon Oct 5 13:49:33 EDT 2009

On Oct  5, Jens Axel Søgaard wrote:
> 2009/10/5 Eli Barzilay <eli at barzilay.org>:
> > On Oct  5, Jens Axel Søgaard wrote:
> >> 2009/10/5 Eli Barzilay <eli at barzilay.org>:
> >> > Does anyone know of a good and simple method for building a dictionary
> >> > for compression?
> >>
> >> > Explanation: The documentation index file was ~2mb initially, and now
> >> > it's up to 3mb.  In addition, some things I did for compression make
> >> > loading it slower (like nested arrays which are used like Sexprs) so
> >> > I'm revising the whole thing.
> >>
> >> > Example:
> >>
> >> > "foo_bar"
> >> >  "meh_blah_foo_blah"
> >>
> >> I understand the tokens are "foo", "bar", "meh", and, "blah".
> >
> > Well, I'm working with just raw strings -- trying to get meaningful
> > tokens is going down "regexp-hell"...  So in that example I had
> > "_blah" as a token in one example, and "foo_" in the other.
> 
> Okay, so which tokens the algorithm actually uses is not as important
> as fast decoding.
> 
> Is it possible to make a back-of-the-envelope calculation with
> respect to compression rate, download time, and decoding time?

Heh -- you could -- but the critical point here is (I think) not
the download time, but the time it takes the browser to parse the 3mb
file and execute it to get the vector of data into memory.
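
A rough way to see why parse time can dominate is to split the delay
into a transfer part and a parse part.  The sketch below is only an
illustration: the 3mb figure comes from this thread, while the
bandwidth, the parse rate, and the transfer-compression fraction are
made-up placeholders, not measurements.

```python
# Back-of-the-envelope sketch.  Only INDEX_BYTES (~3mb) comes from the thread;
# every rate and fraction below is a hypothetical placeholder.

INDEX_BYTES = 3_000_000           # approximate size of the JS index file


def load_delay(index_bytes, wire_fraction, net_bytes_per_sec, parse_bytes_per_sec):
    """Return (download_seconds, parse_seconds) for one load of the index.

    wire_fraction is the share of the file that actually crosses the network
    (1.0 for no transport compression).  The browser still has to parse and
    execute the full, uncompressed source either way.
    """
    download = index_bytes * wire_fraction / net_bytes_per_sec
    parse = index_bytes / parse_bytes_per_sec
    return download, parse


ASSUMED_NET_RATE = 1_000_000      # bytes/s, hypothetical
ASSUMED_PARSE_RATE = 2_000_000    # bytes/s of JS source, hypothetical
ASSUMED_WIRE_FRACTION = 0.1       # hypothetical effect of transport compression

plain = load_delay(INDEX_BYTES, 1.0, ASSUMED_NET_RATE, ASSUMED_PARSE_RATE)
compressed = load_delay(INDEX_BYTES, ASSUMED_WIRE_FRACTION,
                        ASSUMED_NET_RATE, ASSUMED_PARSE_RATE)

print("plain:      download %.2fs, parse %.2fs" % plain)
print("compressed: download %.2fs, parse %.2fs" % compressed)
```

Whatever numbers are plugged in, transport compression only shrinks
the download term; the parse term depends on the size of the source
the browser has to execute.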


> Just to get a feeling of the sizes involved:
> 
> jasmacair:tmp jensaxelsoegaard$ ls -las index.html
> 7960 -rw-r--r--  1 jensaxelsoegaard  wheel  4071868  4 Okt 19:33 index.html
> jasmacair:tmp jensaxelsoegaard$ gzip index.html
> jasmacair:tmp jensaxelsoegaard$ ls -las index.html.gz
> 648 -rw-r--r--  1 jensaxelsoegaard  wheel  330511  4 Okt 19:33 index.html.gz
> 
> The original file size is 4071868 bytes and a gzipped version is
> only 330511. The gzipped version is thus only 8% of the original.
> 
> Question: Does the PLT web server support on-the-fly gzip
> compression?

I don't think so, but that's a side-issue, since I'm dealing with
static files that are being served through Apache (as is
docs.plt-scheme.org).
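
For reference, the two byte counts from the quoted ls output can be
checked with a few lines of arithmetic.  This only restates the
numbers above; it says nothing about how the JS index itself would
compress.

```python
# Sizes copied from the quoted `ls -las` output for index.html and index.html.gz.
ORIGINAL_BYTES = 4071868
GZIPPED_BYTES = 330511

ratio = GZIPPED_BYTES / ORIGINAL_BYTES    # fraction of the original size
factor = ORIGINAL_BYTES / GZIPPED_BYTES   # how many times smaller

print("gzipped size is %.1f%% of the original" % (ratio * 100))
print("that is roughly %.1fx smaller" % factor)
```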


> I suppose it does (I think I saw a gzip-stuffer somewhere).
> 
> Is it used for docs.plt-scheme.org?
> 
> NB: The 8% might not be directly applicable, since the file contains
> a lot of html.

To clarify, the problem I'm tackling is the javascript index used for
searching -- the one you get if you go to docs.plt-scheme.org/search/
or if you do a search from drscheme.  It's basically the delay you get
when you open the search -- which includes both the network time and
the JS execution time, and if you do it locally then it's all just JS
time.  To see what the index file looks like now, open
"plt/doc/search/plt-index.js" (but be careful with editors that will
try to do highlights etc).
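
As a rough illustration of the raw-substring dictionary idea from
earlier in the thread, here is a minimal sketch.  It is not the
scheme used in the real index: the token list is hand-picked from the
"foo_" / "_blah" example, the greedy longest-match encoder is just
one simple choice, and the names encode, decode and TOKENS are
hypothetical.  Decoding is deliberately a single pass of lookups and
a join.

```python
# Hypothetical sketch of substring-dictionary coding on raw strings.
# TOKENS is hand-picked for illustration, not derived from the real index.

def encode(text, tokens):
    """Greedily replace the longest matching token with its integer index.

    Text that matches no token is kept as literal string runs.
    """
    # Try longer tokens first so the greedy match prefers them.
    order = sorted(range(len(tokens)), key=lambda i: -len(tokens[i]))
    out = []
    pos = 0
    while pos < len(text):
        for i in order:
            if text.startswith(tokens[i], pos):
                out.append(i)
                pos += len(tokens[i])
                break
        else:
            # No token matched here: extend the current literal run or start one.
            if out and isinstance(out[-1], str):
                out[-1] += text[pos]
            else:
                out.append(text[pos])
            pos += 1
    return out


def decode(codes, tokens):
    """Rebuild the string: integers index into tokens, strings are literals."""
    return "".join(tokens[c] if isinstance(c, int) else c for c in codes)


TOKENS = ["foo_", "_blah"]
entries = ["foo_bar", "meh_blah_foo_blah"]
encoded = [encode(e, TOKENS) for e in entries]

for original, codes in zip(entries, encoded):
    print(original, "->", codes)
```

Choosing which substrings go into the token list is the hard part
and is not addressed here; the sketch only shows that a decoder for
this kind of encoding can stay very small.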

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!


Posted on the users mailing list.