[plt-scheme] Compression dictionary

From: Eli Barzilay (eli at barzilay.org)
Date: Mon Oct 5 13:09:37 EDT 2009

On Oct  5, Jens Axel Søgaard wrote:
> 2009/10/5 Eli Barzilay <eli at barzilay.org>:
> > Does anyone know of a good and simple method for building a dictionary
> > for compression?
> 
> > Explanation: The documentation index file was ~2mb initially, and now
> > it's up to 3mb.  In addition, some things I did for compression made
> > loading it slower (like nested arrays which are used like sexprs), so
> > I'm revising the whole thing.
> 
> > Example:
> 
> > "foo_bar"
> >  "meh_blah_foo_blah"
> 
> I understand the tokens are "foo", "bar", "meh", and "blah".

Well, I'm working with just raw strings -- trying to extract meaningful
tokens leads down "regexp-hell"...  So in that example I had "_blah" as
a token in one case, and "foo_" in the other.


> How many bytes do you need to store the tokens alone?  How many
> different tokens are there?

The actual JS representation of the first example will look close to:

  search_data = ["$1_bar", "meh$2_$1$2"];
  dictionary = ["foo", "_blah"];

That is: numbers in the data marked in a way that makes them into
pointers to the dictionary, which holds strings.
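To make the substitution concrete, here is a hypothetical decoder sketch
for that format (the `expand` function and the `$N` escape convention
are my illustration of the idea, not code from the actual index file):

```javascript
// Dictionary-pointer decoding sketch: "$N" in a data string
// points at dictionary[N - 1], which holds the shared substring.
var dictionary = ["foo", "_blah"];
var search_data = ["$1_bar", "meh$2_$1$2"];

function expand(s) {
  // Replace each "$N" marker with the dictionary entry it points to.
  return s.replace(/\$(\d+)/g, function (_, n) {
    return dictionary[n - 1];
  });
}

var expanded = search_data.map(expand);
// expanded is ["foo_bar", "meh_blah_foo_blah"]
```

Note that the data stays flat strings until decode time, which is the
point of the representation discussed below.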

The reason for using this over some sexpr-like thing like

  search_data = [[1, "_bar"], ["meh", 2, "_", 1, 2]];

is that loading/executing JS code that has flat strings seemed to be
considerably faster than nested arrays.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!


Posted on the users mailing list.