[plt-scheme] Compression dictionary
On Oct 5, Jens Axel Søgaard wrote:
> 2009/10/5 Eli Barzilay <eli at barzilay.org>:
> > Does anyone know of a good and simple method for building a dictionary
> > for compression?
>
> > Explanation: The documentation index file was ~2mb initially, and now
> > it's up to 3mb. In addition, some things I did for compression make
> > loading it slower (like nested arrays which are used like Sexprs) so
> > I'm revising the whole thing.
>
> > Example:
>
> > "foo_bar"
> > "meh_blah_foo_blah"
>
> I understand the tokens are "foo", "bar", "meh", and "blah".
Well, I'm working with just raw strings -- trying to extract meaningful
tokens quickly turns into "regexp-hell"... So in that example I had
"_blah" as a token in one case, and "foo_" in another.
> How many bytes do you need to store the tokens alone? How many
> different tokens are there?
The actual JS representation of the first example will look something like:
search_data = ["$1_bar", "meh$2_$1$2"];
dictionary = ["foo", "_blah"];
That is: numbers in the data marked in a way that makes them into
pointers to the dictionary, which holds strings.
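Decoding on the JS side is then a single substitution pass over each
string.  Assuming the "$<digits>" marking shown above -- the real
marking may end up slightly different -- it is roughly:

// Sketch: expand "$N" markers (1-based) into dictionary entries; assumes
// a literal "$" never occurs in the data and a marker is never
// immediately followed by a literal digit.
function expand(s) {
  return s.replace(/\$(\d+)/g, function (m, n) {
    return dictionary[n - 1];
  });
}
// expand(search_data[1]) => "meh_blah_foo_blah"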
The reason for using this over an sexpr-like representation such as
search_data = [[1, "_bar"], ["meh", 2, "_", 1, 2]];
is that loading/executing JS code that has flat strings seemed to be
considerably faster than nested arrays.
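The kind of quick check that shows that difference (coarse, and very
browser-dependent -- not the exact measurement I did) is timing eval on
a large generated literal of each shape:

// Sketch: Date-based timing of eval on a big literal of each shape;
// the gap only shows up on large inputs.
function timeEval(src) {
  var t0 = new Date().getTime();
  eval(src);
  return new Date().getTime() - t0;
}
var flat = [], nested = [];
for (var i = 0; i < 100000; i++) {
  flat.push('"$1_bar"');
  nested.push('[1, "_bar"]');
}
var flatMs = timeEval('var a = [' + flat.join(',') + '];');
var nestedMs = timeEval('var b = [' + nested.join(',') + '];');
// Compare flatMs with nestedMs.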
--
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!