You're right. Even if I partition my data (say, into 2 GB chunks), I'm probably not going to be much faster than disk (based on Robby's data).<br>I think I'd better start reading the ports library docs (or stick to document sets <100 MB).<br> <br>s.<br><br><br><div class="gmail_quote">On Tue, Nov 4, 2008 at 7:28 PM, Eli Barzilay <span dir="ltr"><<a href="mailto:eli@barzilay.org" target="_blank">eli@barzilay.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> <div>On Nov  4, Stephen De Gabrielle wrote:<br> > I'm working with the Enron email collection, uncompressed it is 2.54<br> > GB (across 500k files), so it should be possible to play with the<br> > whole thing in RAM.<br> <br> </div>Just in case you plan to actually do that: at these sizes, multiplier<br> factors become things that you should be aware of:<br> <br> * In general, the GC requires more memory than you actually use.  I<br>  think that generally speaking you should plan on it holding twice<br>  the RAM that you actually need.  (Even though it can be smaller with<br>  generations.)<br> <br> * MzScheme holds strings in UCS-4 format, so each character is 4<br>  bytes.<br> <br> In other words, you might need around 20 GB of RAM just to read it all<br> in.<br> <div><br> --<br>          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:<br>                  <a href="http://www.barzilay.org/" target="_blank">http://www.barzilay.org/</a>                 Maze is Life!<br> _________________________________________________<br> </div><div><div></div><div>  For list-related administrative tasks:<br>  <a href="http://list.cs.brown.edu/mailman/listinfo/plt-scheme" target="_blank">http://list.cs.brown.edu/mailman/listinfo/plt-scheme</a><br> </div></div></blockquote></div><br>
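For what it's worth, the 20 GB figure in the quoted message seems to follow from simply multiplying the two factors together. Here is a rough back-of-the-envelope sketch in Python. It assumes the corpus is mostly single-byte text on disk (so bytes roughly equal characters) and uses decimal gigabytes; the function name and its default parameters are made up for illustration and are not part of MzScheme.

```python
def estimated_ram_bytes(on_disk_bytes, bytes_per_char=4, gc_factor=2):
    """Rough RAM estimate for holding text as in-memory strings.

    Assumptions (hypothetical helper, not an MzScheme API):
    - on_disk_bytes is about equal to the number of characters
      (mostly single-byte text on disk),
    - each character takes bytes_per_char bytes in memory (UCS-4 -> 4),
    - the GC holds about gc_factor times the memory actually used.
    """
    return on_disk_bytes * bytes_per_char * gc_factor


GB = 1_000_000_000  # decimal gigabyte
MB = 1_000_000      # decimal megabyte

for label, size in [
    ("whole corpus (2.54 GB)", 2_540_000_000),
    ("one 2 GB chunk", 2 * GB),
    ("100 MB document set", 100 * MB),
]:
    print(f"{label}: ~{estimated_ram_bytes(size) / GB:.1f} GB of RAM")
```

Under the same assumptions, a 2 GB chunk would still want roughly 16 GB, while a 100 MB document set comes to under 1 GB.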