[plt-scheme] reading a whole file

From: Richard Cleis (rcleis at mac.com)
Date: Tue Nov 4 17:04:35 EST 2008


On Nov 4, 2008, at 12:36 PM, Ethan Herdrick <info at reatlas.com> wrote:

> Isn't it silly that we all have our own version of this?

Don't we all have our own versions because the requirements, OSes,
and environments aren't always the same?

This discussion has shown that separators, error messages, and file
size are issues that warrant different solutions.  Isn't it better to
write the function in less time than it would take to sift through
SRFIs?

Isn't it also true that the first pass through a file ought to do
'one step more' than simply create a Moby String Object?  In other
words, impressions of efficiency can be skewed by adopting general
file I/O functions, whereas specific functions can be written to
match the problem.
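
As a rough sketch of that 'one step more' idea (in Python only for
brevity; the function names and the "Subject:" filter are hypothetical,
not anything from this thread), the reader can do its real work while
streaming instead of first building one giant string:

```python
import os
import tempfile


def count_matching_lines(path, needle, encoding="utf-8"):
    """Count lines containing `needle` while streaming the file once."""
    count = 0
    # errors="replace" is an assumption; a stricter caller may prefer to fail loudly.
    with open(path, encoding=encoding, errors="replace") as f:
        for line in f:  # only one line is held in memory at a time
            if needle in line:
                count += 1
    return count


def demo():
    """Write a small temporary file and count its 'Subject:' lines."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("From: a\nSubject: hello\n\nbody\nSubject: again\n")
        path = tmp.name
    try:
        return count_matching_lines(path, "Subject:")
    finally:
        os.remove(path)
```

The separator, the error policy, and the per-line work are exactly the
parts that differ from one problem to the next, which is why a
general-purpose "slurp" function rarely fits as-is.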

RAC

> Mine has
> some very basic error handling but probably doesn't do the right thing
> about Unicode.  The inverse, something like string->file, is also
> indispensable.  These functions and their like are some of the first
> things you need for hacking useful things up.  Shouldn't they be in a
> SRFI?  Better yet, built in?
>
>
> On Tue, Nov 4, 2008 at 12:03 PM, Stephen De Gabrielle
> <spdegabrielle at gmail.com> wrote:
>> You're right. Even if I partition my data (say, 2 GB chunks) I'm
>> probably not that much faster than disk (based on Robby's data).
>> I think I'd better start reading the ports library docs (or stick to
>> document sets <100 MB).
>>
>> s.
>>
>>
>> On Tue, Nov 4, 2008 at 7:28 PM, Eli Barzilay <eli at barzilay.org> wrote:
>>>
>>> On Nov  4, Stephen De Gabrielle wrote:
>>>> I'm working with the Enron email collection. Uncompressed, it is
>>>> 2.54 GB (across 500k files), so it should be possible to play with
>>>> the whole thing in RAM.
>>>
>>> Just in case you plan to actually do that: at these sizes, multiplier
>>> factors become things that you should be aware of:
>>>
>>> * In general, the GC requires more memory than you actually use.  I
>>>   think that generally speaking you should plan on it holding twice
>>>   the RAM that you actually need.  (Even though it can be smaller with
>>>   generations.)
>>>
>>> * MzScheme holds strings in UCS-4 format, so each character is 4
>>> bytes.
>>>
>>> In other words, you might need around 20 GB of RAM just to read it
>>> all in.
>>>
>>> --
>>>         ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
>>>                 http://www.barzilay.org/                 Maze is Life!
>>> _________________________________________________
>>> For list-related administrative tasks:
>>> http://list.cs.brown.edu/mailman/listinfo/plt-scheme
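
For reference, the figure of around 20 GB quoted above follows from
multiplying the two factors Eli lists.  A small sketch of the
arithmetic, assuming roughly one byte per character on disk (mostly
ASCII mail); the function name and defaults are only illustrative:

```python
def estimated_ram_gb(data_gb, bytes_per_char=4, gc_factor=2):
    """Rough RAM needed to hold `data_gb` of text as in-memory strings.

    Assumes about one byte per character on disk, 4 bytes per character
    in memory (UCS-4), and a GC that may hold about twice the live data.
    All three factors are estimates, not measurements.
    """
    return data_gb * bytes_per_char * gc_factor


print(round(estimated_ram_gb(2.54), 2))  # 2.54 * 4 * 2
```

Reading the files as bytes rather than as strings would drop the
factor of 4, at the cost of handling the decoding yourself.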


Posted on the users mailing list.