<div dir="ltr">Symbols are stored internally in utf-8, I believe.</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jan 15, 2013 at 5:14 PM, Danny Yoo <span dir="ltr"><<a href="mailto:dyoo@hashcollision.org" target="_blank">dyoo@hashcollision.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">>> >> 1. First, pull all the content of the input port into a string<br>
>> >> port. This cut down the runtime from 52 seconds to 45<br>
>> >> seconds. (15% improvement)<br>
>> ><br>
>> > I don't think that this is a good idea -- it looks like a dangerous<br>
>> > assumption for a generic library to make, instead of letting users<br>
>> > decide for themselves if they want to do so and hand the port to<br>
>> > the library.<br>
<br>
</div>Wow. Ok, I see what you mean now, and yeah, my optimization here is<br>
unsound. I did not know the JSON library behaved in a streaming<br>
manner. Thanks!<br>
<div class="im"><br>
<br>
<br>
>> When I watch `top` and see how much memory's being used in the<br>
>> original code, I think this is a red herring, for the unoptimized<br>
>> json parser is already consuming around 500MB of ram on J G Cho's<br>
>> 92MB file during the parse.<br>
><br>
> Is the *result* 500mb or the memory used while parsing? If it's the<br>
> former, then that's not the consumption that is increased. (BTW, if<br>
> most of it is made of strings, then we get the 4x UCS-4 factor.) If<br>
> it's the latter then I'm surprised.<br>
<br>
</div>Yeah, the input JSON file is full of string literals from casual<br>
inspection, so I think you're right about the UCS-4 explanation. It's<br>
too bad; I had assumed that Racket used UTF-8, since I've seen so many<br>
instances of bytes->string/utf-8 in Racket code.<br>
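<br>
As a rough illustration of that 4x factor, here is a minimal sketch in<br>
Python (used only for convenience; it compares encoded sizes and does<br>
not measure Racket's actual in-memory representation). The sample<br>
string is made up, standing in for the mostly-ASCII string literals in<br>
the JSON file.<br>

```python
# Compare the encoded size of a mostly-ASCII string under UTF-8
# (1 byte per ASCII character) and fixed-width UTF-32 (4 bytes per
# code point). The sample text is hypothetical.
sample = '{"name": "Racket", "kind": "language"}' * 1000

utf8_size = len(sample.encode("utf-8"))
# The "-le" variant is used so that no byte-order mark is prepended.
utf32_size = len(sample.encode("utf-32-le"))

ratio = utf32_size / utf8_size
print(utf8_size, utf32_size, ratio)
```

For pure ASCII input the ratio is exactly 4; text with many non-ASCII<br>
characters takes more than one byte each in UTF-8, so the gap narrows.<br>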
<div class="im"><br>
<br>
<br>
>> >> 2. Modified read-list so it avoids using regular expressions when<br>
>> >> simpler peek-char/read-char operations suffice. Reduced the runtime<br>
>> >> from 45 seconds to 40 seconds. (12% improvement)<br>
>> ><br>
>> > This is a questionable change, IMO. The thing is that keeping<br>
>> > things with regexps makes it easy to revise and modify in the<br>
>> > future, but switching to a single character thing makes it hard<br>
>> > and in addition requires the code to know when to use regexps and<br>
>> > when to use a character. I prefer in this case the code<br>
>> > readability over performance.<br>
<br>
</div>Ok, I'll abandon this specific patch for now.<br>
<br>
It sounds as though Ray Racine mentioned that his TR-ed version of<br>
the code performs faster than the non-TR-ed version? Ray, do you have<br>
that version available somewhere to play with?<br>
<br>
---<br>
<br>
I did push one change to the JSON library on master: the replacement<br>
of the non-greedy regexp with the char-complement version. I also<br>
added several test cases to make sure I got it right.<br>
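<br>
For readers unfamiliar with the idea, here is a minimal sketch of that<br>
kind of rewrite, in Python rather than Racket. Both patterns are<br>
hypothetical and are not the ones in the JSON library; they only show<br>
that "anything, non-greedily, up to a quote" and "any run of non-quote<br>
characters, then a quote" match the same text.<br>

```python
import re

# Hypothetical patterns for illustration only.
# Non-greedy: take as little as possible until the first double quote.
NON_GREEDY = re.compile(r'(.*?)"', re.DOTALL)
# Character complement: take every character that is not a double quote.
CHAR_COMPLEMENT = re.compile(r'([^"]*)"')

text = 'hello world" and the rest'

a = NON_GREEDY.match(text).group(1)
b = CHAR_COMPLEMENT.match(text).group(1)
print(a == b, a)
```

The complement form states directly which characters may appear, so the<br>
matcher does not need to retry at each position the way a non-greedy<br>
quantifier can.<br>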
<br>
Thanks again for the review!<br>
<div class="HOEnZb"><div class="h5">____________________<br>
Racket Users list:<br>
<a href="http://lists.racket-lang.org/users" target="_blank">http://lists.racket-lang.org/users</a><br>
</div></div></blockquote></div><br></div>