<div dir="ltr">Symbols are stored internally in utf-8, I believe.</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jan 15, 2013 at 5:14 PM, Danny Yoo <span dir="ltr">&lt;<a href="mailto:dyoo@hashcollision.org" target="_blank">dyoo@hashcollision.org</a>&gt;</span> wrote:<br> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">>> >> 1. First, pull all the content of the input port into a string<br> >> >> port. This cut down the runtime from 52 seconds to 45<br> >> >> seconds. (15% improvement)<br> >> ><br> >> > I don't think that this is a good idea -- it looks like a dangerous<br> >> > assumption for a generic library to make, instead of letting users<br> >> > decide for themselves if they want to do so and hand the port to<br> >> > the library.<br> <br> </div>Wow. Ok, I see what you mean now, and yeah, my optimization here is<br> unsound. I did not know the JSON library behaved in a streaming<br> manner. Thanks!<br> <div class="im"><br> <br> <br> >> When I watch `top` and see how much memory's being used in the<br> >> original code, I think this is a red herring, for the unoptimized<br> >> json parser is already consuming around 500MB of ram on J G Cho's<br> >> 92MB file during the parse.<br> ><br> > Is the *result* 500mb or the memory used while parsing? If it's the<br> > former, then that's not the consumption that is increased. (BTW, if<br> > most of it is made of strings, then we get the 4x UCS32 factor.) If<br> > it's the latter then I'm surprised.<br> <br> </div>Yeah, the input JSON file is full of string literals from casual<br> inspection, so I think you're right about the UCS32 explanation. It's<br> too bad; I had assumed that Racket used utf-8, since I've seen so many<br> instances of bytes->string/utf-8 in Racket code.<br> <div class="im"><br> <br> <br> >> >> 2. 
Modified read-list so it avoids using regular expressions when<br> >> >> simpler peek-char/read-char operations suffice. Reduced the runtime<br> >> >> from 45 seconds to 40 seconds. (12% improvement)<br> >> ><br> >> > This is a questionable change, IMO. The thing is that keeping<br> >> > things with regexps makes it easy to revise and modify in the<br> >> > future, but switching to a single character thing makes it hard<br> >> > and in addition requires the code to know when to use regexps and<br> >> > when to use a character. I prefer in this case the code<br> >> > readability over performance.<br> <br> </div>Ok, I'll abandon this specific patch for now.<br> <br> It sounds as though Ray Racine mentioned that his TR-ed version of<br> the code performs faster than the non-TRed version? Ray, do you have<br> that version available somewhere to play with?<br> <br> ---<br> <br> I did push master with one change to the JSON library: the replacement<br> of the non-greedy regexp with the char-complement version. I also<br> added several test cases to make sure I got it right.<br> <br> Thanks again for the review!<br> <div class="HOEnZb"><div class="h5">____________________<br> Racket Users list:<br> <a href="http://lists.racket-lang.org/users" target="_blank">http://lists.racket-lang.org/users</a><br> </div></div></blockquote></div><br></div>
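<div>For a rough sense of the "4x" factor discussed in the thread, here is a minimal sketch, written in Python rather than Racket, that compares the size of the same text in UTF-8 and in a fixed 4-bytes-per-character encoding. The helper name <code>encoded_sizes</code> and the sample string are made up for illustration, and this only measures encoded bytes; it says nothing about how Racket actually lays out strings or symbols in memory.</div>

```python
def encoded_sizes(text):
    """Return (utf8_bytes, fixed4_bytes) for the given text.

    "utf-32-le" is used as a stand-in for a fixed 4-bytes-per-character
    representation; the "-le" variant avoids counting a byte-order mark.
    """
    utf8_bytes = len(text.encode("utf-8"))
    fixed4_bytes = len(text.encode("utf-32-le"))
    return utf8_bytes, fixed4_bytes


# Hypothetical sample: mostly-ASCII JSON text, as described in the thread.
sample = '{"title": "example", "tags": ["json", "parser"]}'
utf8_bytes, fixed4_bytes = encoded_sizes(sample)
print(utf8_bytes, fixed4_bytes, fixed4_bytes / utf8_bytes)
```

<div>For pure ASCII text the ratio is exactly 4, so under that assumption 92MB of mostly-ASCII string data would come to roughly 368MB of character data at 4 bytes per character. That is in the same ballpark as the ~500MB figure quoted above, though it does not account for any other parser overhead.</div>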