[racket-dev] Optional equality predicate for assoc and member
Why is the turtle on the stack in the gcc version?
Robby
On Sun, Apr 24, 2011 at 9:44 AM, Matthew Flatt <mflatt at cs.utah.edu> wrote:
> I've pushed the change that makes `assoc', `assv', and `assq'
> implemented in Racket, including the new optional argument for `assoc'.
>
> At Fri, 22 Apr 2011 19:52:34 -0400, Eli Barzilay wrote:
>> 7 minutes ago, Matthew Flatt wrote:
>> >
>> > Switching to a Racket implementation costs time, but (with the JIT)
>> > it's not a huge difference --- similar to the cost of switching the C
>> > code from 32-bit to 64-bit mode on my machine. The JIT even manages to
>> > make `assq' slightly faster than C in 64-bit mode.
>>
>> Is there some obvious reason for the huge difference in improvement
>> between the 32 and the 64 bits (Almost twice slower and roughly the
>> same resp.)?
>
> Like Noel, I thought at first it might be memory-bound. Now I think
> it's just that the compiler on my machine (or the way the compiler is
> configured) handles x86_64 not as well as x86.
>
>> > The cost for a non-JIT platform is why I hadn't changed `assq',
>> > `assv', and `assoc' before. But a plain-Racket implementation is
>> > surely better in the long run, and we can keep the C code and bind
>> > functions conditionally when the JIT is disabled. Maybe we can
>> > eventually close the gap between the JIT-generated code and C. Also,
>> > I think the JIT could be smarter about `equal?', which could tilt
>> > the balance in favor of the JIT for `assoc'.
>>
>> (Maybe it'll be feasible to dump jit-generated machine code at some
>> point...?)
>
> Just so you can see the generated code? I think Sam's tool is on the
> right track. Otherwise, I'm not sure what you're suggesting.
>
>
>> > I'm happy that the compiler is doing its job, but inlining
>> > optimizations are fragile, so I'm a little reluctant to rely on
>> > them. Sad but true: the macro approach feels more comfortable for
>> > now.
>>
>> If by "macro approach" you refer to what I was talking about, then
>> isn't that needed to have versions with inlined comparators?
>
> Every time I look into changing `map' or other functions into macros, I
> conclude that it would be better to implement cross-module inlining.
>
>
> At Fri, 22 Apr 2011 19:27:49 -0500, Robby Findler wrote:
>> What are the non-JIT platforms nowadays?
>
> Sparc and ARM, probably. I expect Sparc to fade away, and I imagine
> adding ARM support to the JIT, eventually.
>
> Currently, the Racket implementation of `assq', etc., is used always.
> I've thought about ways to fall back to the C implementation when the
> JIT is unavailable, but I haven't settled on anything yet.
>
>
>
> Below are revised results for the performance test. The old
> C-implemented `assq' is now `kernel:assq', and the Racket-implemented
> `assq2' is now `assq'.
>
> The improvements are from adjusting the Racket code slightly and
> improving the JIT in various small ways. There's more room for
> improvement, obviously, and I stopped only because I've had enough for
> now.
>
> For those who are *really* interested, I've included annotated,
> disassembled JIT output at the end of this message. The JIT-generated
> core loop and the gcc-generated core loop are very similar; there's
> four times as much JIT-generated code because it unrolls the loop to
> four copies. (Both versions test two elements of the list in each
> iteration, because the "turtle" to detect cycles is incremented only with
> the second test.)
>
> ----------------------------------------
> 32-bit run:
>
> 'kern:assq
> cpu time: 23 real time: 23 gc time: 0
> cpu time: 38 real time: 39 gc time: 0
> cpu time: 39 real time: 38 gc time: 0
> 'assq
> cpu time: 38 real time: 39 gc time: 0
> cpu time: 60 real time: 61 gc time: 0
> cpu time: 61 real time: 61 gc time: 0
>
> 'kern:assoc
> cpu time: 60 real time: 60 gc time: 0
> cpu time: 563 real time: 567 gc time: 0
> cpu time: 554 real time: 555 gc time: 0
> 'assoc
> cpu time: 74 real time: 74 gc time: 0
> cpu time: 589 real time: 592 gc time: 0
> cpu time: 582 real time: 584 gc time: 0
>
> ----------------------------------------
> 64-bit run:
>
> 'kern:assq
> cpu time: 36 real time: 36 gc time: 0
> cpu time: 117 real time: 117 gc time: 0
> cpu time: 118 real time: 118 gc time: 0
> 'assq
> cpu time: 39 real time: 39 gc time: 0
> cpu time: 66 real time: 65 gc time: 0
> cpu time: 65 real time: 65 gc time: 0
>
> 'kern:assoc
> cpu time: 81 real time: 81 gc time: 0
> cpu time: 860 real time: 860 gc time: 0
> cpu time: 798 real time: 798 gc time: 0
> 'assoc
> cpu time: 83 real time: 83 gc time: 0
> cpu time: 782 real time: 789 gc time: 0
> cpu time: 738 real time: 739 gc time: 0
>
>
> ----------------------------------------
>
> The main `assq' loop.
> Arguments: list, turtle; environment: value to find, original list.
>
> 0x0676707a: mov 0xc(%ebx),%eax [get list off stack]
> 0x0676707d: test $0x1,%al
> 0x06767080: jne 0x676756b [to fail: fixnum instead of list]
> 0x06767086: movswl (%eax),%ecx
> 0x06767089: cmp $0x34,%ecx
> 0x0676708f: jne 0x676756b [to fail: not a pair]
> 0x06767095: mov 0x4(%eax),%eax [car => item]
> 0x06767098: mov %eax,-0x4(%ebx) [save item on stack]
> 0x0676709b: add $0xfffffffc,%ebx [adjust stack pointer]
> 0x0676709e: test $0x1,%al
> 0x067670a1: jne 0x676752a [to fail: item is fixnum]
> 0x067670a7: movswl (%eax),%ecx
> 0x067670aa: cmp $0x34,%ecx
> 0x067670b0: jne 0x676752a [to fail: item is not a pair]
> 0x067670b6: mov 0x4(%eax),%eax [car of item]
> 0x067670b9: mov %eax,%ecx
> 0x067670bb: mov 0x8(%ebx),%eax [get key to find off stack]
> 0x067670be: cmp %ecx,%eax [found key?]
> 0x067670c0: jne 0x67670cd [jump if not...]
> 0x067670c6: mov (%ebx),%eax [yes... get item from stack]
> 0x067670c8: jmp 0x6767525 [and return item]
> 0x067670cd: mov 0x10(%ebx),%eax [no... get list from stack]
> 0x067670d0: mov 0x8(%eax),%eax [cdr of list]
> 0x067670d3: mov %eax,-0x4(%ebx) [save new list to stack]
> 0x067670d6: add $0xfffffffc,%ebx [adjust stack pointer]
> 0x067670d9: mov 0x18(%ebx),%ecx [get turtle from stack]
> 0x067670dc: cmp %ecx,%eax [(eq? turtle list)]
> 0x067670de: jne 0x6767111 [normally jumps...]
> 0x067670e4: mov $0x67675c0,%eax [fail: found cycle...]
> 0x067670e9: mov (%eax),%eax [... call error function ...]
> 0x067670eb: mov %eax,-0xc(%ebx)
> 0x067670ee: mov $0x67675c4,%eax
> 0x067670f3: mov (%eax),%eax
> 0x067670f5: mov %eax,-0x8(%ebx)
> 0x067670f8: mov 0x10(%ebx),%eax
> 0x067670fb: mov %eax,-0x4(%ebx)
> 0x067670fe: mov $0x34540,%esi
> 0x06767103: add $0xfffffff4,%ebx
> 0x06767106: mov %ebx,0x44c(%edi)
> 0x0676710c: jmp 0x376c90
> 0x06767111: mov (%ebx),%eax [get list from stack... (redundant!)]
> 0x06767113: test $0x1,%al [... starts another iteration of above]
> 0x06767116: jne 0x67674e0 [... and there are 4 copies total]
> 0x0676711c: movswl (%eax),%ecx
> 0x0676711f: cmp $0x34,%ecx
> 0x06767125: jne 0x67674e0
> 0x0676712b: mov 0x4(%eax),%eax
> 0x0676712e: mov %eax,-0x4(%ebx)
> 0x06767131: add $0xfffffffc,%ebx
> 0x06767134: test $0x1,%al
> 0x06767137: jne 0x676749f
> 0x0676713d: movswl (%eax),%ecx
> 0x06767140: cmp $0x34,%ecx
> 0x06767146: jne 0x676749f
> 0x0676714c: mov 0x4(%eax),%eax
> 0x0676714f: mov %eax,%ecx
> 0x06767151: mov 0x10(%ebx),%eax
> 0x06767154: cmp %ecx,%eax
> 0x06767156: jne 0x6767163
> 0x0676715c: mov (%ebx),%eax
> 0x0676715e: jmp 0x676749a
> 0x06767163: mov 0x1c(%ebx),%eax
> 0x06767166: mov 0x8(%eax),%eax
> 0x06767169: mov %eax,-0x4(%ebx)
> 0x0676716c: mov 0x4(%ebx),%eax
> 0x0676716f: mov 0x8(%eax),%eax
> 0x06767172: mov %eax,-0x8(%ebx)
> 0x06767175: add $0xfffffff8,%ebx
> 0x06767178: mov 0x4(%ebx),%ecx
> 0x0676717b: cmp %ecx,%eax
> 0x0676717d: jne 0x67671b0
> 0x06767183: mov $0x67675c8,%eax
> 0x06767188: mov (%eax),%eax
> 0x0676718a: mov %eax,-0xc(%ebx)
> 0x0676718d: mov $0x67675cc,%eax
> 0x06767192: mov (%eax),%eax
> 0x06767194: mov %eax,-0x8(%ebx)
> 0x06767197: mov 0x1c(%ebx),%eax
> 0x0676719a: mov %eax,-0x4(%ebx)
> 0x0676719d: mov $0x34540,%esi
> 0x067671a2: add $0xfffffff4,%ebx
> 0x067671a5: mov %ebx,0x44c(%edi)
> 0x067671ab: jmp 0x376c90
> 0x067671b0: mov (%ebx),%eax [get list from stack... (redundant!)]
> 0x067671b2: test $0x1,%al
> 0x067671b5: jne 0x6767455
> 0x067671bb: movswl (%eax),%ecx
> 0x067671be: cmp $0x34,%ecx
> 0x067671c4: jne 0x6767455
> 0x067671ca: mov 0x4(%eax),%eax
> 0x067671cd: mov %eax,-0x4(%ebx)
> 0x067671d0: add $0xfffffffc,%ebx
> 0x067671d3: test $0x1,%al
> 0x067671d6: jne 0x6767414
> 0x067671dc: movswl (%eax),%ecx
> 0x067671df: cmp $0x34,%ecx
> 0x067671e5: jne 0x6767414
> 0x067671eb: mov 0x4(%eax),%eax
> 0x067671ee: mov %eax,%ecx
> 0x067671f0: mov 0x1c(%ebx),%eax
> 0x067671f3: cmp %ecx,%eax
> 0x067671f5: jne 0x6767202
> 0x067671fb: mov (%ebx),%eax
> 0x067671fd: jmp 0x676740f
> 0x06767202: mov 0x4(%ebx),%eax
> 0x06767205: mov 0x8(%eax),%eax
> 0x06767208: mov %eax,-0x4(%ebx)
> 0x0676720b: add $0xfffffffc,%ebx
> 0x0676720e: mov 0xc(%ebx),%ecx
> 0x06767211: cmp %ecx,%eax
> 0x06767213: jne 0x6767246
> 0x06767219: mov $0x67675d0,%eax
> 0x0676721e: mov (%eax),%eax
> 0x06767220: mov %eax,-0xc(%ebx)
> 0x06767223: mov $0x67675d4,%eax
> 0x06767228: mov (%eax),%eax
> 0x0676722a: mov %eax,-0x8(%ebx)
> 0x0676722d: mov 0x24(%ebx),%eax
> 0x06767230: mov %eax,-0x4(%ebx)
> 0x06767233: mov $0x34540,%esi
> 0x06767238: add $0xfffffff4,%ebx
> 0x0676723b: mov %ebx,0x44c(%edi)
> 0x06767241: jmp 0x376c90
> 0x06767246: mov (%ebx),%eax [get list from stack... (redundant!)]
> 0x06767248: test $0x1,%al
> 0x0676724b: jne 0x67673ca
> 0x06767251: movswl (%eax),%ecx
> 0x06767254: cmp $0x34,%ecx
> 0x0676725a: jne 0x67673ca
> 0x06767260: mov 0x4(%eax),%eax
> 0x06767263: mov %eax,-0x4(%ebx)
> 0x06767266: add $0xfffffffc,%ebx
> 0x06767269: test $0x1,%al
> 0x0676726c: jne 0x6767389
> 0x06767272: movswl (%eax),%ecx
> 0x06767275: cmp $0x34,%ecx
> 0x0676727b: jne 0x6767389
> 0x06767281: mov 0x4(%eax),%eax
> 0x06767284: mov %eax,%ecx
> 0x06767286: mov 0x24(%ebx),%eax
> 0x06767289: cmp %ecx,%eax
> 0x0676728b: jne 0x6767298
> 0x06767291: mov (%ebx),%eax
> 0x06767293: jmp 0x6767384
> 0x06767298: mov 0x10(%ebx),%eax
> 0x0676729b: mov 0x8(%eax),%eax
> 0x0676729e: mov %eax,-0x4(%ebx)
> 0x067672a1: mov 0x4(%ebx),%eax
> 0x067672a4: mov 0x8(%eax),%eax
> 0x067672a7: mov %eax,-0x8(%ebx)
> 0x067672aa: add $0xfffffff8,%ebx
> 0x067672ad: mov 0x4(%ebx),%ecx
> 0x067672b0: cmp %ecx,%eax
> 0x067672b2: jne 0x67672e5
> 0x067672b8: mov $0x67675d8,%eax
> 0x067672bd: mov (%eax),%eax
> 0x067672bf: mov %eax,-0xc(%ebx)
> 0x067672c2: mov $0x67675dc,%eax
> 0x067672c7: mov (%eax),%eax
> 0x067672c9: mov %eax,-0x8(%ebx)
> 0x067672cc: mov 0x30(%ebx),%eax
> 0x067672cf: mov %eax,-0x4(%ebx)
> 0x067672d2: mov $0x34540,%esi
> 0x067672d7: add $0xfffffff4,%ebx
> 0x067672da: mov %ebx,0x44c(%edi)
> 0x067672e0: jmp 0x376c90
> 0x067672e5: mov (%ebx),%eax [get list from stack... (redundant!)]
> 0x067672e7: mov %eax,-0x8(%ebx) [... save it(!) to prep recursive call]
> 0x067672ea: mov 0x4(%ebx),%eax [get turtle from stack]
> 0x067672ed: add $0xfffffff8,%ebx [adjust stack pointer]
> 0x067672f0: lea 0x64(%edi),%edx [check whether other threads needs to run...]
> 0x067672f3: mov (%edx),%edx
> 0x067672f5: cmp $0x0,%edx
> 0x067672fb: jle 0x6767312 [to slow: for thread swap]
> 0x067672fd: mov -0x1c(%ebp),%edx [argv = end of orig argv (tail call)]
> 0x06767300: add $0xffffffec,%edx [... minus orig argc for args in-place]
> 0x06767303: mov %eax,0x10(%edx) [set turtle argument for recur]
> 0x06767306: mov (%ebx),%eax [load list back]
> 0x06767308: mov %eax,0xc(%edx) [... and set list argument for recur]
> 0x0676730b: mov %edx,%ebx [set Racket stack to argv]
> 0x0676730d: jmp 0x676707a [recur]
>
> ----------------------------------------
> For comparsion, the gcc-compiled `assq' function's loop:
>
> 0x008c6480 <assq+112>: mov -0x20(%ebp),%edx [get l off stack]
> 0x008c6483 <assq+115>: test $0x1,%dl
> 0x008c6486 <assq+118>: jne 0x8c6523 <assq+275> [to fail: l is fixnum]
> 0x008c648c <assq+124>: cmpw $0x34,(%edx)
> 0x008c6490 <assq+128>: jne 0x8c6523 <assq+275> [to fail: not a pair]
> 0x008c6496 <assq+134>: mov 0x4(%edx),%ecx [item = car of l]
> 0x008c6499 <assq+137>: mov %ecx,-0x1c(%ebp) [save item on stack]
> 0x008c649c <assq+140>: test $0x1,%cl
> 0x008c649f <assq+143>: jne 0x8c6588 <assq+376> [to fail: item is fixnum]
> 0x008c64a5 <assq+149>: cmpw $0x34,(%ecx)
> 0x008c64a9 <assq+153>: jne 0x8c6588 <assq+376> [to fail: item is not pair]
> 0x008c64af <assq+159>: mov 0xc(%ebp),%esi [get x arg off stack]
> 0x008c64b2 <assq+162>: mov (%esi),%eax [found x?]
> 0x008c64b4 <assq+164>: cmp 0x4(%ecx),%eax
> 0x008c64b7 <assq+167>: je 0x8c6648 <scheme_get_thread_local_variables>
> 0x008c64bd <assq+173>: mov 0x8(%edx),%edx [no, so get cdr of l]
> 0x008c64c0 <assq+176>: mov %edx,-0x20(%ebp)
> 0x008c64c3 <assq+179>: test $0x1,%dl
> 0x008c64c6 <assq+182>: jne 0x8c6480 <assq+112> [to fail: cdr is fixnum]
> 0x008c64c8 <assq+184>: cmpw $0x34,(%edx)
> 0x008c64cc <assq+188>: jne 0x8c6480 <assq+112> [to fail: cdr is not a pair]
> 0x008c64ce <assq+190>: mov 0x4(%edx),%ecx [item = car l]
> 0x008c64d1 <assq+193>: mov %ecx,-0x1c(%ebp)
> 0x008c64d4 <assq+196>: test $0x1,%cl
> 0x008c64d7 <assq+199>: jne 0x8c6480 <assq+112> [fail: item is a fixnum]
> 0x008c64d9 <assq+201>: cmpw $0x34,(%ecx)
> 0x008c64dd <assq+205>: jne 0x8c6480 <assq+112> [fail: item is not a pair]
> 0x008c64df <assq+207>: mov (%esi),%eax [get x off stack]
> 0x008c64e1 <assq+209>: cmp 0x4(%ecx),%eax [car of item == x ?]
> 0x008c64e4 <assq+212>: je 0x8c6648 <scheme_get_thread_local_variables>
> 0x008c64ea <assq+218>: mov 0x8(%edx),%eax
> 0x008c64ed <assq+221>: mov %eax,-0x20(%ebp)
> 0x008c64f0 <assq+224>: mov -0x24(%ebp),%edx [get turtle off stack]
> 0x008c64f3 <assq+227>: cmp %edx,%eax [is l == turtle?]
> 0x008c64f5 <assq+229>: je 0x8c6523 <assq+275> [to fail: cycle]
> 0x008c64f7 <assq+231>: mov 0x8(%edx),%eax [cdr of turtle]
> 0x008c64fa <assq+234>: mov %eax,-0x24(%ebp) [save new turtle to stack]
> 0x008c64fd <scheme_get_thread_local_variables+0>: mov (%edi),%eax
> 0x008c64ff <scheme_get_thread_local_variables+2>: mov %gs:0x48(,%eax,4),%eax
> 0x008c6507 <assq+247>: mov 0x68(%eax),%eax
> 0x008c650a <assq+250>: test %eax,%eax [other threads need to run?]
> 0x008c650c <assq+252>: jg 0x8c6480 <assq+112> [to slow: other threads run]
> 0x008c6512 <assq+258>: movl $0x4,-0x50(%ebp)
> 0x008c6519 <assq+265>: call 0x9e44a0 <scheme_out_of_fuel>
> 0x008c651e <assq+270>: jmp 0x8c6480 <assq+112> [recur]
>
> ----------------------------------------
>
> The `assq' wrapper that jumps to the above loop. The argument
> juggling here --- turning the arg sequence `x l' to `l l l x'
> --- may be part of the slowdown.
>
> 0x0037e41c: cmp $0x2,%ecx [arity check]
> 0x0037e422: jne 0x372e80
> 0x0037e428: mov %eax,-0x4(%ebx) [save closure (but no GC possible!)]
> 0x0037e42b: add $0xfffffffc,%ebx
> 0x0037e42e: mov 0x4(%ebx),%eax [get x arg]
> 0x0037e431: mov %eax,-0x10(%ebx) [prep x arg to `loop']
> 0x0037e434: mov 0x8(%ebx),%eax [get l arg]
> 0x0037e437: mov %eax,-0xc(%ebx) [prep l arg to `loop']
> 0x0037e43a: mov %eax,-0x8(%ebx) [prep orig-l arg to `loop']
> 0x0037e43d: mov %eax,-0x4(%ebx) [prep turtle arg to `loop']
> 0x0037e440: mov $0x37e47c,%esi
> 0x0037e445: mov (%esi),%esi [esi now has `loop' closure pointer]
> 0x0037e447: add $0xfffffff0,%ebx [update Racket stack register]
> 0x0037e44a: mov -0x1c(%ebp),%edx [get Racket stack base for tail calls]
> 0x0037e44d: add $0xfffffff0,%edx [4 argument => subtract 16]
> 0x0037e450: mov 0xc(%ebx),%ecx [move arg1 into place...]
> 0x0037e453: mov %ecx,0xc(%edx)
> 0x0037e456: mov 0x8(%ebx),%ecx
> 0x0037e459: mov %ecx,0x8(%edx)
> 0x0037e45c: mov 0x4(%ebx),%ecx
> 0x0037e45f: mov %ecx,0x4(%edx)
> 0x0037e462: mov (%ebx),%ecx
> 0x0037e464: mov %ecx,(%edx)
> 0x0037e466: mov %edx,%ebx [runstack = argv]
> 0x0037e468: mov %esi,%eax
> 0x0037e46a: mov $0x4,%ecx [argc = 4]
> 0x0037e46f: mov %ebx,%edx [argv = runstack (redundant!)]
> 0x0037e471: jmp 0x0676707a
>
> ----------------------------------------
> Then again, here's a comparable section of the gcc-compiled `assq'
> function's preamble that cooperates with the GC and sets up the loop.
>
> 0x008c641f <scheme_get_thread_local_variables+0>: mov 0x176c06(%ebx),%edi
> 0x008c6425 <scheme_get_thread_local_variables+6>: mov (%edi),%eax
> 0x008c6427 <scheme_get_thread_local_variables+8>: mov %gs:0x48(,%eax,4),%edx
> 0x008c642f <assq+31>: mov 0x4(%edx),%edx
> 0x008c6432 <assq+34>: mov %edx,-0x54(%ebp)
> 0x008c6435 <scheme_get_thread_local_variables+0>: mov %gs:0x48(,%eax,4),%eax
> 0x008c643d <assq+45>: lea -0x54(%ebp),%edx
> 0x008c6440 <assq+48>: mov %edx,0x4(%eax)
> 0x008c6443 <assq+51>: lea 0xc(%ebp),%eax
> 0x008c6446 <assq+54>: mov %eax,-0x4c(%ebp)
> 0x008c6449 <assq+57>: lea -0x1c(%ebp),%eax
> 0x008c644c <assq+60>: mov %eax,-0x48(%ebp)
> 0x008c644f <assq+63>: lea -0x24(%ebp),%eax
> 0x008c6452 <assq+66>: mov %eax,-0x44(%ebp)
> 0x008c6455 <assq+69>: lea -0x20(%ebp),%eax
> 0x008c6458 <assq+72>: mov %eax,-0x40(%ebp)
> 0x008c645b <assq+75>: movl $0x0,-0x1c(%ebp)
> 0x008c6462 <assq+82>: movl $0x0,-0x20(%ebp)
> 0x008c6469 <assq+89>: movl $0x0,-0x24(%ebp)
> 0x008c6470 <assq+96>: mov 0xc(%ebp),%eax
> 0x008c6473 <assq+99>: mov 0x4(%eax),%eax
> 0x008c6476 <assq+102>: mov %eax,-0x24(%ebp)
> 0x008c6479 <assq+105>: mov %eax,-0x20(%ebp)
> 0x008c647c <assq+108>: nopl 0x0(%eax)
>
> Note the NOP, which aligns the gcc-generated loop start. The JIT
> doesn't try to align loop starts, because past experiments suggested
> that it doesn't matter. That's worth another look.
>
> _________________________________________________
> For list-related administrative tasks:
> http://lists.racket-lang.org/listinfo/dev
>