[racket-dev] Optional equality predicate for assoc and member

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Sun Apr 24 10:44:02 EDT 2011

I've pushed the change that makes `assoc', `assv', and `assq'
implemented in Racket, including the new optional argument for `assoc'.

At Fri, 22 Apr 2011 19:52:34 -0400, Eli Barzilay wrote:
> 7 minutes ago, Matthew Flatt wrote:
> > 
> > Switching to a Racket implementation costs time, but (with the JIT)
> > it's not a huge difference --- similar to the cost of switching the C
> > code from 32-bit to 64-bit mode on my machine. The JIT even manages to
> > make `assq' slightly faster than C in 64-bit mode.
> 
> Is there some obvious reason for the huge difference in improvement
> between the 32 and the 64 bits (Almost twice slower and roughly the
> same resp.)?

Like Noel, I thought at first it might be memory-bound. Now I think
it's just that the compiler on my machine (or the way the compiler is
configured) handles x86_64 not as well as x86.

> > The cost for a non-JIT platform is why I hadn't changed `assq',
> > `assv', and `assoc' before. But a plain-Racket implementation is
> > surely better in the long run, and we can keep the C code and bind
> > functions conditionally when the JIT is disabled. Maybe we can
> > eventually close the gap between the JIT-generated code and C. Also,
> > I think the JIT could be smarter about `equal?', which could tilt
> > the balance in favor of the JIT for `assoc'.
> 
> (Maybe it'll be feasible to dump jit-generated machine code at some
> point...?)

Just so you can see the generated code? I think Sam's tool is on the
right track. Otherwise, I'm not sure what you're suggesting.


> > I'm happy that the compiler is doing its job, but inlining
> > optimizations are fragile, so I'm a little reluctant to rely on
> > them.  Sad but true: the macro approach feels more comfortable for
> > now.
> 
> If by "macro approach" you refer to what I was talking about, then
> isn't that needed to have versions with inlined comparators?

Every time I look into changing `map' or other functions into macros, I
conclude that it would be better to implement cross-module inlining.


At Fri, 22 Apr 2011 19:27:49 -0500, Robby Findler wrote:
> What are the non-JIT platforms nowadays?

Sparc and ARM, probably. I expect Sparc to fade away, and I imagine
adding ARM support to the JIT, eventually.

Currently, the Racket implementation of `assq', etc., is used always.
I've thought about ways to fall back to the C implementation when the
JIT is unavailable, but I haven't settled on anything yet.



Below are revised results for the performance test. The old
C-implemented `assq' is now `kernel:assq', and the Racket-implemented
`assq2' is now `assq'.

The improvements are from adjusting the Racket code slightly and
improving the JIT in various small ways. There's more room for
improvement, obviously, and I stopped only because I've had enough for
now.

For those who are *really* interested, I've included annotated,
disassembled JIT output at the end of this message. The JIT-generated
core loop and the gcc-generated core loop are very similar; there's
four times as much JIT-generated code because it unrolls the loop to
four copies. (Both versions test two elements of the list in each
iteration, because the "turtle" to detect cycles is incremented only with
the second test.)

----------------------------------------
32-bit run:

'kern:assq
cpu time: 23 real time: 23 gc time: 0
cpu time: 38 real time: 39 gc time: 0
cpu time: 39 real time: 38 gc time: 0
'assq
cpu time: 38 real time: 39 gc time: 0
cpu time: 60 real time: 61 gc time: 0
cpu time: 61 real time: 61 gc time: 0

'kern:assoc
cpu time: 60 real time: 60 gc time: 0
cpu time: 563 real time: 567 gc time: 0
cpu time: 554 real time: 555 gc time: 0
'assoc
cpu time: 74 real time: 74 gc time: 0
cpu time: 589 real time: 592 gc time: 0
cpu time: 582 real time: 584 gc time: 0

----------------------------------------
64-bit run:

'kern:assq
cpu time: 36 real time: 36 gc time: 0
cpu time: 117 real time: 117 gc time: 0
cpu time: 118 real time: 118 gc time: 0
'assq
cpu time: 39 real time: 39 gc time: 0
cpu time: 66 real time: 65 gc time: 0
cpu time: 65 real time: 65 gc time: 0

'kern:assoc
cpu time: 81 real time: 81 gc time: 0
cpu time: 860 real time: 860 gc time: 0
cpu time: 798 real time: 798 gc time: 0
'assoc
cpu time: 83 real time: 83 gc time: 0
cpu time: 782 real time: 789 gc time: 0
cpu time: 738 real time: 739 gc time: 0


----------------------------------------

The main `assq' loop.
Arguments: list, turtle; environment: value to find, original list.

0x0676707a:	mov    0xc(%ebx),%eax [get list off stack]
0x0676707d:	test   $0x1,%al
0x06767080:	jne    0x676756b      [to fail: fixnum instead of list]
0x06767086:	movswl (%eax),%ecx
0x06767089:	cmp    $0x34,%ecx
0x0676708f:	jne    0x676756b      [to fail: not a pair]
0x06767095:	mov    0x4(%eax),%eax [car => item]
0x06767098:	mov    %eax,-0x4(%ebx) [save item on stack]
0x0676709b:	add    $0xfffffffc,%ebx [adjust stack pointer]
0x0676709e:	test   $0x1,%al
0x067670a1:	jne    0x676752a      [to fail: item is fixnum]
0x067670a7:	movswl (%eax),%ecx
0x067670aa:	cmp    $0x34,%ecx
0x067670b0:	jne    0x676752a      [to fail: item is not a pair]
0x067670b6:	mov    0x4(%eax),%eax [car of item]
0x067670b9:	mov    %eax,%ecx
0x067670bb:	mov    0x8(%ebx),%eax [get key to find off stack]
0x067670be:	cmp    %ecx,%eax      [found key?]
0x067670c0:	jne    0x67670cd        [jump if not...]
0x067670c6:	mov    (%ebx),%eax    [yes... get item from stack]
0x067670c8:	jmp    0x6767525        [and return item]
0x067670cd:	mov    0x10(%ebx),%eax [no... get list from stack]
0x067670d0:	mov    0x8(%eax),%eax  [cdr of list]
0x067670d3:	mov    %eax,-0x4(%ebx) [save new list to stack]
0x067670d6:	add    $0xfffffffc,%ebx [adjust stack pointer]
0x067670d9:	mov    0x18(%ebx),%ecx [get turtle from stack]
0x067670dc:	cmp    %ecx,%eax       [(eq? turtle list)]
0x067670de:	jne    0x6767111         [normally jumps...]
0x067670e4:	mov    $0x67675c0,%eax [fail: found cycle...]
0x067670e9:	mov    (%eax),%eax       [... call error function ...]
0x067670eb:	mov    %eax,-0xc(%ebx)
0x067670ee:	mov    $0x67675c4,%eax
0x067670f3:	mov    (%eax),%eax
0x067670f5:	mov    %eax,-0x8(%ebx)
0x067670f8:	mov    0x10(%ebx),%eax
0x067670fb:	mov    %eax,-0x4(%ebx)
0x067670fe:	mov    $0x34540,%esi
0x06767103:	add    $0xfffffff4,%ebx
0x06767106:	mov    %ebx,0x44c(%edi)
0x0676710c:	jmp    0x376c90
0x06767111:	mov    (%ebx),%eax      [get list from stack... (redundant!)]
0x06767113:	test   $0x1,%al          [... starts another iteration of above]
0x06767116:	jne    0x67674e0         [... and there are 4 copies total]
0x0676711c:	movswl (%eax),%ecx
0x0676711f:	cmp    $0x34,%ecx
0x06767125:	jne    0x67674e0
0x0676712b:	mov    0x4(%eax),%eax
0x0676712e:	mov    %eax,-0x4(%ebx)
0x06767131:	add    $0xfffffffc,%ebx
0x06767134:	test   $0x1,%al
0x06767137:	jne    0x676749f
0x0676713d:	movswl (%eax),%ecx
0x06767140:	cmp    $0x34,%ecx
0x06767146:	jne    0x676749f
0x0676714c:	mov    0x4(%eax),%eax
0x0676714f:	mov    %eax,%ecx
0x06767151:	mov    0x10(%ebx),%eax
0x06767154:	cmp    %ecx,%eax
0x06767156:	jne    0x6767163
0x0676715c:	mov    (%ebx),%eax
0x0676715e:	jmp    0x676749a
0x06767163:	mov    0x1c(%ebx),%eax
0x06767166:	mov    0x8(%eax),%eax
0x06767169:	mov    %eax,-0x4(%ebx)
0x0676716c:	mov    0x4(%ebx),%eax
0x0676716f:	mov    0x8(%eax),%eax
0x06767172:	mov    %eax,-0x8(%ebx)
0x06767175:	add    $0xfffffff8,%ebx
0x06767178:	mov    0x4(%ebx),%ecx
0x0676717b:	cmp    %ecx,%eax
0x0676717d:	jne    0x67671b0
0x06767183:	mov    $0x67675c8,%eax
0x06767188:	mov    (%eax),%eax
0x0676718a:	mov    %eax,-0xc(%ebx)
0x0676718d:	mov    $0x67675cc,%eax
0x06767192:	mov    (%eax),%eax
0x06767194:	mov    %eax,-0x8(%ebx)
0x06767197:	mov    0x1c(%ebx),%eax
0x0676719a:	mov    %eax,-0x4(%ebx)
0x0676719d:	mov    $0x34540,%esi
0x067671a2:	add    $0xfffffff4,%ebx
0x067671a5:	mov    %ebx,0x44c(%edi)
0x067671ab:	jmp    0x376c90
0x067671b0:	mov    (%ebx),%eax       [get list from stack... (redundant!)]
0x067671b2:	test   $0x1,%al
0x067671b5:	jne    0x6767455
0x067671bb:	movswl (%eax),%ecx
0x067671be:	cmp    $0x34,%ecx
0x067671c4:	jne    0x6767455
0x067671ca:	mov    0x4(%eax),%eax
0x067671cd:	mov    %eax,-0x4(%ebx)
0x067671d0:	add    $0xfffffffc,%ebx
0x067671d3:	test   $0x1,%al
0x067671d6:	jne    0x6767414
0x067671dc:	movswl (%eax),%ecx
0x067671df:	cmp    $0x34,%ecx
0x067671e5:	jne    0x6767414
0x067671eb:	mov    0x4(%eax),%eax
0x067671ee:	mov    %eax,%ecx
0x067671f0:	mov    0x1c(%ebx),%eax
0x067671f3:	cmp    %ecx,%eax
0x067671f5:	jne    0x6767202
0x067671fb:	mov    (%ebx),%eax
0x067671fd:	jmp    0x676740f
0x06767202:	mov    0x4(%ebx),%eax
0x06767205:	mov    0x8(%eax),%eax
0x06767208:	mov    %eax,-0x4(%ebx)
0x0676720b:	add    $0xfffffffc,%ebx
0x0676720e:	mov    0xc(%ebx),%ecx
0x06767211:	cmp    %ecx,%eax
0x06767213:	jne    0x6767246
0x06767219:	mov    $0x67675d0,%eax
0x0676721e:	mov    (%eax),%eax
0x06767220:	mov    %eax,-0xc(%ebx)
0x06767223:	mov    $0x67675d4,%eax
0x06767228:	mov    (%eax),%eax
0x0676722a:	mov    %eax,-0x8(%ebx)
0x0676722d:	mov    0x24(%ebx),%eax
0x06767230:	mov    %eax,-0x4(%ebx)
0x06767233:	mov    $0x34540,%esi
0x06767238:	add    $0xfffffff4,%ebx
0x0676723b:	mov    %ebx,0x44c(%edi)
0x06767241:	jmp    0x376c90
0x06767246:	mov    (%ebx),%eax      [get list from stack... (redundant!)]
0x06767248:	test   $0x1,%al
0x0676724b:	jne    0x67673ca
0x06767251:	movswl (%eax),%ecx
0x06767254:	cmp    $0x34,%ecx
0x0676725a:	jne    0x67673ca
0x06767260:	mov    0x4(%eax),%eax
0x06767263:	mov    %eax,-0x4(%ebx)
0x06767266:	add    $0xfffffffc,%ebx
0x06767269:	test   $0x1,%al
0x0676726c:	jne    0x6767389
0x06767272:	movswl (%eax),%ecx
0x06767275:	cmp    $0x34,%ecx
0x0676727b:	jne    0x6767389
0x06767281:	mov    0x4(%eax),%eax
0x06767284:	mov    %eax,%ecx
0x06767286:	mov    0x24(%ebx),%eax
0x06767289:	cmp    %ecx,%eax
0x0676728b:	jne    0x6767298
0x06767291:	mov    (%ebx),%eax
0x06767293:	jmp    0x6767384
0x06767298:	mov    0x10(%ebx),%eax
0x0676729b:	mov    0x8(%eax),%eax
0x0676729e:	mov    %eax,-0x4(%ebx)
0x067672a1:	mov    0x4(%ebx),%eax
0x067672a4:	mov    0x8(%eax),%eax
0x067672a7:	mov    %eax,-0x8(%ebx)
0x067672aa:	add    $0xfffffff8,%ebx
0x067672ad:	mov    0x4(%ebx),%ecx
0x067672b0:	cmp    %ecx,%eax
0x067672b2:	jne    0x67672e5
0x067672b8:	mov    $0x67675d8,%eax
0x067672bd:	mov    (%eax),%eax
0x067672bf:	mov    %eax,-0xc(%ebx)
0x067672c2:	mov    $0x67675dc,%eax
0x067672c7:	mov    (%eax),%eax
0x067672c9:	mov    %eax,-0x8(%ebx)
0x067672cc:	mov    0x30(%ebx),%eax
0x067672cf:	mov    %eax,-0x4(%ebx)
0x067672d2:	mov    $0x34540,%esi
0x067672d7:	add    $0xfffffff4,%ebx
0x067672da:	mov    %ebx,0x44c(%edi)
0x067672e0:	jmp    0x376c90
0x067672e5:	mov    (%ebx),%eax      [get list from stack... (redundant!)]
0x067672e7:	mov    %eax,-0x8(%ebx)  [... save it(!) to prep recursive call]
0x067672ea:	mov    0x4(%ebx),%eax   [get turtle from stack]
0x067672ed:	add    $0xfffffff8,%ebx [adjust stack pointer]
0x067672f0:	lea    0x64(%edi),%edx  [check whether other threads needs to run...]
0x067672f3:	mov    (%edx),%edx
0x067672f5:	cmp    $0x0,%edx
0x067672fb:	jle    0x6767312        [to slow: for thread swap]
0x067672fd:	mov    -0x1c(%ebp),%edx [argv = end of orig argv (tail call)]
0x06767300:	add    $0xffffffec,%edx  [... minus orig argc for args in-place]
0x06767303:	mov    %eax,0x10(%edx)  [set turtle argument for recur]
0x06767306:	mov    (%ebx),%eax      [load list back]
0x06767308:	mov    %eax,0xc(%edx)   [... and set list argument for recur]
0x0676730b:	mov    %edx,%ebx        [set Racket stack to argv]
0x0676730d:	jmp    0x676707a        [recur]

----------------------------------------
For comparsion, the gcc-compiled `assq' function's loop:

0x008c6480 <assq+112>:	mov    -0x20(%ebp),%edx   [get l off stack]
0x008c6483 <assq+115>:	test   $0x1,%dl
0x008c6486 <assq+118>:	jne    0x8c6523 <assq+275> [to fail: l is fixnum]
0x008c648c <assq+124>:	cmpw   $0x34,(%edx)
0x008c6490 <assq+128>:	jne    0x8c6523 <assq+275> [to fail: not a pair]
0x008c6496 <assq+134>:	mov    0x4(%edx),%ecx      [item = car of l]
0x008c6499 <assq+137>:	mov    %ecx,-0x1c(%ebp)    [save item on stack]
0x008c649c <assq+140>:	test   $0x1,%cl
0x008c649f <assq+143>:	jne    0x8c6588 <assq+376> [to fail: item is fixnum]
0x008c64a5 <assq+149>:	cmpw   $0x34,(%ecx)
0x008c64a9 <assq+153>:	jne    0x8c6588 <assq+376> [to fail: item is not pair]
0x008c64af <assq+159>:	mov    0xc(%ebp),%esi      [get x arg off stack]
0x008c64b2 <assq+162>:	mov    (%esi),%eax         [found x?]
0x008c64b4 <assq+164>:	cmp    0x4(%ecx),%eax
0x008c64b7 <assq+167>:	je     0x8c6648 <scheme_get_thread_local_variables>
0x008c64bd <assq+173>:	mov    0x8(%edx),%edx      [no, so get cdr of l]
0x008c64c0 <assq+176>:	mov    %edx,-0x20(%ebp)
0x008c64c3 <assq+179>:	test   $0x1,%dl
0x008c64c6 <assq+182>:	jne    0x8c6480 <assq+112>  [to fail: cdr is fixnum]
0x008c64c8 <assq+184>:	cmpw   $0x34,(%edx)
0x008c64cc <assq+188>:	jne    0x8c6480 <assq+112> [to fail: cdr is not a pair]
0x008c64ce <assq+190>:	mov    0x4(%edx),%ecx      [item = car l]
0x008c64d1 <assq+193>:	mov    %ecx,-0x1c(%ebp)    
0x008c64d4 <assq+196>:	test   $0x1,%cl
0x008c64d7 <assq+199>:	jne    0x8c6480 <assq+112> [fail: item is a fixnum]
0x008c64d9 <assq+201>:	cmpw   $0x34,(%ecx)
0x008c64dd <assq+205>:	jne    0x8c6480 <assq+112> [fail: item is not a pair]
0x008c64df <assq+207>:	mov    (%esi),%eax         [get x off stack]
0x008c64e1 <assq+209>:	cmp    0x4(%ecx),%eax      [car of item == x ?]
0x008c64e4 <assq+212>:	je     0x8c6648 <scheme_get_thread_local_variables>
0x008c64ea <assq+218>:	mov    0x8(%edx),%eax
0x008c64ed <assq+221>:	mov    %eax,-0x20(%ebp)
0x008c64f0 <assq+224>:	mov    -0x24(%ebp),%edx  [get turtle off stack]
0x008c64f3 <assq+227>:	cmp    %edx,%eax         [is l == turtle?]
0x008c64f5 <assq+229>:	je     0x8c6523 <assq+275> [to fail: cycle]
0x008c64f7 <assq+231>:	mov    0x8(%edx),%eax    [cdr of turtle]
0x008c64fa <assq+234>:	mov    %eax,-0x24(%ebp)  [save new turtle to stack]
0x008c64fd <scheme_get_thread_local_variables+0>:	mov    (%edi),%eax
0x008c64ff <scheme_get_thread_local_variables+2>:	mov    %gs:0x48(,%eax,4),%eax
0x008c6507 <assq+247>:	mov    0x68(%eax),%eax
0x008c650a <assq+250>:	test   %eax,%eax           [other threads need to run?]
0x008c650c <assq+252>:	jg     0x8c6480 <assq+112> [to slow: other threads run]
0x008c6512 <assq+258>:	movl   $0x4,-0x50(%ebp)
0x008c6519 <assq+265>:	call   0x9e44a0 <scheme_out_of_fuel>
0x008c651e <assq+270>:	jmp    0x8c6480 <assq+112>  [recur]

----------------------------------------

The `assq' wrapper that jumps to the above loop. The argument
juggling here --- turning the arg sequence `x l' to `l l l x'
--- may be part of the slowdown.

0x0037e41c:	cmp    $0x2,%ecx        [arity check]
0x0037e422:	jne    0x372e80
0x0037e428:	mov    %eax,-0x4(%ebx)  [save closure (but no GC possible!)]
0x0037e42b:	add    $0xfffffffc,%ebx
0x0037e42e:	mov    0x4(%ebx),%eax   [get x arg]
0x0037e431:	mov    %eax,-0x10(%ebx) [prep x arg to `loop']
0x0037e434:	mov    0x8(%ebx),%eax   [get l arg]
0x0037e437:	mov    %eax,-0xc(%ebx)  [prep l arg to `loop']
0x0037e43a:	mov    %eax,-0x8(%ebx)  [prep orig-l arg to `loop']
0x0037e43d:	mov    %eax,-0x4(%ebx)  [prep turtle arg to `loop']
0x0037e440:	mov    $0x37e47c,%esi
0x0037e445:	mov    (%esi),%esi      [esi now has `loop' closure pointer]
0x0037e447:	add    $0xfffffff0,%ebx [update Racket stack register]
0x0037e44a:	mov    -0x1c(%ebp),%edx [get Racket stack base for tail calls]
0x0037e44d:	add    $0xfffffff0,%edx [4 argument => subtract 16]
0x0037e450:	mov    0xc(%ebx),%ecx   [move arg1 into place...]
0x0037e453:	mov    %ecx,0xc(%edx)
0x0037e456:	mov    0x8(%ebx),%ecx
0x0037e459:	mov    %ecx,0x8(%edx)
0x0037e45c:	mov    0x4(%ebx),%ecx
0x0037e45f:	mov    %ecx,0x4(%edx)
0x0037e462:	mov    (%ebx),%ecx
0x0037e464:	mov    %ecx,(%edx)
0x0037e466:	mov    %edx,%ebx        [runstack = argv]
0x0037e468:	mov    %esi,%eax
0x0037e46a:	mov    $0x4,%ecx        [argc = 4]
0x0037e46f:	mov    %ebx,%edx        [argv = runstack (redundant!)]
0x0037e471:	jmp    0x0676707a

----------------------------------------
Then again, here's a comparable section of the gcc-compiled `assq'
function's preamble that cooperates with the GC and sets up the loop.

0x008c641f <scheme_get_thread_local_variables+0>:	mov    0x176c06(%ebx),%edi
0x008c6425 <scheme_get_thread_local_variables+6>:	mov    (%edi),%eax
0x008c6427 <scheme_get_thread_local_variables+8>:	mov    %gs:0x48(,%eax,4),%edx
0x008c642f <assq+31>:	mov    0x4(%edx),%edx
0x008c6432 <assq+34>:	mov    %edx,-0x54(%ebp)
0x008c6435 <scheme_get_thread_local_variables+0>:	mov    %gs:0x48(,%eax,4),%eax
0x008c643d <assq+45>:	lea    -0x54(%ebp),%edx
0x008c6440 <assq+48>:	mov    %edx,0x4(%eax)
0x008c6443 <assq+51>:	lea    0xc(%ebp),%eax
0x008c6446 <assq+54>:	mov    %eax,-0x4c(%ebp)
0x008c6449 <assq+57>:	lea    -0x1c(%ebp),%eax
0x008c644c <assq+60>:	mov    %eax,-0x48(%ebp)
0x008c644f <assq+63>:	lea    -0x24(%ebp),%eax
0x008c6452 <assq+66>:	mov    %eax,-0x44(%ebp)
0x008c6455 <assq+69>:	lea    -0x20(%ebp),%eax
0x008c6458 <assq+72>:	mov    %eax,-0x40(%ebp)
0x008c645b <assq+75>:	movl   $0x0,-0x1c(%ebp)
0x008c6462 <assq+82>:	movl   $0x0,-0x20(%ebp)
0x008c6469 <assq+89>:	movl   $0x0,-0x24(%ebp)
0x008c6470 <assq+96>:	mov    0xc(%ebp),%eax
0x008c6473 <assq+99>:	mov    0x4(%eax),%eax
0x008c6476 <assq+102>:	mov    %eax,-0x24(%ebp)
0x008c6479 <assq+105>:	mov    %eax,-0x20(%ebp)
0x008c647c <assq+108>:	nopl   0x0(%eax)

Note the NOP, which aligns the gcc-generated loop start. The JIT
doesn't try to align loop starts, because past experiments suggested
that it doesn't matter. That's worth another look.



Posted on the dev mailing list.