[racket] Randomized bisimulation tests and Metropolis-Hastings random walk

From: Jeremiah Willcock (jewillco at osl.iu.edu)
Date: Fri Jan 4 20:50:20 EST 2013

On Fri, 4 Jan 2013, Neil Toronto wrote:

> I expect lookups and in-place updates with the most frequency. The other 
> operation I need to be fast is head or tail insertion (it doesn't matter 
> which), but not nearly as much as the other two. Have I picked the right data 
> structure?

How long are the lists going to be?  What about a double-ended queue 
(deque), which doesn't appear to be in Racket or Planet, or even something 
similar to the two-stack structure used to simulate Turing machine tapes 
on pushdown automata with two stacks (i.e., have two growable vectors, one 
for the first part of the list backwards, and a second for the second part 
forward)?

-- Jeremiah Willcock

>
> On 01/04/2013 06:39 PM, Jeremiah Willcock wrote:
>> How long are these lists going to be, and how many insertions/deletions
>> will occur relative to the number of element lookups and in-place
>> updates? Would it make sense to consider other random-access data
>> structures (in particular, growable vectors) instead of lists?
>> 
>> -- Jeremiah Willcock
>> 
>> On Fri, 4 Jan 2013, Neil Toronto wrote:
>> 
>>> My doctoral research is to beat the crap out of STAN, WinBUGS, etc.,
>>> etc., by leveraging information you can only get by writing a compiler
>>> for a language with a formal definition.
>>> 
>>> There's no Bayesian DSL that can answer queries with arbitrary
>>> conditions like "X^2+Y^2 = 1.2". MCMC fails spectacularly. In fact,
>>> you can't even reason about those using Bayes' law for densities. But
>>> I've got a backend implementation based on a measure-theoretic
>>> semantics that can sample from distributions with conditions like
>>> that. :D
>>> 
>>> I needed a Typed Racket implementation of random-access lists to make
>>> it scale well for models with more than 20 random variables or so.
>>> 
>>> Neil ⊥
>>> 
>>> On 01/04/2013 05:50 PM, Ray Racine wrote:
>>>> MCMC with Gibbs Sampling and MH for straightforward stuff is
>>>> straightforward, but the subtitles of underflow, use log space or not
>>>> etc are something you guys know quite a bit more about than I do.
>>>> 
>>>> FWIW, a few months ago I was doing some custom Racket coded for
>>>> straightforward Gibbs (mainly) and MH in one case for customer related
>>>> data analysis, but the water gets deep quickly beyond the
>>>> straightforward, and one quickly questions the validity of their custom
>>>> sampler against  considerations of a proven MCMC library.  I did a brief
>>>> amount research on the current state of things with regards to WinBugs,
>>>> OpenBugs, Jags, ... After a quick overview, I sort of narrowed things
>>>> down to a relatively new arrival, STAN. https://code.google.com/p/stan/
>>>> 
>>>> It's a new rewrite, based on a new sampling algorithm variation
>>>> where existent libraries are getting long in the tooth.  It generates
>>>> the sampling code rather then interprets BUGS DSL code and claims to
>>>> take some pains to support FFI binding (R of course) and embedding.
>>>> 
>>>> Just wanted to mention STAN if you haven't run across it yet,  as
>>>> opposed to well know standby MCMC libs.
>>>> 
>>>> One day ... an FFI Racket <-> STAN would be very cool.
>>>> 
>>>> 
>>>> On Jan 4, 2013 4:47 PM, "Neil Toronto" <neil.toronto at gmail.com
>>>> <mailto:neil.toronto at gmail.com>> wrote:
>>>>
>>>>     I get excited about applying statistics to programming. Here's
>>>>     something exciting I found today.
>>>>
>>>>     I'm working on a Typed Racket implementation of Chris Okasaki's
>>>>     purely functional random-access lists, which are O(1) for `cons',
>>>>     `first' and `rest', and basically O(log(n)) for random access. I
>>>>     wanted solid randomized tests, and I figured the best way would be
>>>>     bisimulation: do exactly the same, random thing to a (Listof
>>>>     Integer) and an (RAList Integer), then ensure the lists are the
>>>>     same. Iterate N times, each time using the altered lists for the
>>>>     next bisimulation step.
>>>>
>>>>     There's a problem with this, though: the test runtimes are all over
>>>>     the place for any fixed N, and are especially wild with large N.
>>>>     Sometimes it first does a bunch of `cons' operations in a row.
>>>>     Because each bisimulation step is O(n) in the list length (to check
>>>>     equality), this makes the remaining steps very slow. Sometimes it
>>>>     follows up with a bunch of `rest' operations, which makes the
>>>>     remaining steps very fast.
>>>>
>>>>     More precisely, random bisimulation does a "simple random walk" over
>>>>     test list lengths, where `cons' is a step from n to n+1 and `rest'
>>>>     is a step from n to n-1. Unfortunately, the n's stepped on in a
>>>>     simple random walk have no fixed probability distribution, and thus
>>>>     no fixed average or variance. That means each bisimulation step has
>>>>     no fixed average runtime or fixed variation in runtime. Wildness
>>>> ensues.
>>>>
>>>>     One possible solution is to generate fresh test lists from a fixed
>>>>     distribution, for each bisimulation step. This is a bad solution,
>>>>     though, because I want to see whether RAList instances behave
>>>>     correctly *over time* as they're operated on.
>>>>
>>>>     The right solution is to use a Metropolis-Hastings random walk
>>>>     instead of a simple random walk, to sample list lengths from a fixed
>>>>     distribution. Then, the list lengths will have a fixed average,
>>>>     meaning that each bisimulation step will have a fixed average
>>>>     runtime, and my overall average test runtime will be O(N) instead of
>>>>     wild and crazy.
>>>>
>>>>     First, I choose a distribution. The geometric family is good because
>>>>     list lengths are nonnegative. Geometric with p = 0.05 means list
>>>>     lengths should be (1-p)/p = 19 on average, but are empty with
>>>>     probability 0.05 and very large with very low probability.
>>>>
>>>>     So I add these two lines:
>>>>
>>>>        (require math/distributions)
>>>>
>>>>        (define length-dist (geometric-dist 0.05))
>>>>
>>>>     In a Metropolis-Hastings random walk, you step from n to n' with
>>>>     probability min(1,P(n')/P(n)); this is called "accepting" the step.
>>>>     Otherwise, you "reject" the step by staying in the same spot. Adding
>>>>     this new test takes only two defines and a `cond':
>>>>
>>>>        (define r (random 5))  ; Randomly choose an operation
>>>>        (cond
>>>>          [(= r 0)
>>>>           ;; New! Only cons with probability P(n+1)/P(n)
>>>>           (define n (length lst))
>>>>           (define accept (/ (pdf length-dist (+ n 1))
>>>>                             (pdf length-dist n)))
>>>>           (cond [((random) . < . accept)
>>>>                  .... Accept: cons and ensure continued equality ....]
>>>>                 [else
>>>>                  .... Reject: return the same lists ....])]
>>>>          [(= r 1)
>>>>           .... Accept: `rest' each list and ensure continued equality
>>>> ....]
>>>>          .... Other random operations ....)
>>>>
>>>>     I left off the acceptance test for `rest' because P(n-1) > P(n) ==>
>>>>     P(n-1)/P(n) > 1, so the test would always pass. (This is because the
>>>>     geometric distribution's pdf is monotone. If I had chosen a Poisson
>>>>     distribution, which has a bump in the middle, I'd have to do the
>>>>     acceptance test for both `cons' and `rest'.)
>>>>
>>>>     I instrument the test loop to record list lengths, verify that their
>>>>     mean is about 19, and do some density plots vs. the geometric
>>>>     distribution. It all looks good: the randomized bisimulation test
>>>>     does indeed operate on lists whose length is distributed
>>>>     Geometric(0.05), so the overall average test runtime is O(N), and it
>>>>     still tests them as they're operated on over time.
>>>>
>>>>     Neil ⊥
>>>>     ____________________
>>>>       Racket Users list:
>>>>     http://lists.racket-lang.org/__users
>>>>     <http://lists.racket-lang.org/users>
>>>> 
>>> 
>>> ____________________
>>> Racket Users list:
>>> http://lists.racket-lang.org/users
>>> 
>

Posted on the users mailing list.