[racket] handin-server hanging

From: Ryan Golbeck (rmgolbec at cs.ubc.ca)
Date: Mon Oct 25 15:48:08 EDT 2010

We're still having problems with this.  It seems to occur randomly,
but I just figured out a way to replicate the server's behaviour, and
maybe you have some comments on it.  I'm not sure whether something
like this is causing the actual problem we're having, but the
behaviour turns out to be exactly the same.

If I telnet to our handin-server and just leave the connection open,
sending no information at all, the handin-server hangs in exactly the
same way I described below: it's unresponsive to Ctrl+C, it denies
further connections from handin-clients, etc.

Once I kill this telnet connection manually, the server continues as normal.
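
The behaviour described above can be illustrated with a small toy model.
This is only a sketch in Python, not the handin-server's Racket code: it
assumes a server that reads from each accepted connection inline before
accepting the next one, which is a guess at the mechanism and not
something confirmed in the handin-server source.  The names
serve_sequentially, wait_for, and demo are hypothetical.

```python
import socket
import threading
import time


def serve_sequentially(listener, handled):
    """Toy server: reads from each connection inline before accepting the next."""
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:
            return  # listener was closed
        with conn:
            # Blocks until the peer sends data or closes the connection.
            handled.append(conn.recv(1024))


def wait_for(predicate, timeout):
    """Poll until predicate() is true or the timeout (in seconds) expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline and not predicate():
        time.sleep(0.01)
    return predicate()


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(5)
    port = listener.getsockname()[1]
    handled = []
    threading.Thread(
        target=serve_sequentially, args=(listener, handled), daemon=True
    ).start()

    # Like the telnet session: connect, then send nothing at all.
    idle = socket.create_connection(("127.0.0.1", port))
    time.sleep(0.2)

    # A well-behaved client connects afterwards and sends data right away.
    active = socket.create_connection(("127.0.0.1", port))
    active.sendall(b"hello")
    time.sleep(0.5)
    while_idle_open = list(handled)  # snapshot while the idle client is open

    # Closing the idle connection unblocks the toy server.
    idle.close()
    wait_for(lambda: len(handled) == 2, 2.0)
    after_idle_closed = list(handled)

    active.close()
    listener.close()
    return while_idle_open, after_idle_closed
```

Calling demo() should return an empty list for the first snapshot and
[b"", b"hello"] for the second: the second client is only served once
the silent connection goes away, which matches what I see when I kill
the telnet session.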

Reading the code, it seems that this connection is never watched by
the watcher, and so is never terminated after the session-timeout has
passed.  It looks like the server may be waiting for some initial data
to set up an SSL connection.  Is this correct?

Since this is a problem regardless of whether it causes our hangs,
because it leaves the server open to DoS attacks, what would be an
appropriate solution?
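
One general direction, sketched here in Python and not as a patch to the
handin-server: hand each accepted connection to its own thread and put a
deadline on the initial read, so a silent client is dropped instead of
stalling everyone else.  Whether this maps cleanly onto the
handin-server's SSL setup is an assumption on my part.  HANDSHAKE_TIMEOUT
(and its 1-second value), handle, serve, wait_for, and demo are all
hypothetical names chosen for the illustration.

```python
import socket
import threading
import time

HANDSHAKE_TIMEOUT = 1.0  # seconds; arbitrary value chosen for this demo


def handle(conn, events):
    """Read the first message, giving up if the peer stays silent too long."""
    with conn:
        conn.settimeout(HANDSHAKE_TIMEOUT)
        try:
            events.append(conn.recv(1024))
        except socket.timeout:
            events.append("timed out")  # silent client: drop the connection


def serve(listener, events):
    """Accept loop that never reads inline, so one client cannot stall it."""
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:
            return  # listener was closed
        threading.Thread(target=handle, args=(conn, events), daemon=True).start()


def wait_for(predicate, timeout):
    """Poll until predicate() is true or the timeout (in seconds) expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline and not predicate():
        time.sleep(0.01)
    return predicate()


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(5)
    port = listener.getsockname()[1]
    events = []
    threading.Thread(target=serve, args=(listener, events), daemon=True).start()

    idle = socket.create_connection(("127.0.0.1", port))  # sends nothing
    active = socket.create_connection(("127.0.0.1", port))
    active.sendall(b"hello")

    # The active client is served even though the idle one is still open.
    served = wait_for(lambda: b"hello" in events, 2.0)
    # The idle client is dropped once the read deadline passes.
    dropped = wait_for(lambda: "timed out" in events, 3.0)

    idle.close()
    active.close()
    listener.close()
    return served, dropped
```

Here demo() returns a pair of booleans: whether the active client was
served while the silent connection was still open, and whether the
silent connection was eventually timed out.  A thread per connection
still allows an attacker to open many connections, so a real fix would
probably also need a cap on simultaneous connections.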

Thanks,
-ryan

On Tue, Sep 28, 2010 at 4:40 PM, Eli Barzilay <eli at barzilay.org> wrote:
> Random points:
>
> * The "cleaning up" log message doesn't mean much -- it happens every
>  few minutes if there were any submissions.  Adding log lines in
>  random places (including at the end of a cleanup) is probably a good
>  idea to see where the problem is.  (And if it's a problem, it's
>  possible to disable the cleanup, but the directory structure will
>  become more complex -- I can give you the information if needed.)
>
> * If the process gets stuck this way, then it's in some deadlock,
>  probably related to the submission process.  To abort it more
>  conveniently, you can use Ctrl+\ on a linux terminal.
>
> * JFYI, the highest number of students that we've had here was about
>  120 -- so you're definitely pushing more limits.  The server should
>  generally be fine though -- it uses `run-server' which looks like
>  it's hard-wired to 5 simultaneous connections.  Still, it's probably
>  a good idea to keep an eye on the process and see that it's not
>  growing up to a thrashing point.
>
> * Especially relevant to look at the process when it gets stuck.  A
>  very convenient linux utility for this is `htop' -- see the memory
>  and the state of the process, and also hit "s" and it will connect
>  to the process with strace and show you the system calls it's doing
>  (if any).
>
> * Also, running it on NFS is probably not a great idea -- some NFSs
>  can have subtle behavior with many operations.  I think that it's
>  better to run it on a local filesystem, with a cron job to copy it
>  to the NFS for backup.
>
>
> 10 minutes ago, Ryan Golbeck wrote:
>> Has anyone had any problems with their handin-server hanging?
>>
>> Twice recently we've had our handin-server seemingly hang.  The last
>> log message on both hangs was that it was "Cleaning up all
>> submission directories".  I've added new log messages to determine
>> if it ever finishes cleaning the submission directories should this
>> happen again, but I haven't been able to reproduce the hang easily.
>> We have around 400 students in the class using the handin-server, so
>> the number of submissions is adding up quickly.  The server is
>> running on a host where the submission directories are served over
>> NFS, which does seem to be slow in some cases, and could be a source
>> of Racket hanging for some reason.
>>
>> The process wouldn't respond to a Ctrl+C interrupt on the terminal it
>> was running in while it was hung; I had to send it a signal using
>> kill to get it to stop.
>>
>> Anyone have any suggestions on where I should look for this?  I'll try
>> to post more information as soon as I get it.
>
> --
>          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
>                    http://barzilay.org/                   Maze is Life!
>
>


Posted on the users mailing list.