[plt-scheme] Problems fetching page

From: Jens Axel Søgaard (jensaxel at soegaard.net)
Date: Sat Apr 24 11:53:14 EDT 2004

I want to extract some information (the list of news items) from the
page <http://www.drv.dk>. Unfortunately the page is redirected, and
the functions GET-PURE-PORT and GET-IMPURE-PORT have trouble with it.

The first problem is that I'm not told about the redirection:

   > (copy-port (get-impure-port (string->url "http://www.drv.dk/"))
                (current-output-port))
   HTTP/1.1 500 Server Error
   Server: Microsoft-IIS/5.0
   Date: Sat, 24 Apr 2004 15:43:14 GMT
   Content-Type: text/html
   Content-Length: 102

   <html><head><title>Error</title></head><body>The system cannot find the file specified.
   </body></html>
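
One way to see what the server actually sends back is to look at the status line and headers separately from the body. Below is a minimal sketch of that idea, assuming that PURIFY-PORT from url.ss returns the header block as a string and that the string escapes \r and \n are available. The names header-status, header-location and get-port/follow are hypothetical helpers, not part of the library. In the transcript above the server answered with 500 rather than a 3xx status, so this would only report that status; it does not make the fetch succeed.

```scheme
(require (lib "url.ss" "net"))

;; Hypothetical helper: numeric status code from a raw header string,
;; e.g. "HTTP/1.1 302 Found\r\n..." gives 302; #f if there is no status line.
(define (header-status headers)
  (let ((m (regexp-match "^HTTP/[0-9.]+ ([0-9]+)" headers)))
    (and m (string->number (cadr m)))))

;; Hypothetical helper: value of the Location header, or #f if absent.
;; Only the spellings "Location" and "location" are recognized here.
(define (header-location headers)
  (let ((m (regexp-match "\n[Ll]ocation: *([^\r\n]*)" headers)))
    (and m (cadr m))))

;; Hypothetical helper: fetch url-string and follow at most max-hops
;; redirects. Returns an input port positioned at the body of the last
;; response; the headers have already been consumed by purify-port.
(define (get-port/follow url-string max-hops)
  (let* ((url (string->url url-string))
         (port (get-impure-port url))
         (headers (purify-port port))
         (status (header-status headers))
         (location (header-location headers)))
    (if (and status (<= 300 status 399) location (> max-hops 0))
        (begin
          (close-input-port port)
          ;; Location may be relative, so resolve it against the current URL.
          (get-port/follow (url->string (combine-url/relative url location))
                           (- max-hops 1)))
        port)))
```

With these definitions, (get-port/follow "http://www.drv.dk/" 5) would return a port on whatever page the server finally serves, and header-status can be applied to the result of PURIFY-PORT to check for a redirection by hand.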

Using wget I found out that the proper address is <http://www.drv.dk/default_frontpage.aspx?siteid=1>.
The second problem is that GET-IMPURE-PORT behaves the same with
the new address:

    > (copy-port (get-impure-port (string->url "http://www.drv.dk/default_frontpage.aspx?siteid=1"))
                 (current-output-port))
    HTTP/1.1 500 Server Error
    Server: Microsoft-IIS/5.0
    Date: Sat, 24 Apr 2004 15:35:04 GMT
    Content-Type: text/html
    Content-Length: 102

    <html><head><title>Error</title></head><body>The system cannot find the file specified.
    </body></html>

Just to be sure the address is correct I fetched it again using wget:

   > (require (lib "process.ss" "mzlib"))
   > (system "c:/cygwin/bin/wget \"http://www.drv.dk/default_frontpage.aspx?siteid=1\"")
   --17:37:02--  http://www.drv.dk/default_frontpage.aspx?siteid=1
              => `default_frontpage.aspx@siteid=1.5'
   Resolving www.drv.dk... 213.150.32.111
   Connecting to www.drv.dk[213.150.32.111]:80... connected.
   HTTP request sent, awaiting response... 200 OK
   Length: 40,271 [text/html]
     0K .......... .......... .......... .........            100%   58.44 KB/s

   17:37:05 (58.44 KB/s) - `default_frontpage.aspx@siteid=1.5' saved [40271/40271]

   #t

The .5 in the final filename is because this is the fifth time I have fetched it.


And it has the proper contents too:

 > (call-with-input-file "default_frontpage.aspx@siteid=1.5"
     (lambda (port)
       (copy-port port (current-output-port))))

     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
     <HTML><HEAD>
	<TITLE>Det Radikale Venstre</TITLE>
	<LINK REL="STYLESHEET" HREF="style.css">
	<SCRIPT LANGUAGE="JavaScript1.2" SRC="hiermenu/HM_Loader.js"
                 TYPE="text/javascript"></SCRIPT>
	<SCRIPT LANGUAGE="JavaScript" TYPE="text/javascript">
     ...
     [rest of front page deleted]


Is there a way to persuade GET-PURE-PORT and GET-IMPURE-PORT to fetch the page?

Note: The above was on a Windows XP machine with IE 6.0.


-- 
Jens Axel Søgaard



Posted on the users mailing list.