[racket] more liberal CSV parsing?

From: Neil Van Dyke (neil at neilvandyke.org)
Date: Sat Jul 31 05:22:04 EDT 2010

Shriram Krishnamurthi wrote at 07/30/2010 09:53 PM:
>   wmic process get commandline /format:csv
>
> on Windows 7.  You get lines like
>
> ROUBAIX,"C:\Program Files\Windows Sidebar\sidebar.exe" /autoRun,2940
> ROUBAIX,"C:\Users\sk\Local Settings\Apps\F.lux\flux.exe" /noshow,3048
>   

In this particular case, it looks like the quotes are intended to be 
part of the value of the field, and that the format that "wmic" is 
writing is not using any CSV quoting.  So you can just do:

    ((make-csv-reader (open-input-string "d1,d2,\"foo\" bar,d3")
                      '((quote-char . #f))))

    ;; ==> ("d1" "d2" "\"foo\" bar" "d3")

I think that's the parse they intend for their format, and is actually 
helpful for separately parsing the command line.  I suspect that this 
command line CSV field with the quotes in it came from a system call, 
and that "wmic" simply wrote that string verbatim, with a comma before 
and after.

Note that we could change the *default* CSV reader to handle this 
particular example, by having it fall back to putting the quotes back 
into the value if it sees junk after the end of what it thought was a 
CSV-quoted field.  But that would be a kludge that would fail in some 
other cases.  For the "csv" reusable library, I lean towards giving the 
programmer a heads-up that the format is really not something that the 
default CSV reader is likely to parse reliably.

> Mmph: it looks like the output may be just plain broken.  I see
> another line that looks like
>
> ROUBAIX,cthelper 49170 xterm :erase=^?:size=24,80,4064
>
> which looks like it has one field too many....
>   

Good catch.  I think that parsing this particular output with regexps or 
simpler string operations (to separate into fields at only the first and 
last commas) might give the best results, since it looks like Microsoft 
is not quoting or escaping their CSV fields at all in this case.
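
As a rough sketch of the first-and-last-comma split, assuming every line 
has exactly three logical fields as in the sample lines above (the name 
"split-wmic-line" is made up for illustration and is not part of the 
"csv" library):

```racket
#lang racket/base

;; Hypothetical helper: split LINE into three fields at only the first
;; and last commas, so commas inside the middle field are preserved.
;; Returns a list of three strings, or #f if LINE has fewer than two
;; commas.  Assumes any trailing CR/LF has already been stripped.
(define (split-wmic-line line)
  (cond
    ;; The first group stops at the first comma, the last group allows
    ;; no commas, and the greedy middle group takes everything between.
    [(regexp-match #rx"^([^,]*),(.*),([^,]*)$" line) => cdr]
    [else #f]))

(split-wmic-line "ROUBAIX,cthelper 49170 xterm :erase=^?:size=24,80,4064")
;; ==> ("ROUBAIX" "cthelper 49170 xterm :erase=^?:size=24,80" "4064")
```

This keeps the quotes in the command line field verbatim, the same as 
the "quote-char" example above, but it would need adjusting if the 
output ever has a different number of columns.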

-- 
http://www.neilvandyke.org/


Posted on the users mailing list.