[racket-dev] pr 12683 and using something like text:nbsp->space?

From: Eli Barzilay (eli at barzilay.org)
Date: Thu Apr 19 15:48:23 EDT 2012

An hour ago, Danny Yoo wrote:
> On Thu, Apr 12, 2012 at 5:26 PM, Robby Findler
> <robby at eecs.northwestern.edu> wrote:
> > Yes, normalization doesn't deal with those spaces. It does change
> > the text in ways that are unfriendly and I often tell DrRacket
> > "no" when it asks about normalization. I just wanted to put that
> > into the mix for this conversation, since it is a place that has
> > to deal with similar issues.
> 
> I propose to backtrack my current patch, and instead do the
> following:
> 
> ---
> 
> * Add a set of choices in the editor Preferences pane, with the
>   following options:
> 
>     Treatment of Unicode zero-width characters (such as zero-width spaces):
> 
>     1. Preserve them.
>     2. When introduced, prompt a dialog choice to delete them.
>     3. Automatically delete them.
> 
> with the default preference to be option 2.

I see some problems here that need to be addressed.

The first problem is the definition of "zero-width characters": some
of these are not problematic -- for example, #\u05B0 is something that
gets added to a letter so it doesn't have its own width.  OTOH, there
are many other sources of confusion that are not at all related to
width, like #\u0392 which is usually even displayed using the same "B"
character so there's no visual difference.
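
To make the distinction concrete, here is a minimal sketch in Python (not Racket, and not part of any DrRacket patch) that uses the standard `unicodedata` module to look up the characters mentioned above. The helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, combining class, name) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),    # e.g. "Cf" = format, "Mn" = non-spacing mark
        unicodedata.combining(ch),   # non-zero for combining marks
        unicodedata.name(ch, "<unnamed>"),
    )

# Zero-width space, the Hebrew point from the text, Greek capital beta, Latin B.
for ch in ["\u200b", "\u05b0", "\u0392", "B"]:
    print(describe(ch))
```

The zero-width space and the Hebrew point both take up no width of their own, yet they fall in different categories, while the Greek beta is an ordinary uppercase letter that is simply a different code point from the Latin "B". A width-based test therefore flags the harmless combining mark and misses the look-alike letter.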

The second problem is the third option offering to just delete them.
Since I view a "proper" solution as something that can deal with all
of these problems, plain deletion is obviously not always the right
solution.

The third problem is something that I already mentioned: even if both
of the above points are addressed, what if I choose #3 because it
seems like an easy way to avoid such problems, and later I get bitten
when I paste some text with an intention of keeping these things in?
This can't be dismissed by saying that only a few people would run
into these things -- those are exactly the people who are likely to
suffer these results.  (IOW, if I deal
with weird texts, I'm likely to get nagged a lot and choose #3, and
I'm also likely to want these things in strings.)

So I think that this should be revised as follows:

1. Drop the whole "zero-width" framing, and instead just use something
   that indicates "potentially confusing".  (I'm surprised that this thread
   keeps focusing on just zero-width spaces.)

2. Change #2 to some form of "normalization".  (That's a bad term
   since it has a specific sense, but I'm sure that there's some term
   somewhere for these kinds of changes.)

3. Remove option #3.

Alternatively: add a display mode that "spells out" all of the fishy
characters, as done in Emacs when you open a file in literal mode.
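
A rough sketch of what such a "spell out" display could compute, again in Python rather than Racket. The function name `spell_out` and the `<U+XXXX NAME>` output format are made up for illustration, and treating every non-ASCII character as fishy is a simplifying assumption; a real list of confusing characters would be narrower.

```python
import unicodedata

def spell_out(text):
    """Replace each non-ASCII character with a visible <U+XXXX NAME> tag.

    Assumption: "fishy" is approximated as "non-ASCII", which is far too
    broad for real use but keeps the sketch short.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            name = unicodedata.name(ch, "UNNAMED")
            out.append(f"<U+{ord(ch):04X} {name}>")
    return "".join(out)

print(spell_out("a\u200bb"))    # zero-width space between two letters
print(spell_out("\u0392eta"))   # Greek capital beta posing as "B"
```

Nothing is deleted or rewritten in the buffer; only the rendering changes, so text that was pasted with the intention of keeping these characters survives intact.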


> * Collect the set of zero-width characters.  Zero-width spaces, of
> course, but also see what other Unicode characters exhibit similar
> weird behavior.

(I completely agree with this -- the list of these things will grow;
it just shouldn't be restricted to zero-width-ness.)

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!

Posted on the dev mailing list.