[SciPy-dev] Some Q's vis-a-vis Numpy unicode support
David Goldsmith
d_l_goldsmith@yahoo....
Wed Aug 12 10:23:19 CDT 2009
--- On Wed, 8/12/09, josef.pktd@gmail.com <josef.pktd@gmail.com> wrote:
> David Goldsmith<d_l_goldsmith@yahoo.com>
> wrote:
> > Actually, since you seem so into it ;-)
> This was just a refresher; I struggled much more the first time I
> tried to use non-English filenames and files.
>
> > can you write me a little script (just 'cause it seems like you
> > could do it faster) to print all the unicode characters u such
> > that u == u.capitalize()?
>
> u == u.capitalize() that's for most of them,
> The webpage lists 89,674 unicode characters and I didn't want to try
> all of them.
Thanks again. I guess I have a "narrow" Python build (must be the MSI default), 'cause I had to restrict maxcode to 16**4; doing so gives umanycap.size = 946. "u == u.capitalize() that's for most of them," indeed!
DG
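
A rough sketch of the same scan with plain Python 3 strings instead of np.chararray (an assumption made so the example runs without NumPy or a Python 2 interpreter). sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF otherwise, so it can stand in for a hand-picked limit such as 16**4. The helper name chars_changed_by_capitalize is made up for illustration, and the count it reports need not match the 946 quoted above, since case-mapping tables differ between interpreter versions.

```python
import sys


def chars_changed_by_capitalize(start=0, stop=None):
    """Return the characters in [start, stop) that capitalize() changes."""
    if stop is None:
        # Highest code point this interpreter supports, plus one.
        stop = sys.maxunicode + 1
    changed = []
    for code in range(start, stop):
        if 0xD800 <= code <= 0xDFFF:
            continue  # surrogate code points are not characters
        ch = chr(code)
        if ch != ch.capitalize():
            changed.append(ch)
    return changed


if __name__ == "__main__":
    bmp = chars_changed_by_capitalize(0, 16**4)
    print("highest code point on this build: %#x" % sys.maxunicode)
    print("%d characters below 16**4 change under capitalize()" % len(bmp))
```

Passing an explicit stop keeps the scan short; leaving it out covers whatever range the interpreter supports.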
>
> Below are the unicode characters in the first 1000 for
> which u != u.capitalize()
>
> josef
>
> -----------------------------
>
> import numpy as np
>
> print unichr(30)
>
> maxcode = 1000  # I don't want to try 38000
> start = 0       # 38000 is boring
> umany = np.array([unichr(i) for i in xrange(start, start + maxcode)],
>                  '<U1').view(np.chararray)
>
> # True where capitalize() changes the character
> capmask = (umany != umany.capitalize())
> umanycap = umany[capmask]
>
> print umany
> print capmask
> print '%2.2f percent differ in capitalize' % (
>     np.sum(capmask) / float(len(umany)) * 100)
>
> for i in xrange(len(umanycap)):
>     try:
>         print umanycap[i],
>     except UnicodeError:
>         # the console encoding cannot represent this character
>         print "\n%r doesn't print" % umanycap[i]
> --------------
>
> >
> > DG
> >
> > --- On Tue, 8/11/09, josef.pktd@gmail.com <josef.pktd@gmail.com> wrote:
> >
> >> From: josef.pktd@gmail.com <josef.pktd@gmail.com>
> >> Subject: Re: [SciPy-dev] Some Q's vis-a-vis Numpy unicode support
> >> To: "SciPy Developers List" <scipy-dev@scipy.org>
> >> Date: Tuesday, August 11, 2009, 8:59 PM
> >> On Tue, Aug 11, 2009 at 11:40 PM, <josef.pktd@gmail.com> wrote:
> >> > On Tue, Aug 11, 2009 at 11:18 PM, <josef.pktd@gmail.com> wrote:
> >> >> On Tue, Aug 11, 2009 at 10:28 PM, David Goldsmith <d_l_goldsmith@yahoo.com> wrote:
> >> >>> Thanks, Josef. This may just be an artifact of working in a DOS
> >> >>> Terminal (but your example, though not printing the accented e,
> >> >>> did at least print something different for b vs. b.capitalize()),
> >> >>> or it may be because I don't know the right encoding to use, but
> >> >>> I tried your code w/ what I found on Wikipedia to be the unicode
> >> >>> for the Greek letter delta, namely, u'\x03b04', with both 'cp1252'
> >> >>> and 'iso8859-7' encoding (the latter being inferred from the same
> >> >>> Wikipedia article) and here's what I get:
> >> >>>
> >> >>> >>> b = np.array([u'\x03b04',u'\x03b04'],'<U1').view(np.chararray)
> >> >>> >>> print b.encode('cp1252')[0]
> >> >>> ♥
> >> >>> >>> print b.capitalize().encode('cp1252')[0]
> >> >>> ♥
> >> >>> >>> print b.encode('iso8859-7')[0]
> >> >>> ♥
> >> >>> >>> print b.capitalize().encode('iso8859-7')[0]
> >> >>> ♥
> >> >>>
> >> >>> i.e., no difference. If I'm doing something wrong, please let me
> >> >>> know; otherwise, for the purpose of documenting
> >> >>> chararray.capitalize() - which is my ultimate goal - is there any
> >> >>> rhyme or reason behind which unicode characters capitalize() works
> >> >>> on and which it doesn't?
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>> DG
> >> >>> --- On Tue, 8/11/09, josef.pktd@gmail.com <josef.pktd@gmail.com> wrote:
> >> >>>
> >> >>>> actually this works (in Idle)
> >> >>>>
> >> >>>> >>> b = np.array([u'\xe9',u'\xe9'],'<U1').view(np.chararray)
> >> >>>> >>> print b.encode('cp1252')[0]
> >> >>>> é
> >> >>>> >>> print b.capitalize().encode('cp1252')[0]
> >> >>>> É
> >> >>>> >>> print b[0].encode('cp1252')
> >> >>>> é
> >> >>>>
> >> >>>>
> >> >>>> this looks like a bug? or is it a known limitation that
> >> >>>> chararrays cannot be 0-d
> >> >>>>
> >> >>>> >>> b0 = np.array(u'\xe9','<U1').view(np.chararray)
> >> >>>> >>> print b0.encode('cp1252')
> >> >>>> Traceback (most recent call last):
> >> >>>>   File "<pyshell#47>", line 1, in <module>
> >> >>>>     print b0.encode('cp1252')
> >> >>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 217, in encode
> >> >>>>     return self._generalmethod('encode', broadcast(self, encoding, errors))
> >> >>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 162, in _generalmethod
> >> >>>>     newarr[:] = res
> >> >>>> ValueError: cannot slice a 0-d array
> >> >>>>
> >> >>>>
> >> >>>> >
> >> >>>> > Josef
> >> >>>> >
> >> >>>> >>>
> >> >>>> >>> Unless the answer is "No," my real question:
> >> >>>> >>>
> >> >>>> >>> 1) Does chararray.capitalize() capitalize non-Roman letters
> >> >>>> >>> that have different lower-case and upper-case forms (e.g.,
> >> >>>> >>> the Greek letters)? If "yes," are there any exceptions
> >> >>>> >>> (e.g., Russian letters)?
> >> >>
> >> >> I think yes; exceptions are languages for which no capital letters
> >> >> exist, e.g. Cantonese (Chinese)?
> >> >> http://www.isthisthingon.org/unicode/index.phtml?page=03&subpage=B&glyph=03B04
> >> >> ??? google search for 03B04,
> >> >>
> >> >>>> >>>
> >> >>>> >>> Thanks!
> >> >>>> >>>
> >> >>>> >>> DG
> >> >>>> >>>
> >> >>>> >>>
> >> >>
> >> >> I have problems finding the correct codes for the characters and
> >> >> usually need a word processor.
> >> >>
> >> >> To me it looks like your character is not a Greek delta
> >> >>
> >> >> >>> print u'\x03b04'
> >> >> b04
> >> >> >>> print u'\u03b04'
> >> >> ΰ4
> >> >> >>> print u'\u03b4'
> >> >> δ
> >> >>
> >> >> I don't know what it is since it doesn't render to anything
> >> >> meaningful
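
A small sketch, in Python 3 syntax (an assumption; the thread itself uses Python 2.5), of how the three escapes above are parsed. \x consumes exactly two hex digits and \u exactly four, so the first two literals contain extra trailing characters. A '<U1' dtype keeps only the first character of each string, which is mimicked here by slicing with [:1] rather than by calling NumPy. For the first literal that leaves U+0003, which both codecs encode as the single byte 0x03; a console using code page 437 typically draws that byte as ♥, which would explain the output quoted earlier, though the thread does not confirm it.

```python
candidates = {
    r"\x03b04": "\x03b04",  # \x takes exactly two hex digits
    r"\u03b04": "\u03b04",  # \u takes exactly four hex digits
    r"\u03b4": "\u03b4",    # the intended GREEK SMALL LETTER DELTA
}

for literal, text in candidates.items():
    first = text[:1]  # stands in for the truncation a '<U1' dtype applies
    print(literal, len(text), [hex(ord(c)) for c in text], hex(ord(first)))

# U+0003 lies in the ASCII range, so both codecs emit the same single byte.
assert "\x03".encode("cp1252") == b"\x03"
assert "\x03".encode("iso8859-7") == b"\x03"

# The real delta exists in the Greek code page but not in cp1252.
assert "\u03b4".encode("iso8859-7") == b"\xe4"
try:
    "\u03b4".encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 cannot encode U+03B4")
```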
> >> >>
> >> >> I managed to get the Greek delta through the html code for it δ
> >> >> from page:
> >> >> http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&hilite=003B4
> >> >>
> >> >>
> >> >> running this script:
> >> >>
> >> >>
> >> >> # -*- coding: utf-8 -*-
> >> >> import numpy as np
> >> >>
> >> >> sd = u'δ'
> >> >> print sd
> >> >>
> >> >> b = np.array([u'\u03b4',u'\u0394'],'<U1').view(np.chararray)
> >> >> print b[0]
> >> >> print repr(b[0])
> >> >> print b.capitalize()[0]
> >> >> print repr(b.capitalize()[0])
> >> >>
> >> >> ***********
> >> >> prints this in my Idle shell
> >> >> >>>
> >> >> δ
> >> >> δ
> >> >> u'\u03b4'
> >> >> Δ
> >> >> u'\u0394'
> >> >>
> >> >> delta is correctly capitalized
> >> >>
> >> >>
> >> >> Josef
> >> >>
> >> >
> >> >
> >> > trying without copying and pasting non-ASCII characters:
> >> > the page at
> >> > http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&glyph=003B4
> >> >
> >> > also has the utf8 code \xCE\xB4, everything looks ok starting
> >> > from this.
> >> >
> >> > Josef
> >> >
> >> > >>> '\xCE\xB4'.decode('utf8')
> >> > u'\u03b4'
> >> > >>> print '\xCE\xB4'.decode('utf8')
> >> > δ
> >> > >>> print '\xCE\xB4'.decode('utf8').capitalize()
> >> > Δ
> >> > >>> b = np.array(['\xCE\xB4'.decode('utf8'),'\xCE\xB4'.decode('utf8')],'<U1').view(np.chararray)
> >> > >>> b
> >> > chararray([u'\u03b4', u'\u03b4'],
> >> >       dtype='<U1')
> >> > >>> print b[0]
> >> > δ
> >> > >>> print b.capitalize()[0]
> >> > Δ
> >> >
> >>
> >> and for the fun of it,
> >> a Russian (Cyrillic) character that capitalizes
> >>
> >> >>> print '\xD0\xB9'.decode('utf8')
> >> й
> >> >>> print '\xD0\xB9'.decode('utf8').capitalize()
> >> Й
> >> >>> '\xD0\xB9'.decode('utf8')
> >> u'\u0439'
> >> >>> '\xD0\xB9'.decode('utf8').capitalize()
> >> u'\u0419'
> >>
> >>
> >> and a German letter that doesn't have a capitalized version
> >>
> >> >>> print '\xC3\x9F'.decode('utf8').capitalize()
> >> ß
> >> >>> print '\xC3\x9F'.decode('utf8')
> >> ß
> >> >>> '\xC3\x9F'.decode('utf8')
> >> u'\xdf'
> >> >>> '\xC3\x9F'.decode('utf8').capitalize()
> >> u'\xdf'
> >>
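
The three checks above can be repeated in one loop. This is a sketch in Python 3 syntax (an assumption; byte strings need the b prefix there, and the labels in the dictionary are made up for illustration). One difference to expect: recent Python 3 releases apply full Unicode case mappings, so capitalize() on ß returns a two-letter string instead of leaving it unchanged as in the Python 2.5 session.

```python
samples = {
    "Greek delta": b"\xCE\xB4",
    "Cyrillic short i": b"\xD0\xB9",
    "German sharp s": b"\xC3\x9F",
}

results = {}
for name, raw in samples.items():
    ch = raw.decode("utf8")  # UTF-8 bytes -> one-character string
    results[name] = (ch, ch.capitalize())
    print("%-17s %r -> %r" % (name, ch, ch.capitalize()))
```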
> >> and here's a nice picture of unicode 03B04
> >> http://www.cns11643.gov.tw/seeker/english/showfont.jsp?ucode=03B04
> >>
> >> and here are all unicode characters (although my browser doesn't
> >> display most of them)
> >> http://www.isthisthingon.org/unicode/allchars1.php
> >>
> >>
> >> I hope this helps,
> >>
> >> Josef
> >> _______________________________________________
> >> Scipy-dev mailing list
> >> Scipy-dev@scipy.org
> >> http://mail.scipy.org/mailman/listinfo/scipy-dev
> >>
> >
> >
> >