[SciPy-dev] Some Q's vis-a-vis Numpy unicode support

josef.pktd@gmai... josef.pktd@gmai...
Tue Aug 11 22:59:59 CDT 2009


On Tue, Aug 11, 2009 at 11:40 PM, <josef.pktd@gmail.com> wrote:
> On Tue, Aug 11, 2009 at 11:18 PM, <josef.pktd@gmail.com> wrote:
>> On Tue, Aug 11, 2009 at 10:28 PM, David
>> Goldsmith<d_l_goldsmith@yahoo.com> wrote:
>>> Thanks, Josef.  This may just be an artifact of working in a DOS Terminal (but your example, though not printing the accented e, did at least print something different for b vs. b.capitalize()), or it may be because I don't know the right encoding to use, but I tried your code w/ what I found on Wikipedia to be the unicode for the Greek letter delta, namely, u'\x03b04', with both 'cp1252' and 'iso8859-7' encoding (the latter being inferred from the same Wikipedia article) and here's what I get:
>>>
>>> >>> b = np.array([u'\x03b04',u'\x03b04'],'<U1').view(np.chararray)
>>> >>> print b.encode('cp1252')[0]
>>>
>>> >>> print b.capitalize().encode('cp1252')[0]
>>>
>>> >>> print b.encode('iso8859-7')[0]
>>>
>>> >>> print b.capitalize().encode('iso8859-7')[0]
>>>
>>> >>>
>>> i.e., no difference.  If I'm doing something wrong, please let me know; otherwise, for the purpose of documenting chararray.capitalize() - which is my ultimate goal - is there any rhyme or reason behind which unicode characters capitalize() works on and which it doesn't?
>>>
>>> Thanks,
>>>
>>> DG
>>> --- On Tue, 8/11/09, josef.pktd@gmail.com <josef.pktd@gmail.com> wrote:
>>>
>>>> actually this works (in Idle)
>>>>
>>>> >>> b = np.array([u'\xe9',u'\xe9'],'<U1').view(np.chararray)
>>>> >>> print b.encode('cp1252')[0]
>>>> é
>>>> >>> print b.capitalize().encode('cp1252')[0]
>>>> É
>>>> >>> print b[0].encode('cp1252')
>>>> é
>>>>
>>>>
>>>> This looks like a bug. Or is it a known limitation that chararrays
>>>> cannot be 0-d?
>>>>
>>>> >>> b0 = np.array(u'\xe9','<U1').view(np.chararray)
>>>> >>> print b0.encode('cp1252')
>>>> Traceback (most recent call last):
>>>>   File "<pyshell#47>", line 1, in <module>
>>>>     print b0.encode('cp1252')
>>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 217, in encode
>>>>     return self._generalmethod('encode', broadcast(self, encoding, errors))
>>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 162, in _generalmethod
>>>>     newarr[:] = res
>>>> ValueError: cannot slice a 0-d array
>>>>
>>>>
>>>> >
>>>> > Josef
>>>> >
>>>> >>>
>>>> >>> Unless the answer is "No," my real question:
>>>> >>>
>>>> >>> 1) Does chararray.capitalize() capitalize non-Roman letters
>>>> >>> that have different lower-case and upper-case forms (e.g.,
>>>> >>> the Greek letters)?  If "yes," are there any exceptions
>>>> >>> (e.g., Russian letters)?
>>
>> I think yes. The exceptions would be languages whose scripts have no
>> capital letters, e.g. Cantonese (Chinese)?
>> http://www.isthisthingon.org/unicode/index.phtml?page=03&subpage=B&glyph=03B04
>>  ??? google search for 03B04,
>>
>>>> >>>
>>>> >>> Thanks!
>>>> >>>
>>>> >>> DG
>>>> >>>
>>>> >>>
>>
>> I have problems finding the correct codes for the characters and
>> usually need a word processor.
>>
>> To me it looks like your character is not a Greek delta:
>>
>> >>> print u'\x03b04'
>>  b04
>> >>> print u'\u03b04'
>> ΰ4
>> >>> print u'\u03b4'
>> δ
>>
>> I don't know what it is, since it doesn't render as anything meaningful.
>>
>> I managed to get the Greek delta through its HTML code, &#948;, from this page:
>> http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&hilite=003B4
>>
>>
>> running this script:
>>
>>
>> # -*- coding: utf-8 -*-
>> import numpy as np
>>
>> sd = u'δ'
>> print sd
>>
>> b = np.array([u'\u03b4',u'\u0394'],'<U1').view(np.chararray)
>> print b[0]
>> print repr(b[0])
>> print b.capitalize()[0]
>> print repr(b.capitalize()[0])
>>
>> ***********
>> prints this in my IDLE shell:
>> >>>
>> δ
>> δ
>> u'\u03b4'
>> Δ
>> u'\u0394'
>>
>> delta is correctly capitalized
>>
>>
>> Josef
>>
>
>
> Trying again without copying and pasting non-ASCII characters:
> the page at
> http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&glyph=003B4
>
> also has the UTF-8 code \xCE\xB4, and everything looks OK starting from this.
>
> Josef
>
> >>> '\xCE\xB4'.decode('utf8')
> u'\u03b4'
> >>> print '\xCE\xB4'.decode('utf8')
> δ
> >>> print '\xCE\xB4'.decode('utf8').capitalize()
> Δ
> >>> b = np.array(['\xCE\xB4'.decode('utf8'),'\xCE\xB4'.decode('utf8')],'<U1').view(np.chararray)
> >>> b
> chararray([u'\u03b4', u'\u03b4'],
>      dtype='<U1')
> >>> print b[0]
> δ
> >>> print b.capitalize()[0]
> Δ
>

And for the fun of it,
a Russian (Cyrillic) character that capitalizes:

>>> print '\xD0\xB9'.decode('utf8')
й
>>> print '\xD0\xB9'.decode('utf8').capitalize()
Й
>>> '\xD0\xB9'.decode('utf8')
u'\u0439'
>>> '\xD0\xB9'.decode('utf8').capitalize()
u'\u0419'
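
For anyone repeating the Greek and Cyrillic checks under Python 3, here is a minimal sketch. It is not the code used in this thread, which is Python 2. It assumes NumPy is installed and uses np.char.capitalize, the function form of the chararray method, comparing its output with plain str.capitalize applied element by element.

```python
import numpy as np

# In Python 3 every str is Unicode, so the u'' prefix is unnecessary;
# UTF-8 bytes still decode to the same code points as in the thread.
delta = b'\xce\xb4'.decode('utf8')      # U+03B4 GREEK SMALL LETTER DELTA
short_i = b'\xd0\xb9'.decode('utf8')    # U+0439 CYRILLIC SMALL LETTER SHORT I

arr = np.array([delta, short_i], dtype='<U1')
capitalized = np.char.capitalize(arr)

# Compare against str.capitalize applied to each element by hand.
expected = [s.capitalize() for s in arr.tolist()]
print(capitalized.tolist() == expected)

# Print code points rather than glyphs so the result does not depend
# on the terminal's encoding (the DOS terminal problem above).
print([hex(ord(ch)) for ch in capitalized.tolist()])
```

Printing code points instead of the characters themselves sidesteps the console encoding issue that made the earlier output hard to interpret.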


And a German letter that doesn't have a capitalized version:

>>> print '\xC3\x9F'.decode('utf8').capitalize()
ß
>>> print '\xC3\x9F'.decode('utf8')
ß
>>> '\xC3\x9F'.decode('utf8')
u'\xdf'
>>> '\xC3\x9F'.decode('utf8').capitalize()
u'\xdf'
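
Note that the ß result above is specific to the Python 2 session shown. As a hedged sketch of how the same check behaves under Python 3 (an assumption about the reader's interpreter, not something tested in this thread): Python 3 strings apply Unicode full case mappings, so upper-casing ß expands it to two letters instead of leaving it unchanged.

```python
import unicodedata

sharp_s = b'\xc3\x9f'.decode('utf8')    # U+00DF, same bytes as above

print(unicodedata.name(sharp_s))

# Full case mapping: the uppercase form is the two-letter string 'SS'.
upper_form = sharp_s.upper()
print(repr(upper_form), len(upper_form))

# capitalize() also expands the letter; the exact spelling ('Ss' or
# 'SS') depends on the Python 3 minor version, so only the length is
# relied on here.
capitalized_form = sharp_s.capitalize()
print(repr(capitalized_form), len(capitalized_form))

# Unicode also defines a separate capital sharp s, U+1E9E, whose
# lowercase is U+00DF; the reverse mapping is not applied by upper().
capital_sharp_s = '\u1e9e'
print(capital_sharp_s.lower() == sharp_s)
```

Because the result can grow from one character to two, storing it back into a fixed-width '<U1' array would not keep the whole string.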

And here's a nice picture of Unicode 03B04:
http://www.cns11643.gov.tw/seeker/english/showfont.jsp?ucode=03B04
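
To spell out why the three escapes tried earlier in the thread give three different strings, here is a small Python 3 sketch (the variable names are only for illustration). The \x escape consumes exactly two hex digits and \u consumes exactly four; any digits after that are ordinary characters.

```python
import unicodedata

x_escape = '\x03b04'        # \x03 followed by the literal text 'b04'
u_escape_long = '\u03b04'   # \u03b0 followed by the literal digit '4'
u_escape = '\u03b4'         # a single character, the Greek delta
cjk = chr(0x3B04)           # code point 3B04, the character pictured above

for label, text in [('x_escape', x_escape),
                    ('u_escape_long', u_escape_long),
                    ('u_escape', u_escape),
                    ('cjk', cjk)]:
    # Show length and code points instead of glyphs, so the output is
    # readable on any terminal.
    print(label, len(text), ['U+%04X' % ord(ch) for ch in text])

print(unicodedata.name(u_escape_long[0]))
print(unicodedata.name(u_escape))
print(unicodedata.name(cjk))
```

With a '<U1' dtype only the first character of each string is kept, which is why the array built from u'\x03b04' held a control character rather than a letter with an uppercase form.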

And here are all Unicode characters (although my browser doesn't
display most of them):
http://www.isthisthingon.org/unicode/allchars1.php


I hope this helps,

Josef

