[SciPy-dev] Some Q's vis-a-vis Numpy unicode support

josef.pktd@gmai... josef.pktd@gmai...
Tue Aug 11 23:18:58 CDT 2009


On Tue, Aug 11, 2009 at 11:59 PM, <josef.pktd@gmail.com> wrote:
> On Tue, Aug 11, 2009 at 11:40 PM, <josef.pktd@gmail.com> wrote:
>> On Tue, Aug 11, 2009 at 11:18 PM, <josef.pktd@gmail.com> wrote:
>>> On Tue, Aug 11, 2009 at 10:28 PM, David
>>> Goldsmith<d_l_goldsmith@yahoo.com> wrote:
>>>> Thanks, Josef.  This may just be an artifact of working in a DOS Terminal (but your example, though not printing the accented e, did at least print something different for b vs. b.capitalize()), or it may be because I don't know the right encoding to use, but I tried your code w/ what I found on Wikipedia to be the unicode for the Greek letter delta, namely, u'\x03b04', with both 'cp1252' and 'iso8859-7' encoding (the latter being inferred from the same Wikipedia article) and here's what I get:
>>>>
>>>>>>> b = np.array([u'\x03b04',u'\x03b04'],'<U1').view(np.chararray)
>>>>>>> print b.encode('cp1252')[0]
>>>>>>>>>>> print b.capitalize().encode('cp1252')[0]
>>>>>>>>>>> print b.encode('iso8859-7')[0]
>>>>>>>>>>> print b.capitalize().encode('iso8859-7')[0]
>>>>>>>>
>>>> i.e., no difference.  If I'm doing something wrong, please let me know; otherwise, for the purpose of documenting chararray.capitalize() - which is my ultimate goal - is there any rhyme or reason behind which unicode characters capitalize() works on and which it doesn't?
>>>>
>>>> Thanks,
>>>>
>>>> DG
>>>> --- On Tue, 8/11/09, josef.pktd@gmail.com <josef.pktd@gmail.com> wrote:
>>>>
>>>>> actually this works (in Idle)
>>>>>
>>>>> >>> b = np.array([u'\xe9',u'\xe9'],'<U1').view(np.chararray)
>>>>> >>> print b.encode('cp1252')[0]
>>>>> é
>>>>> >>> print b.capitalize().encode('cp1252')[0]
>>>>> É
>>>>> >>> print b[0].encode('cp1252')
>>>>> é
>>>>>
>>>>>
>>>>> This looks like a bug. Or is it a known limitation that chararrays
>>>>> cannot be 0-d?
>>>>>
>>>>> >>> b0 = np.array(u'\xe9','<U1').view(np.chararray)
>>>>> >>> print b0.encode('cp1252')
>>>>> Traceback (most recent call last):
>>>>>   File "<pyshell#47>", line 1, in <module>
>>>>>     print b0.encode('cp1252')
>>>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 217, in encode
>>>>>     return self._generalmethod('encode', broadcast(self, encoding, errors))
>>>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 162, in _generalmethod
>>>>>     newarr[:] = res
>>>>> ValueError: cannot slice a 0-d array
>>>>>
>>>>>
>>>>> >
>>>>> > Josef
>>>>> >
>>>>> >>>
>>>>> >>> Unless the answer is "No," my real question:
>>>>> >>>
>>>>> >>> 1) Does chararray.capitalize() capitalize non-Roman letters
>>>>> >>> that have different lower-case and upper-case forms (e.g.,
>>>>> >>> the Greek letters)?  If "yes," are there any exceptions
>>>>> >>> (e.g., Russian letters)?
>>>
>>> I think yes. The exceptions are languages whose scripts have no capital
>>> letters, e.g. Cantonese (Chinese)?
>>> http://www.isthisthingon.org/unicode/index.phtml?page=03&subpage=B&glyph=03B04
>>>  ??? google search for 03B04,
>>>
>>>>> >>>
>>>>> >>> Thanks!
>>>>> >>>
>>>>> >>> DG
>>>>> >>>
>>>>> >>>
>>>
>>> I have problems finding the correct codes for the characters and
>>> usually need a word processor.
>>>
>>> To me it looks like your character is not a Greek delta:
>>>
>>> >>> print u'\x03b04'
>>>  b04
>>> >>> print u'\u03b04'
>>> ΰ4
>>> >>> print u'\u03b4'
>>> δ
>>>
>>> I don't know what it is since it doesn't render to anything meaningful
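
For anyone retracing this on Python 3, here is a small sketch of how the three spellings above are parsed. It assumes Python 3 string literals (no u prefix needed), and the variable names are only illustrative. The \x escape consumes exactly two hex digits and \u exactly four, so any extra digits are left over as ordinary characters. Code point 03B04 itself would be spelled '\u3b04', which falls in the CJK ideograph range rather than the Greek block.

```python
import unicodedata

wrong_x = '\x03b04'  # U+0003 (a control character) followed by 'b', '0', '4'
wrong_u = '\u03b04'  # U+03B0 followed by the digit '4'
delta = '\u03b4'     # U+03B4, the intended Greek letter
cjk = '\u3b04'       # U+3B04, what a lookup of "03B04" turns up

# Show the length and the code points of each string.
for label, s in [('wrong_x', wrong_x), ('wrong_u', wrong_u),
                 ('delta', delta), ('cjk', cjk)]:
    print(label, len(s), [hex(ord(c)) for c in s])

# Ask the Unicode database what the single-character strings are.
print(unicodedata.name(delta))
print(unicodedata.name(cjk))
```

Only the third spelling is a one-character string holding the delta, which matches the '<U1' dtype used for the array.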
>>>
>>> I managed to get the Greek delta through its HTML code, &#948;, from this page:
>>> http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&hilite=003B4
>>>
>>>
>>> running this script:
>>>
>>>
>>> # -*- coding: utf-8 -*-
>>>
>>> sd = u'δ'
>>> print sd
>>>
>>> b = np.array([u'\u03b4',u'\u0394'],'<U1').view(np.chararray)
>>> print b[0]
>>> print repr(b[0])
>>> print b.capitalize()[0]
>>> print repr(b.capitalize()[0])
>>>
>>> ***********
>>> prints this in my Idle shell
>>>>>>
>>> δ
>>> δ
>>> u'\u03b4'
>>> Δ
>>> u'\u0394'
>>>
>>> delta is correctly capitalized
>>>
>>>
>>> Josef
>>>
>>
>>
>> Trying again without copying and pasting non-ASCII characters: the page at
>> http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&glyph=003B4
>>
>> also gives the UTF-8 encoding \xCE\xB4, and everything looks OK starting from that.
>>
>> Josef
>>
>> >>> '\xCE\xB4'.decode('utf8')
>> u'\u03b4'
>> >>> print '\xCE\xB4'.decode('utf8')
>> δ
>> >>> print '\xCE\xB4'.decode('utf8').capitalize()
>> Δ
>> >>> b = np.array(['\xCE\xB4'.decode('utf8'),'\xCE\xB4'.decode('utf8')],'<U1').view(np.chararray)
>> >>> b
>> chararray([u'\u03b4', u'\u03b4'],
>>      dtype='<U1')
>> >>> print b[0]
>> δ
>> >>> print b.capitalize()[0]
>> Δ
>>
>
> and for the fun of it,
> a Russian (Cyrillic) character that capitalizes:
>
> >>> print '\xD0\xB9'.decode('utf8')
> й
> >>> print '\xD0\xB9'.decode('utf8').capitalize()
> Й
> >>> '\xD0\xB9'.decode('utf8')
> u'\u0439'
> >>> '\xD0\xB9'.decode('utf8').capitalize()
> u'\u0419'
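
The same two checks can be sketched for Python 3, where str is already Unicode and decode() lives on bytes objects. This is only an illustrative port of the sessions above, with variable names chosen for the example.

```python
# UTF-8 byte sequences taken from the sessions above.
greek = b'\xce\xb4'.decode('utf-8')     # Greek small delta
cyrillic = b'\xd0\xb9'.decode('utf-8')  # Cyrillic small short i

for ch in (greek, cyrillic):
    # ascii() shows the escaped form, much like repr() did for u'' strings.
    print(ch, ch.capitalize(), ascii(ch), ascii(ch.capitalize()))
```

Both letters have an upper-case form in the Unicode database, so capitalize() maps them without any encoding needing to be named.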
>
>
> and a German letter that doesn't have a capitalized version:
>
> >>> print '\xC3\x9F'.decode('utf8').capitalize()
> ß
> >>> print '\xC3\x9F'.decode('utf8')
> ß
> >>> '\xC3\x9F'.decode('utf8')
> u'\xdf'
> >>> '\xC3\x9F'.decode('utf8').capitalize()
> u'\xdf'
>
> and here's a nice picture of unicode 03B04
> http://www.cns11643.gov.tw/seeker/english/showfont.jsp?ucode=03B04
>
> and here are all unicode characters (although my browser doesn't
> display most of them)
> http://www.isthisthingon.org/unicode/allchars1.php
>
>
> I hope this helps,
>
> Josef
>

and then there is also

>>> b = np.array([u'\u03b4\u03b4', u'\u03b4\u03b4'],'<U2').view(np.chararray)
>>> print b.capitalize()
[u'\u0394\u03b4' u'\u0394\u03b4']
>>> print b.capitalize()[0]
Δδ
>>> print b.upper()[0]
ΔΔ
>>> print b.upper().lower()[0]
δδ
>>> print b.title()[0]
Δδ
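
A rough Python 3 equivalent of the session above, as a sketch only: it assumes a NumPy version that provides the free functions in np.char (capitalize, upper, lower, title) and uses them on a plain unicode array instead of a chararray view. The variable names are made up for the example.

```python
import numpy as np

# Two-character strings of Greek small delta; NumPy infers dtype '<U2'.
b = np.array(['\u03b4\u03b4', '\u03b4\u03b4'])

cap = np.char.capitalize(b)      # first character upper-cased only
up = np.char.upper(b)            # every character upper-cased
round_trip = np.char.lower(up)   # back to the original strings
titled = np.char.title(b)        # same as capitalize for a single word

print(cap[0], up[0], round_trip[0], titled[0])
```

Each function applies the corresponding str method element by element, so the results should agree with calling the method on the individual strings.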

that's enough fun for the night

Josef

