[Numpy-discussion] `missing` argument in genfromtxt only a string?
Bruce Southey
bsouthey@gmail....
Tue Sep 15 10:57:35 CDT 2009
On 09/15/2009 09:44 AM, Skipper Seabold wrote:
> On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey<bsouthey@gmail.com> wrote:
>
>> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>>
>>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<pgmdevlist@gmail.com> wrote:
>>>
>>>
>> [snip]
>>
>>>> OK, I see the problem...
>>>> When no dtype is defined, we try to guess what a converter should
>>>> return by testing its inputs. At first we check whether the input is a
>>>> boolean, then whether it's an integer, then a float, and so on. When
>>>> you define explicitly a converter, there's no need for all those
>>>> checks, so we lock the converter to a particular state, which sets the
>>>> conversion function and the value to return in case of missing.
>>>> Except that I messed it up and it fails in that case (the conversion
>>>> function is set properly, but the dtype of the output is still
>>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>>> kitten.
>>>>
>>>>
>>> No worries. I really like genfromtxt (having recently gotten pretty
>>> familiar with it) and would like to help out with extending it towards
>>> these kinds of cases if there's an interest and this is feasible.
>>>
>>> I tried another workaround for the dates with my converters defined as conv
>>>
>>> conv.update({date : lambda s : datetime(*map(int,
>>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>>
>>> Where `date` is the column that contains a date. The problem was that
>>> my dates are "mm/dd/yyyy" and datetime needs (yyyy, mm, dd). It worked
>>> for a test case where my dates were "dd/mm/yyyy" and I just used
>>> reversed, but with the lambda above genfromtxt gave an error about not
>>> finding the day in the third position, even though that lambda
>>> function worked for a test case outside of genfromtxt.
>>>
>>>
>>>
>>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>>
>>>>
>> In SAS there are multiple ways to define formats especially dates:
>> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>>
>> It would be nice to accept the common variants (USA vs English dates) as
>> well as two digit vs 4 digit year codes.
>>
>>
> This is relevant to what I've been doing. I parsed a SAS input file
> to get the information to pass to genfromtxt, and it might be useful
> to have these types defined. Again, I'm wondering about whether the
> new datetime dtype might eventually be used for something like this.
>
> Do you know if SAS publishes the format of its datasets, similar to
> Stata? http://www.stata.com/help.cgi?dta
>
I am not exactly sure what you mean. Most of the type formats are available
under the data set informat statement, but you really need to handle the
special ones when reading data, such as defining strings with sufficient
length, and times. Usually I read dates as strings and then convert them
to dates as needed, since they are not always correct and do not always
have the same format in the data.
SAS is rather complex as it has multiple ways to create what it calls
permanent datasets and these are even incompatible across OS's in the
same version. So really these are not very useful outside of the
specific version of SAS that is being used. There are many ways to
transfer files, such as the xport engine, whose output R can read (see
read.xport in the foreign package, which has a link to the format).
However, it is usually easier to just create a new file within SAS.
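
As a rough sketch of the "read dates as strings, then convert as needed" approach mentioned above: the helper name `parse_date`, the sample values, and the list of accepted formats are all made up for illustration, and the formats assume US-style month-first dates with either a 4-digit or a 2-digit year. Day-first data would need "%d/%m/%Y" instead.

```python
from datetime import datetime


def parse_date(text, formats=("%m/%d/%Y", "%m/%d/%y")):
    """Return a datetime for the first matching format, or None if none match.

    Hypothetical helper: the 4-digit-year format is tried before the
    2-digit-year one.
    """
    cleaned = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


# Dates kept as plain strings when the file is read, converted afterwards.
raw_dates = ["09/15/2009", " 9/14/09 ", "not a date"]
parsed = [parse_date(value) for value in raw_dates]
```

Values that cannot be parsed come back as None rather than raising, so bad entries can be inspected after the whole column has been read.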
>
>>
>>
>>>> or even
>>>> simpler, define a dtype for the output (you know that your first
>>>> column is a str, your second an object, and the others ints or floats...
>>>>
>>>>
>>>>
>> How do you specify different dtypes in genfromtxt?
>> I could not see the information in the docstring and the dtype argument
>> does not appear to allow multiple dtypes.
>>
>>
> I have also been struggling with this (and with modifying the dtype of
> a field in a structured array in place, btw). To give a quick example,
> here are some of the ways that I expected to work and didn't and a few
> ways that work.
>
> from StringIO import StringIO
> import numpy as np
>
> # a few incorrect ones
>
> s = StringIO("11.3abcde")
> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])
>
> In [42]: data
> Out[42]: array([ 1, 1, -1])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])
>
> In [45]: data
> Out[45]: array([ 1. , 1.3, NaN])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])
>
> In [48]: data
> Out[48]:
> array(['1', '1.3', 'abcde'],
> dtype='|S5')
>
> # correct few
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
> delimiter=[1,3,5])
>
> In [52]: data
> Out[52]:
> array((1, 1.3, 'abcde'),
> dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])
>
> In [55]: data
> Out[55]:
> array((1, 1.3, 'abcde'),
> dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])
>
> # one I expected to work but have probably made an obvious mistake
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [64]: data
> Out[64]: array([ 1, 1, -1])
>
> # "ugly" way to do this, but it works
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [69]: data
> Out[69]:
> array((1, 1.3, 'abcde'),
> dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
>
> Skipper
>
Thanks for these examples; they make sense now. I was confused
because the display shows the dtype as a list, not as a single dtype.
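
For the record, here is a minimal sketch of why the list form is still a single dtype, written for Python 3 (`io.StringIO` instead of the `StringIO` module used above). As far as I can tell, `np.dtype` takes only one dtype specification, and further positional arguments are read as `align` and `copy`, which would explain why `np.dtype(int, float, str)` behaved like plain `int`. The variable names are only illustrative.

```python
from io import StringIO

import numpy as np

# A list of (name, format) tuples describes ONE structured dtype with
# three fields, not three separate dtypes.
record = np.dtype([("myint", "i8"), ("myfloat", "f8"), ("mystring", "S5")])
field_names = record.names
n_fields = len(record.fields)

# Fixed-width columns of 1, 3 and 5 characters, as in the examples above.
data = np.genfromtxt(StringIO("11.3abcde"), dtype=record, delimiter=[1, 3, 5])

# With a single input line the result is a 0-d structured array,
# so each field can be converted to a plain Python scalar.
myint = int(data["myint"])
myfloat = float(data["myfloat"])
```

Each field is then available by name, e.g. `data["myfloat"]`.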
Bruce