[Numpy-discussion] `missing` argument in genfromtxt only a string?
Skipper Seabold
jsseabold@gmail....
Tue Sep 15 09:57:17 CDT 2009
On Tue, Sep 15, 2009 at 10:44 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
> On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey <bsouthey@gmail.com> wrote:
>> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<pgmdevlist@gmail.com> wrote:
>>>
>> [snip]
>>>> OK, I see the problem...
>>>> When no dtype is defined, we try to guess what a converter should
>>>> return by testing its inputs. At first we check whether the input is a
>>>> boolean, then whether it's an integer, then a float, and so on. When
>>>> you define explicitly a converter, there's no need for all those
>>>> checks, so we lock the converter to a particular state, which sets the
>>>> conversion function and the value to return in case of missing.
>>>> Except that I messed it up and it fails in that case (the conversion
>>>> function is set properly, bu the dtype of the output is still
>>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>>> kitten.
>>>>
>>> No worries. I really like genfromtxt (having recently gotten pretty
>>> familiar with it) and would like to help out with extending it towards
>>> these kind of cases if there's an interest and this is feasible.
>>>
>>> I tried another workaround for the dates with my converters defined as conv
>>>
>>> conv.update({date : lambda s : datetime(*map(int,
>>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>>
>>> Where `date` is the column that contains a date. The problem was that
>>> my dates are "mm/dd/yyyy" and datetime needs "yyyy,mm,dd," it worked
>>> for a test case if my dates were "dd/mm/yyyy" and I just use reversed,
>>> but gave an error about not finding the day in the third position,
>>> though that lambda function worked for a test case outside of
>>> genfromtxt.
>>>
>>>
>>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>>
>> In SAS there are multiple ways to define formats especially dates:
>> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>>
>> It would be nice to accept the common variants (USA vs English dates) as
>> well as two digit vs 4 digit year codes.
>>
>
> This is relevant to what I've been doing. I parsed a SAS input file
> to get the information to pass to genfromtxt, and it might be useful
> to have these types defined. Again, I'm wondering about whether the
> new datetime dtype might eventually be used for something like this.
>
> Do you know if SAS publishes the format of its datasets, similar to
> Stata? http://www.stata.com/help.cgi?dta
>
>>
>>
>>>> or even
>>>> simpler, define a dtype for the output (you know that your first
>>>> column is a str, your second an object, and the others ints or floats...
>>>>
>>>>
>> How do you specify different dtypes in genfromtxt?
>> I could not see the information in the docstring and the dtype argument
>> does not appear to allow multiple dtypes.
>>
>
> I have also been struggling with this (and modifying the dtype of
> field in structured array in place, btw). To give a quick example,
> here are some of the ways that I expected to work and didn't and a few
> ways that work.
>
> from StringIO import StringIO
> import numpy as np
>
> # a few incorrect ones
>
> s = StringIO("11.3abcde")
> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])
>
> In [42]: data
> Out[42]: array([ 1, 1, -1])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])
>
> In [45]: data
> Out[45]: array([ 1. , 1.3, NaN])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])
>
> In [48]: data
> Out[48]:
> array(['1', '1.3', 'abcde'],
> dtype='|S5')
>
> # correct few
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
> delimiter=[1,3,5])
>
> In [52]: data
> Out[52]:
> array((1, 1.3, 'abcde'),
> dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])
>
> In [55]: data
> Out[55]:
> array((1, 1.3, 'abcde'),
> dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])
>
> # one I expected to work but have probably made an obvious mistake
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [64]: data
> Out[64]: array([ 1, 1, -1])
>
> # "ugly" way to do this, but it works
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [69]: data
> Out[69]:
> array((1, 1.3, 'abcde'),
> dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
Btw, you don't have to pass it as a dtype. It just needs to be able to pass
if dtype is not None:
dtype = np.dtype(dtype)
I would like to see something like this, as it does when dtype is
None, but then we would have to have a type argument, maybe rather
than a dtype argument.
names = ['var1','var2','var3']
type = ['i', 'f', 'str']
dtype = zip(names,type)
if dtype is not None:
....
Again, while I'm on it...I noticed the argument to specify the
autostrip argument that can be provided to _iotools.LineSplitter is
always False. If this does, what I think (no time to test yet), it
might be nice to be able to specify this in genfromtxt.
Skipper
More information about the NumPy-Discussion
mailing list