[Numpy-discussion] `missing` argument in genfromtxt only a string?

Bruce Southey bsouthey@gmail....
Tue Sep 15 10:57:35 CDT 2009


On 09/15/2009 09:44 AM, Skipper Seabold wrote:
> On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey<bsouthey@gmail.com>  wrote:
>    
>> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>>      
>>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<pgmdevlist@gmail.com>    wrote:
>>>
>>>        
>> [snip]
>>      
>>>> OK, I see the problem...
>>>> When no dtype is defined, we try to guess what a converter should
>>>> return by testing its inputs. At first we check whether the input is a
>>>> boolean, then whether it's an integer, then a float, and so on. When
>>>> you define explicitly a converter, there's no need for all those
>>>> checks, so we lock the converter to a particular state, which sets the
>>>> conversion function and the value to return in case of missing.
>>>> Except that I messed it up and it fails in that case (the conversion
>>>> function is set properly, but the dtype of the output is still
>>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>>> kitten.
>>>>
>>>>          
>>> No worries.  I really like genfromtxt (having recently gotten pretty
>>> familiar with it) and would like to help out with extending it towards
>>> these kinds of cases if there's an interest and this is feasible.
>>>
>>> I tried another workaround for the dates with my converters defined as conv
>>>
>>> conv.update({date : lambda s : datetime(*map(int,
>>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>>
>>> Where `date` is the column that contains a date.  The problem was that
>>> my dates are "mm/dd/yyyy" and datetime needs (yyyy, mm, dd).  It worked
>>> for a test case where my dates were "dd/mm/yyyy" and I just used reversed,
>>> but here it gave an error about not finding the day in the third
>>> position, even though that lambda function worked for a test case
>>> outside of genfromtxt.
>>>
>>>
>>>        
>>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>>
>>>>          
>> In SAS there are multiple ways to define formats especially dates:
>> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>>
>> It would be nice to accept the common variants (USA vs English dates) as
>> well as two digit vs 4 digit year codes.
>>
>>      
> This is relevant to what I've been doing.  I parsed a SAS input file
> to get the information to pass to genfromtxt, and it might be useful
> to have these types defined.  Again, I'm wondering about whether the
> new datetime dtype might eventually be used for something like this.
>
> Do you know if SAS publishes the format of its datasets, similar to
> Stata?  http://www.stata.com/help.cgi?dta
>    
I am not exactly sure what you mean. Most of the type formats are available 
under the data set informat statement, but you really need to handle the 
special cases when reading data, such as defining strings with sufficient 
length and times. Usually I read dates as strings and then convert them to 
dates as needed, since these are not always correct or in the same 
format in the data.
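
A minimal sketch of that read-as-string-then-convert approach, using only
the Python standard library. The helper name `parse_date` and the list of
candidate formats are assumptions made for illustration; they are not part
of genfromtxt or SAS. Unparseable values come back as None so they can be
inspected later.

```python
from datetime import datetime


def parse_date(text, formats=("%m/%d/%Y", "%m/%d/%y")):
    """Try each candidate format in turn; return a date, or None on failure.

    The default formats assume USA-style month/day/year with either a
    four-digit or a two-digit year (an assumption for this sketch).
    """
    text = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    return None


# Dates kept as strings when the file was read, converted afterwards.
raw = ["09/15/2009", "09/15/09", "not a date"]
converted = [parse_date(value) for value in raw]

# English-style day/month/year needs a different format list.
english = parse_date("15/09/2009", formats=("%d/%m/%Y",))
```

Trying the four-digit year first keeps "09/15/2009" from being rejected by
the two-digit pattern, and the caller decides which convention applies
rather than having it guessed.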

SAS is rather complex as it has multiple ways to create what it calls 
permanent datasets and these are even incompatible across OS's in the 
same version. So really these are not very useful outside of the 
specific version of SAS that is being used. There are many ways to 
transfer files, such as the xport engine, whose output R can read (see 
read.xport in the foreign package, which links to the format). However, 
usually it is just easier to create a new file within SAS.

>    
>>
>>      
>>>> or even
>>>> simpler, define a dtype for the output (you know that your first
>>>> column is a str, your second an object, and the others ints or floats...
>>>>
>>>>
>>>>          
>> How do you specify different dtypes in genfromtxt?
>> I could not see the information in the docstring and the dtype argument
>> does not appear to allow multiple dtypes.
>>
>>      
> I have also been struggling with this (and with modifying the dtype of a
> field in a structured array in place, btw).  To give a quick example,
> here are some of the ways that I expected to work and didn't, and a few
> ways that work.
>
> from StringIO import StringIO
> import numpy as np
>
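> # a few incorrect ones (np.dtype takes a single type; the 2nd and 3rd
> # positional arguments are read as align and copy, so only the first
> # type is used)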
>
> s = StringIO("11.3abcde")
> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])
>
> In [42]: data
> Out[42]: array([ 1,  1, -1])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])
>
> In [45]: data
> Out[45]: array([ 1. ,  1.3,  NaN])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])
>
> In [48]: data
> Out[48]:
> array(['1', '1.3', 'abcde'],
>        dtype='|S5')
>
> # correct few
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
> delimiter=[1,3,5])
>
> In [52]: data
> Out[52]:
> array((1, 1.3, 'abcde'),
>        dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])
>
> In [55]: data
> Out[55]:
> array((1, 1.3, 'abcde'),
>        dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])
>
> # one I expected to work but have probably made an obvious mistake
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [64]: data
> Out[64]: array([ 1,  1, -1])
>
> # "ugly" way to do this, but it works
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [69]: data
> Out[69]:
> array((1, 1.3, 'abcde'),
>        dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
>
> Skipper
>    
Thanks for these examples; they make sense now. I was confused 
because the display shows the dtype as a list, not as a single dtype.
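
For reference, a minimal sketch of the working form, showing that the
list of (name, type) pairs describes one structured dtype rather than
several dtypes. It assumes Python 3 and a NumPy recent enough to accept
the `encoding` argument of genfromtxt. The use of `io.StringIO`, the 'U5'
string type in place of 'a5', and the second data row are assumptions
made for illustration and do not come from the examples above.

```python
from io import StringIO

import numpy as np

# Two fixed-width records: 1-char int, 3-char float, 5-char string.
# The second row is made up for illustration.
text = StringIO("11.3abcde\n24.5fghij\n")

# A list of (name, type) pairs builds ONE structured dtype with three fields.
record_dtype = np.dtype([("myint", "i8"), ("myfloat", "f8"), ("mystring", "U5")])

data = np.genfromtxt(
    text,
    dtype=record_dtype,
    delimiter=[1, 3, 5],  # field widths, not separator characters
    encoding="utf-8",
)

print(isinstance(data.dtype, np.dtype))  # a single dtype object
print(data.dtype.names)                  # the field names inside it
print(data["myfloat"])                   # one column, selected by field name
```

Each field is then addressed by name, which is why the repr prints the
dtype as a list of fields even though it is a single dtype.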


Bruce

