[SciPy-User] Suggestion for numpy.genfromtxt documentation

Dharhas Pothina Dharhas.Pothina@twdb.state.tx...
Mon Oct 12 09:13:45 CDT 2009


Hi All,

Before I start I wanted to let all of you know that I really appreciate the work that has gone into genfromtxt. It is a hugely useful function that has become indispensable in my work. A lot of the problems I have in general come from the fact that I am a fairly new python/numpy user and don't always understand some of the intricacies involved.

Just a disclaimer: I am not familiar enough with the way genfromtxt works to have understood the entire discussion that followed my posting, so I'm only going to answer the questions I can answer.

>>> Bruce Southey <bsouthey@gmail.com> 10/7/2009 2:20 PM >>>
>>> What did you actually expect?
>>> It would be very informative if you could provide a simple example of 
>>> this for testing.

Coming from a Matlab background, the first thing I would have expected when given an option to read in (or otherwise define) column variables is a structure which lets me know what the name of each column is. In Matlab this would be a variable, say 'a', such that a.header is a list of header names and a.data holds the data in a 2D array in which column 'n' holds the data associated with a.header[n].

Now that I've become fairly used to the way Python does things, my modified expectation is that if I read a file with the data below:

10.0 20.1 30.7
10.0 30.2 40.3
20.1 21.3 67.5
...

with the command: a = np.genfromtxt(fname,usecols=(0,1,2),names='x,y,z')

I should get a structured array

such that a['x'] = np.array([10.0,10.0,20.1,...])

etc.

If you would like a sample data file I can provide one.
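
As a minimal sketch of that expectation, here is the same command run against an in-memory copy of the three rows shown above. It assumes a recent NumPy, uses io.StringIO as a stand-in for the real file, and adds dtype=None so that each column gets its own inferred type; those choices are mine and are not part of the original command.

```python
import io
import numpy as np

# In-memory stand-in for the data file (only the three rows shown above).
sample = io.StringIO("10.0 20.1 30.7\n10.0 30.2 40.3\n20.1 21.3 67.5\n")

# dtype=None asks genfromtxt to infer each column's type individually.
a = np.genfromtxt(sample, usecols=(0, 1, 2), names="x,y,z", dtype=None)

print(a.dtype.names)  # field names attached to the columns
print(a["x"])         # first column, accessed by name
```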

>>> There are many combinations of arguments so not all have been tested and 
>>> it is not always clear what the expected behavior should be.

I think for me the confusion comes from an initial lack of understanding of how dtypes work. If I type help(np.genfromtxt) in IPython I get:

names : {None, True, string, sequence}, optional
    If `names` is True, the field names are read from the first valid line
    after the first `skiprows` lines.
    If `names` is a sequence or a single-string of comma-separated names,
    the names will be used to define the field names in a flexible dtype.
    If `names` is None, the names of the dtype fields will be used, if any.

My understanding of this was that the names argument would be used to define the field names. What I didn't realize is that if the dtype is left at its default (rather than set explicitly or set to None), then since all the data in the file are floats, the dtype for the entire array is float rather than each column having its own dtype. So there are no column-specific dtypes whose field names can be set to the values I specified, and the field names I set are ignored (at least that's what I think is happening).
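
One way to see the difference between a single float dtype and per-column dtypes is to inspect dtype.names on the result. The sketch below assumes a recent NumPy and uses io.StringIO in place of a file; the field list is only an illustration. It does not try to reproduce the behaviour of any particular NumPy release when names is given without a dtype.

```python
import io
import numpy as np

text = "10.0 20.1 30.7\n10.0 30.2 40.3\n20.1 21.3 67.5\n"

# No names, default dtype: one float dtype for the whole 2D array.
plain = np.genfromtxt(io.StringIO(text))

# Explicit per-column dtype: a 1D structured array with named fields.
fields = [("x", "f8"), ("y", "f8"), ("z", "f8")]
structured = np.genfromtxt(io.StringIO(text), dtype=fields)

for label, arr in (("plain", plain), ("structured", structured)):
    # dtype.names is None when the array has no named fields.
    print(label, arr.shape, arr.dtype.names)
```

Checking arr.dtype.names right after loading is a cheap way to tell whether the requested names were actually applied.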

To me the reason for having the 'names' argument is so that there is a mechanism to show what the names of each column are. The fact that it fails silently when the dtype is not specified is what was problematic. So my suggestion was to do one of the following:

1) add a note to the docstring that dtype needs to be specified for the 'names' argument to work
2) change genfromtxt to default to dtype=None when the 'names' argument is given without a dtype being specified
3) issue some sort of warning/error
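
Suggestions 2 and 3 can be approximated on the user side without changing NumPy. The wrapper below is a hypothetical helper (genfromtxt_named is not a NumPy function) and only a sketch under the assumption of a recent NumPy: it defaults to dtype=None when names are supplied and raises an error if the result ends up without named fields.

```python
import io
import numpy as np

def genfromtxt_named(fname, names, dtype=None, **kwargs):
    """Hypothetical helper, not part of NumPy."""
    # Suggestion 2: fall back to dtype=None when the caller gives no dtype.
    arr = np.genfromtxt(fname, names=names, dtype=dtype, **kwargs)
    # Suggestion 3: complain loudly instead of dropping the names silently.
    if arr.dtype.names is None:
        raise ValueError("names were given but the result has no named fields")
    return arr

data = io.StringIO("10.0 20.1 30.7\n10.0 30.2 40.3\n")
a = genfromtxt_named(data, names="x,y,z")
print(a["y"])
```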

>>> From the numpy help, there is this example:
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'), 
>>> ('mystring','S5')], delimiter=",")
>>>
>>> It does not help that the dtype of structured arrays also includes the 
>>> actual name. So I do not think we can use dtype argument without using 
>>> the combination of dtype and name. Perhaps if dtype is split into names 
>>> and formats so that dtype=('name', 'format').

I think that when I was reading the help, I was immediately drawn to the 'names' argument as the part of the function that would do what I needed. It was only a while later that I read through things more completely and worked out the connection to 'dtype', and also the fact that I could specify the field names through the 'dtype' argument as well. To me the combination of dtype=None & names='x,y,z' is more useful because I can give each column a name but let numpy figure out the format automatically, without having to specify each column manually.
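
To make the comparison concrete, here is a small sketch of the two approaches side by side. It assumes a recent NumPy; the two-row sample with mixed integer and float columns is made up for illustration and is not from my data file.

```python
import io
import numpy as np

# Made-up sample: integer, float and integer columns.
text = "1 2.5 3\n4 5.5 6\n"

# Names only: numpy infers the format of each column.
inferred = np.genfromtxt(io.StringIO(text), dtype=None, names="x,y,z")

# Names and formats spelled out by hand through the dtype argument.
explicit = np.genfromtxt(
    io.StringIO(text),
    dtype=[("x", "i8"), ("y", "f8"), ("z", "i8")],
)

print(inferred.dtype)
print(explicit.dtype)
```

Both calls give a structured array with fields x, y and z; the first just saves typing out a format for every column.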

- dharhas



