[Numpy-discussion] data type specification when using numpy.genfromtxt

Derek Homeier derek@astro.physik.uni-goettingen...
Mon Jun 27 19:14:00 CDT 2011


Hi Chao,

by mistake I did not reply to the list last time...

On 27.06.2011, at 10:30PM, Chao YUE wrote:
Hi Derek!
> 
> I tried with the latest version of the python(x,y) package with numpy version 1.6.0. I gave the data to you with reduced columns (10 columns) and rows.
> 
> b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,usecols=tuple(range(10)),dtype=['S10'] + [ float for n in range(9)]) works.
> if you change  usecols=tuple(range(10))  to usecols=range(10), it still works.
> 
> b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=None) works.
> 
> but b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=['S10'] + [ float for n in range(9)]) didn't work. 
> 
> I use Python(x,y)-2.6.6.1 with numpy version as 1.6.0, I use windows 32-bit system.
> 
> Please don't spend too much time on this if it's not a potential problem.
> 
OK, dtype=None works on 1.6.0, that's the important bit. 
From your example file it seems the dtype list does not work without specifying usecols, because your header contains an excess semicolon in the field "Air temperature (High; HMP45C)", thus genfromtxt expects more data columns than actually exist. If you replace the semicolon you should be set (or, if I may suggest, write another header line with catchier field names so you don't have to work with array fields like "b['Water vapor density by LiCor 7500']"  ;-). 
Otherwise both options work for me with python2.6+numpy-1.5.1 as well as 1.6.0/1.6.1rc1. 
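
For instance, something along these lines (untested sketch - the short field names below are just placeholders for your 10 reduced columns, adjust them and the column count to your actual file):

import numpy as np

# Skip the original header line and supply short field names instead;
# the names here are made up for illustration only.
names = ['timestamp', 'co2_flux', 'net_rad', 'h_flux', 'le_flux',
         'ustar', 'air_temp', 'rh', 'wind_speed', 'h2o_density']
b = np.genfromtxt('99Burn2003all_new.csv', delimiter=';', skip_header=1,
                  names=names, dtype=['S10'] + [float]*9)
print(b['co2_flux'][:5])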

I am curious though why your python interpreter gave this error message: 
> ValueError                                Traceback (most recent call last)
> 
> D:\data\LaThuile_ancillary\Jim_Randerson_data\<ipython console> in <module>()
> 
> C:\Python26\lib\site-packages\numpy\lib\npyio.pyc in genfromtxt(fname, dtype, co
> mments, delimiter, skiprows, skip_header, skip_footer, converters, missing, miss
> ing_values, filling_values, usecols, names, excludelist, deletechars, replace_sp
> ace, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_rais
> e)
>    1449             # Raise an exception ?
> 
>    1450             if invalid_raise:
> -> 1451                 raise ValueError(errmsg)
>    1452             # Issue a warning ?
> 
>    1453             else:
> 
> ValueError

since ipython2.6 on my Mac reported this:
...
   1450             if invalid_raise:
-> 1451                 raise ValueError(errmsg)
   1452             # Issue a warning ?

   1453             else:

ValueError: Some errors were detected !
    Line #3 (got 10 columns instead of 11)
    Line #4 (got 10 columns instead of 11)
etc....
which of course provided the right lead to the problem - was the actual errmsg really missing, or did you cut the message too soon?
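
(For completeness: if you ever want genfromtxt to just skip such malformed lines rather than raise, you can pass invalid_raise=False - it then only issues a warning and drops the offending rows. Untested on your exact file, so treat this as a sketch:

b = np.genfromtxt('99Burn2003all_new.csv', delimiter=';', names=True,
                  dtype=None, invalid_raise=False)
)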

> the final thing is, when I try to do this (I want to try the missing_values in numpy 1.6.0), it gives an error:  
> 
> In [33]: import StringIO as StringIO
> 
> In [34]: data = "1, 2, 3\n4, 5, 6"
> 
> In [35]: np.genfromtxt(StringIO(data), delimiter=",",dtype="int,int,int",missing_values=2)
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> 
> D:\data\LaThuile_ancillary\Jim_Randerson_data\<ipython console> in <module>()
> 
> TypeError: 'module' object is not callable
> 
You want to use "from StringIO import StringIO" (or write "StringIO.StringIO(data)"). 
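
That is, something like this (Python 2 syntax, matching your setup):

import numpy as np
from StringIO import StringIO   # the module and the class share the same name

data = "1, 2, 3\n4, 5, 6"
np.genfromtxt(StringIO(data), delimiter=",", dtype="int,int,int")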
But again, this will not work the way you expect with int/float numbers set as missing_values when reading into regular arrays. I've tested this on 1.6.1 and the current development branch as well, and the missing_values are only considered for masked arrays. This is not likely to change soon, and may actually be intentional, so to process those numbers on read-in, your best option would be to define a custom set of "converters=conv" as shown in my last mail.
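
Applied to your file, the converter idea could look roughly like this (untested sketch - I strip the raw field first so you don't have to match any leading blanks exactly, and '-999.0' is an assumption about how the marker is literally written in the file):

import numpy as np

# Map the -999 marker to NaN on the fly for columns 1..48
# (column 0 is the string timestamp).
conv = dict((n, lambda s: np.nan if s.strip() == '-999.0' else float(s))
            for n in range(1, 49))
b = np.genfromtxt('99Burn2003all.csv', delimiter=';', names=True,
                  dtype=['S10'] + [float]*48, converters=conv)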

Cheers,
							Derek

> 2011/6/27 Derek Homeier <derek@astro.physik.uni-goettingen.de>
> Hi Chao,
> 
> this seems to have become quite a number of different issues!
> But let's make sure I understand what's going on...
> 
> > Thanks very much for your quick reply. I make a short summary of what I've tried. Actually the ['S10'] + [ float for n in range(48) ] only works when you explicitly specify the columns to be read, and genfromtxt cannot automatically determine the type if you don't specify the type....
> >
> 
> > In [164]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=tuple(range(49)),dtype=['S10'] + [ float for n in range(48)])
> ...
> > But if I use the following, it gives error:
> >
> > In [171]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,dtype=['S10'] + [ float for n in range(48)])
> > ---------------------------------------------------------------------------
> > ValueError                                Traceback (most recent call last)
> >
> And the above (without the usecols) did work if you explicitly typed dtype=('S10', float, float....)? That by itself would be quite weird, because the two should be completely equivalent.
> What happens if you cast the generated list to a tuple - dtype=tuple(['S10'] + [ float for n in range(48)])?
> If you are using a recent numpy version (1.6.0 or 1.6.1rc1), could you please file a bug report with complete machine info etc.? But I suspect this might be an older version, you should also be able to simply use 'usecols=range(49)' (without the tuple()). Either way, I cannot reproduce this behaviour with the current numpy version.
> 
> > If I don't specify the dtype, it will not recognize the type of the first column (it displays as nan):
> >
> > In [172]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=(0,1,2))
> >
> > In [173]: b
> > Out[173]:
> > array([(nan, -999.0, -1.028), (nan, -999.0, -0.40899999999999997),
> >        (nan, -999.0, 0.16700000000000001), ..., (nan, -999.0, -999.0),
> >        (nan, -999.0, -999.0), (nan, -999.0, -999.0)],
> >       dtype=[('TIMESTAMP', '<f8'), ('CO2_flux', '<f8'), ('Net_radiation', '<f8')
> > ])
> >
> You _do_ have to specify 'dtype=None', since the default is 'dtype=float', as I have remarked in my previous mail. If this does not work, it could be a matter of the numpy version again - there were a number of type conversion issues fixed between 1.5.1 and 1.6.0.
> >
> > Then the final question is, actually the '-999.0' in the data is missing value, but I cannot display it as 'nan' by specifying the missing_values as '-999.0':
> > but whether I set the missing_values to -999.0 or use a dictionary, neither works...
> ...
> >
> > Even this doesn't work (suppose 2 is our missing_value),
> > In [184]: data = "1, 2, 3\n4, 5, 6"
> >
> > In [185]: np.genfromtxt(StringIO(data), delimiter=",",dtype="int,int,int",missing_values=2)
> > Out[185]:
> > array([(1, 2, 3), (4, 5, 6)],
> >       dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
> 
> OK, same behaviour here - I found the only tests involving 'valid numbers' as missing_values use masked arrays; for regular ndarrays they seem to be ignored. I don't know if this is by design - the question is, what do you need to do with the data if you know ' -999' always means a missing value? You could certainly manipulate them after reading in...
> If you have to convert them already on reading in, and using np.mafromtxt is not an option, your best bet may be to define a custom converter like (note you have to include any blanks, if present)
> 
> conv = dict(((n, lambda s: s==' -999' and np.nan or float(s)) for n in range(1,49)))
> 
> Cheers,
>                                                Derek
> 
> 
> 
> 
> -- 
> ***********************************************************************************
> Chao YUE
> Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
> UMR 1572 CEA-CNRS-UVSQ
> Batiment 712 - Pe 119
> 91191 GIF Sur YVETTE Cedex
> Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16
> ************************************************************************************
> 
> <99Burn2003all_new.csv>


