[Numpy-discussion] fromfile() for reading text (one more time!)

Christopher Barker Chris.Barker@noaa....
Fri Jan 8 17:12:24 CST 2010


Bruce Southey wrote:
> Also a user has to check for missing
> values or numpy has to warn a user

I think warnings are next to useless for all but interactive work -- so 
I don't want to rely on them.

> that missing values are present
> immediately after reading the data so the appropriate action can be
> taken (like using functions that handle missing values appropriately).
> That is my second problem with using codes (NaN, -99999 etc)  for
> missing values.

But I think you're right -- if someone writes code, tests it with good 
input, then later runs it on input with missing values, they are likely 
never to have bothered to test for missing values.

So I think missing values should only be replaced by something if the 
user specifically asks for it.

>> And the principle of fromfile() is that it is fast and simple, if you
>> want masked arrays, use slower, but more full-featured methods.
> 
> So in that case it should fail with missing data.

Well, I'm not so sure -- the point is performance, no reason not to have 
high performing code that handles missing data.

> What about '\r' and '\n\r'?

I have thought about that -- I'm hoping that Python's text file reading 
will just take care of it, but as we're working with C file handles here 
(I think), I guess not. '\n\r' is easy -- the '\r' is just extra 
whitespace. A lone '\r' is another case to handle.


> My problem with this is that you are reading one huge 1-D array  (that
> you can resize later) rather than a 2-D array with rows and columns
> (which is what I deal with).

That's because fromfile() is not designed to be row-oriented at all, 
and the binary read certainly isn't. I'm just trying to make this easy 
-- though it's not turning out that way!
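
To make the 1-D point concrete, here is a minimal sketch of reading a small whitespace-separated table with the existing text mode of fromfile() and reshaping it afterwards. The file contents and the column count are made up for illustration; the reader has no way to discover the number of columns, so the caller supplies it.

```python
import os
import tempfile

import numpy as np

# Write a small whitespace-separated table with 2 rows and 3 columns.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("1 2 3\n4 5 6\n")
    path = f.name

try:
    # In text mode, a space in `sep` matches any run of whitespace,
    # newlines included, so the row structure is lost and a flat
    # 1-D array comes back.
    flat = np.fromfile(path, dtype=float, sep=" ")
finally:
    os.remove(path)

n_columns = 3  # assumption: known to the caller, not read from the file
table = flat.reshape(-1, n_columns)
```

This only works here because whitespace is the separator. With sep="," the line breaks are not treated as delimiters, which is the case under discussion.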

> But I agree that you can have an option
> to say treat '\n' or '\r' as a delimiter but I think it should be
> turned off by default.

That's what I've done.

> You should have a corresponding value for ints because raising an
> exception would be inconsistent with allowing floats to have a value.

I'm not sure I care, really -- but I think having the user specify the 
fill value is the best option, anyway.

josef.pktd@gmail.com wrote:
>>> none -- exactly why I think \n is a special case.
>> What about '\r' and '\n\r'?
> 
> Yes, I forgot about this, and it will be the most common case for
> Windows users like myself.
> 
> I think \r should be stripped automatically, like in non-binary
> reading of files in python.

Except for folks like me who have old Mac files lying around... so I 
want something like "universal newlines" support.
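
As a rough illustration of what "universal newlines" means here, the sketch below maps Windows and old Mac line endings onto '\n' before any parsing happens. The function name is hypothetical and this is plain Python on a string, not what fromfile() does on a C file handle.

```python
def normalize_newlines(text):
    r"""Convert Windows ('\r\n') and old Mac ('\r') line endings to '\n'."""
    # Replace the two-character sequence first, so that a Windows line
    # ending does not turn into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


unix = "1,2\n3,4\n"
windows = "1,2\r\n3,4\r\n"
old_mac = "1,2\r3,4\r"

assert normalize_newlines(windows) == unix
assert normalize_newlines(old_mac) == unix
assert normalize_newlines(unix) == unix
```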

> A warning would be good, but doing np.any(np.isnan(x)) or
> np.isnan(x).sum() on the result is always a good idea for a user when
> missing values are a possibility.

Right, but the issue is that the user has to know that they are 
possible, and we all know how carefully we all read docs!
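
For reference, the check being suggested looks roughly like this. The array is a made-up stand-in for data returned by a reader that codes missing values as NaN.

```python
import numpy as np

# Stand-in for data from a reader that codes missing values as NaN.
x = np.array([1.0, np.nan, 3.0, np.nan])

has_missing = bool(np.any(np.isnan(x)))  # True if at least one NaN is present
n_missing = int(np.isnan(x).sum())       # number of NaN entries

if has_missing:
    # NaN-aware reductions skip the missing entries instead of
    # propagating NaN into the result.
    mean_of_present = np.nanmean(x)
```

Note that this only helps for floating-point data, since an integer array has no NaN to test for.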

Thanks for your input -- I think I know what I'd like to do, but it's 
proving less than trivial to do it, so we'll see.

In short:

1) optionally allow newlines to serve as a delimiter, so large tables 
can be read.

2) raise an exception for missing values, unless:

3) the user specifies a fill value of their choice (compatible with 
the chosen data type).
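
The three points above can be sketched in pure Python as follows. This is only an illustration of the intended behavior, not the fromfile() implementation: the function name, its parameters, and the -99999 fill code are all assumptions made for the example.

```python
def parse_values(text, sep=",", newline_is_sep=False, fill_value=None,
                 convert=float):
    """Hypothetical helper (not a numpy function): parse delimited text
    into a flat list of values."""
    if newline_is_sep:
        # Point 1: optionally treat every line break as one more delimiter.
        text = sep.join(text.splitlines())
    values = []
    for field in text.split(sep):
        field = field.strip()
        if field:
            values.append(convert(field))
        elif fill_value is not None:
            # Point 3: the caller explicitly asked for a replacement, which
            # must itself be convertible to the chosen type.
            values.append(convert(fill_value))
        else:
            # Point 2: never silently invent a value.
            raise ValueError("missing value in input")
    return values


data = "1,2,3\n4,,6\n"

# With a fill value, the empty field is replaced.
filled = parse_values(data, newline_is_sep=True, fill_value=-99999, convert=int)

# Without one, the same input is rejected.
try:
    parse_values(data, newline_is_sep=True, convert=int)
    raised = False
except ValueError:
    raised = True
```

Passing the fill value through the same conversion as the data means that, for example, a NaN fill value is rejected for integer data rather than being quietly mangled.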


-Chris





-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov