[Numpy-discussion] fromfile() for reading text (one more time!)

josef.pktd@gmai...
Thu Jan 7 23:26:41 CST 2010


On Thu, Jan 7, 2010 at 11:10 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker
> <Chris.Barker@noaa.gov> wrote:
>> Bruce Southey wrote:
>>>> <Chris.Barker@noaa.gov> wrote:
>>
>>> Using the numpy NaN or similar (noting R's approach to missing values,
>>> which in turn allows it to have the above functionality) is just a
>>> very bad idea for missing values, because you always have to check
>>> which NaN is a missing value and which was due to some numerical
>>> calculation.
>>
>> well, this is specific to reading files, so you know where it came from.
>
> You can only know where it came from when you compare the original
> array to the transformed one. Also a user has to check for missing
> values or numpy has to warn a user that missing values are present
> immediately after reading the data so the appropriate action can be
> taken (like using functions that handle missing values appropriately).
> That is my second problem with using codes (NaN, -99999, etc.) for
> missing values.
>
>
>
>> And the principle of fromfile() is that it is fast and simple, if you
>> want masked arrays, use slower, but more full-featured methods.
>
> So in that case it should fail with missing data.
>
>>
>> However, in this case:
>>
>> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
>> Out[9]: array([  3.,   4.,  NaN,   5.])
>>
>>
>> An actual NaN is read from the file, rather than a missing value.
>> Perhaps the user does want the distinction, so maybe it should really
>> only fill it in if the user asks for it, by specifying
>> "missing_value=np.nan" or something.
>
> Yes, that is my first problem of using predefined codes for missing
> values as you do not always know what is going to occur in the data.
>
>
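A rough pure-Python sketch of the distinction being discussed -- a literal
"NaN" token versus an empty (missing) field. The missing_value argument is
hypothetical, following Christopher's suggestion; it is not an existing
fromfile/fromstring parameter, and this is not how the C parser works:

    import numpy as np

    def parse_line(line, missing_value=np.nan, sep=","):
        # A literal "NaN" token parses as float('nan'); only an empty
        # field (e.g. "3, , 5") is replaced by missing_value.
        fields = [f.strip() for f in line.split(sep)]
        return np.array([float(f) if f else missing_value for f in fields])

    parse_line("3, 4, NaN, 5")   # array([  3.,   4.,  nan,   5.])
    parse_line("3, , NaN, 5")    # array([  3.,  nan,  nan,   5.])
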
>>
>>> From what I can see, you expect that fromfile() should only
>>> split at the supplied delimiters, optionally(?) strip any whitespace
>>
>> whitespace stripping is not optional.
>>
>>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>>> actually assumes multiple delimiters because there is no comma between
>>> 4 and 5, or between 8 and 9.
>>
>> Yes, that's the point. I thought about allowing arbitrary multiple
>> delimiters, but I think '\n' is a special case - for instance, a comma
>> at the end of some numbers might mean missing data, but a '\n' would not.
>>
>> And I couldn't really think of a useful use-case for arbitrary multiple
>> delimiters.
>>
>>> In Josef's last case, how many 'missing values' should there be?
>>
>>  >> extra newlines at end of file
>>  >> str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>>
>> none -- exactly why I think \n is a special case.
>
> What about '\r' and '\r\n'?

Yes, I forgot about this, and it will be the most common case for
Windows users like myself.

I think \r should be stripped automatically, as in non-binary (text mode)
reading of files in Python.
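A small sketch of what "stripped automatically" could look like in
practice, using plain Python newline normalization before parsing (nothing
below is part of fromfile itself; it is just one way to get the same
effect today):

    import numpy as np

    # Simulated file contents with Windows ('\r\n') line endings.
    raw = b"1, 2, 3, 4\r\n5, 6, 7, 8\r\n9, 10, 11, 12\r\n"

    # Normalize to '\n' the way Python's universal-newline text mode does.
    text = raw.decode("ascii").replace("\r\n", "\n").replace("\r", "\n")

    # Parse each non-empty line, then stack the rows.
    rows = [np.fromstring(line, sep=",")
            for line in text.split("\n") if line.strip()]
    data = np.array(rows)
    print(data.shape)    # (3, 4)
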

>
>>
>> What about:
>>  >> extra newlines in the middle of the file
>>  >> str =  '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>>
>> I think they should be ignored, but I hope I'm not making something that
>> is too specific to my personal needs.
>
> Not really, it is more that I am being somewhat difficult to ensure I
> understand what you actually need.
>
> My problem with this is that you are reading one huge 1-D array (that
> you can resize later) rather than a 2-D array with rows and columns
> (which is what I deal with). But I agree that you can have an option
> to treat '\n' or '\r' as a delimiter, though I think it should be
> turned off by default.
>
>
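For the 2-D case Bruce describes, a sketch of the "read flat, reshape
later" pattern that the 1-D behavior implies (this assumes the number of
columns is known in advance, here 4):

    import numpy as np

    text = "1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n"

    # Treat '\n' as just another delimiter: read one flat 1-D array ...
    flat = np.fromstring(text.strip().replace("\n", ","), sep=",")

    # ... then recover the 2-D layout, assuming 4 columns per row.
    table = flat.reshape(-1, 4)
    print(table.shape)   # (3, 4)
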
>>
>> Travis Oliphant wrote:
>>> +1 (ignoring new-lines transparently is a nice feature).  You can also
>>> use sscanf with weave to read most files.
>>
>> right -- but that requires weave. In fact, MATLAB has an fscanf function
>> that allows you to pass in a C format string, and it vectorizes it to use
>> the same one over and over again until it's done. It's actually quite
>> powerful and flexible. I once started with that in mind, but didn't have
>> the C chops to do it. I ended up with a tool that only did doubles (come
>> to think of it, MATLAB only does doubles, anyway...)
>>
>> I may someday write a whole new C (or, more likely, Cython) function
>> that does something like that, but for now, I'm just trying to get
>> fromfile to be useful for me.
>>
>>
>>> +1   (much preferable to insert NaN or another user-supplied value than
>>> raise ValueError, in my opinion)
>>
>> But raise an error for integer types?
>>
>> I guess this is still up in the air -- no consensus yet.
>>
>> Thanks,
>>
>> -Chris
>>
>
> You should have a corresponding value for ints, because raising an
> exception would be inconsistent with allowing floats to have a value.

No, I think different nan/missing-value handling between integers and
floats is a natural distinction. There is no default nan code for
integers, but nan (and inf) are valid floating point values (even if
nan is not a number). And the default treatment of nans in numpy is
getting pretty good (e.g. I like the new (nan)sort).
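To make that distinction concrete in current NumPy terms (the integer
sentinel at the end is only one possible convention, not anything fromfile
does by default):

    import numpy as np

    a = np.array([3.0, 4.0, np.nan, 5.0])    # float arrays can carry nan
    print(np.isnan(a).any())                 # True

    try:
        np.array([3, 4, np.nan, 5], dtype=int)
    except ValueError as err:                # nan has no integer representation
        print(err)

    # One possible sentinel for integer data: the most negative value
    # the dtype can hold (a convention, not a NumPy default).
    print(np.iinfo(np.int32).min)            # -2147483648
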


> If you must keep the user-defined dtype then, as Josef suggests, just
> use some code, be it -999 or the most negative number representable by
> the defined dtype, or just convert the ints into floats if the
> user does not define a missing value code.  It would be nice to either
> return the number of missing values or display a warning indicating
> how many occurred.

A warning would be good, but doing np.any(np.isnan(x)) or
np.isnan(x).sum() on the result is always a good idea for a user when
missing values are a possibility.
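A sketch of that check applied to the kind of result discussed above:

    import numpy as np

    x = np.fromstring("3, 4, NaN, 5", sep=",")

    if np.any(np.isnan(x)):
        print("%d of %d values are nan/missing" % (np.isnan(x).sum(), x.size))
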

Josef

>
> Bruce
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

