[Numpy-discussion] fromfile() for reading text (one more time!)
Thu Jan 7 14:08:23 CST 2010
Pauli Virtanen wrote:
> Mon, 2010-01-04 at 17:05 -0800, Christopher Barker wrote:
>> it also does odd things with spaces
>> embedded in the separator:
>> ", $ #" matches all of: ",$#" ", $#" ",$ #"
> That's a documented feature:
OK, I've written a patch that allows newlines to be interpreted as
separators in addition to whatever is specified in sep.
In the process of testing, I ran into these issues again; they are still
marked as "needs decision".
In short: what to do with missing values?
I'd like to address this bug, but I need a decision to do so.
Raise a ValueError when values are missing.
No function should EVER return data that is not there. Period. It is
simply asking for hard-to-find bugs. Therefore:
fromstring("3, 4,,5", sep=",")
Should never, ever, return:
array([ 3., 4., 0., 5.])
Which is what it does now. bad. bad. bad.
A) Raising a ValueError is the easiest way to get "proper" behavior.
Folks can use a more sophisticated file reader if they want missing
values handled. I'm willing to contribute this patch.
B) If the dtype is a floating point type, NaN could fill in the
missing values -- a fine idea, but you can't use it for integers, and
zero is a really bad replacement!
C) The user could specify what they want filled in for missing
values. This is a fine idea, though I'm not sure I want to take the time
to implement it.
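
To make the three options concrete, here is a rough pure-Python sketch of the semantics only. The name parse_fields and its "missing" parameter are hypothetical, made up for illustration; they are not part of numpy and not what the patch does internally.

```python
import math

_RAISE = object()  # sentinel: no fill value given, so missing fields are errors


def parse_fields(text, sep=",", missing=_RAISE):
    """Hypothetical helper (not numpy API): parse sep-delimited floats.

    missing left at default -> option A: raise ValueError on an empty field
    missing=math.nan        -> option B: fill empty fields with NaN
    missing=<any value>     -> option C: fill with a user-chosen value
    """
    values = []
    for index, field in enumerate(text.split(sep)):
        field = field.strip()
        if field == "":
            if missing is _RAISE:
                raise ValueError("missing value in field %d" % index)
            values.append(missing)
        else:
            # float() itself raises ValueError on text it cannot fully parse
            values.append(float(field))
    return values
```

With this sketch, parse_fields("3, 4,,5") raises a ValueError, while parse_fields("3, 4,,5", missing=math.nan) puts a NaN in the third slot instead of a silent zero.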
Oh, and this is a bug too, with probably the same solution:
In : np.fromstring("hjba", sep=',')
Out: array([ 0.])
In : np.fromstring("34gytf39", sep=',')
Out: array([ 34.])
One more unresolved question:
np.fromstring("3, 4, 5,", sep=",")
it currently returns:
array([ 3., 4., 5.])
which seems a bit inconsistent with missing value handling. I also found
In : np.fromstring("3, 4, 5 , ", sep=",")
Out: array([ 3., 4., 5., 0.])
so if there is some extra whitespace in there, it does return a missing
value. With my proposal, that wouldn't happen, but you might get an
exception. I think you should, but it'll be easier to implement my
"allow newlines" code if not.
So, should I do (A)?
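
For the trailing-separator question, one consistent rule would be to tolerate a single separator at the very end (with or without whitespace after it) and treat every other empty field as an error. A minimal sketch of that rule, assuming a hypothetical helper named split_fields that is not part of numpy or of my patch:

```python
def split_fields(text, sep=","):
    """Hypothetical helper: parse floats, tolerating one trailing separator.

    A single separator at the very end (optionally followed by whitespace)
    is dropped; any other empty field raises ValueError.
    """
    fields = [f.strip() for f in text.split(sep)]
    if len(fields) > 1 and fields[-1] == "":
        # "3, 4, 5," and "3, 4, 5 , " both end in exactly one empty field
        fields.pop()
    for index, field in enumerate(fields):
        if field == "":
            raise ValueError("missing value in field %d" % index)
    return [float(f) for f in fields]
```

Under this rule "3, 4, 5," and "3, 4, 5 , " give the same three values, and "3, 4,,5" still raises. Whether a trailing separator should be tolerated at all is exactly the open question.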
I've got a patch mostly working (except for the above issues) that will
allow fromfile/string to read multiline non-whitespace separated data in
In : str
Out: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
In : np.fromstring(str, sep=',', allow_newlines=True)
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
        12.])
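
The intended behaviour can be approximated in pure Python by treating a newline as one more separator. This is only a sketch of the semantics, not the C implementation in the patch, and parse_multiline is a hypothetical name:

```python
import re


def parse_multiline(text, sep=","):
    """Hypothetical stand-in for the proposed allow_newlines=True behaviour."""
    # Split on either the separator or a newline; re.escape keeps sep literal.
    fields = re.split("(?:%s|\n)" % re.escape(sep), text)
    # float() ignores surrounding whitespace, so " 2" parses as 2.0
    return [float(f) for f in fields]
```

Calling parse_multiline on the string above yields the twelve values as one flat list, just like the flat array shown.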
I think this is a very helpful enhancement, and, as it is a new kwarg,
1) Might it be accepted for inclusion?
2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit,
but also long -- I used it for the flag name in the C code, too.
3) What C datatype should I use for a boolean flag? I used a char, but I
don't know what the numpy standard is.
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception