[Numpy-discussion] parsing text strings/files in fromfile, fromstring

Christopher Barker Chris.Barker@noaa....
Tue May 26 01:30:09 CDT 2009


Charles R Harris wrote:
> I am trying to put together some rules for parsing text strings/files in 
> fromfile, fromstring so that the two are consistent.

Thanks for giving these some attention -- they've needed it for a while!

> 1) When the string/file is empty fromfile returns an empty array, split 
> returns an empty string, 

I think the behavior of split() is irrelevant here -- fromstring/file is 
about reading numbers from text -- while split() is very helpful for 
that, it's not what it's specifically for.

> and fromstring converts the empty string to a 
> default value. Which should we use?

they should NEVER return a number when there isn't one in the source.

> 2) When the string/file contains only a single separator 
> fromfile/fromstring both return a single value, while split returns two 
> empty strings. Which should we use?

neither -- see above.

> My preferences would be to return empty arrays whenever the string/file 
> is empty, but I don't feel strongly about that.

yup.

> Also, wouldn't a missing value be better interpreted as nan than zero in 
> the float case?

yes, but since I don't think missing values should be returned at all, 
it doesn't matter. I do think the more interesting case might be a csv 
or tab-delimited file with a line like:

34, 5, 4.6, , , 45, 32

In this case, I suppose it is clear that this is a row in a table that 
is supposed to be 7 items long. With floats, it would be pretty rational 
to put NaNs in there, but without an equivalent for integers, I'd say go 
with an error. Two other options:

1) a "missing_value" keyword -- the user explicitly says what they want 
put in for missing values.

2) return a masked array -- also as a keyword option. Masked arrays are 
supposed to be the numpy way to express missing values.
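
To make those two options concrete, here is a minimal pure-Python sketch 
(not numpy code). The name parse_row and its missing_value keyword are 
hypothetical, made up for illustration only; the mask it returns just 
mimics the information a masked array would carry.

```python
import math


def parse_row(line, sep=",", dtype=float, missing_value=None):
    """Parse one delimited row and return (values, mask).

    Hypothetical helper, not a numpy API. mask[i] is True where the
    field was empty, which is what a masked array would record.
    """
    values = []
    mask = []
    for field in line.split(sep):
        field = field.strip()
        if field:
            # Non-numeric junk such as "x" raises ValueError here.
            values.append(dtype(field))
            mask.append(False)
        elif missing_value is not None:
            # The user said explicitly what to put in for an empty field.
            values.append(dtype(missing_value))
            mask.append(True)
        else:
            # No explicit fill value: refuse to invent a number.
            raise ValueError("empty field and no missing_value given")
    return values, mask


row = "34, 5, 4.6, , , 45, 32"
values, mask = parse_row(row, missing_value=math.nan)
print(values)  # seven entries, NaN in the two empty positions
print(mask)    # True only for the two empty positions
```

With dtype=int and no missing_value the same row raises ValueError, 
which matches the "go with an error" suggestion for integers.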

and yes, fromfile( a_file ) should always return the same thing as 
fromstring( a_file.read() )
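
One way to get that guarantee is to have the file reader delegate to the 
string reader. The sketch below uses plain Python and hypothetical names 
(numbers_from_string, numbers_from_file); it is not how numpy is 
implemented, just an illustration of the invariant.

```python
import io


def numbers_from_string(text, sep=","):
    """Hypothetical text parser: floats separated by sep."""
    if not text.strip():
        return []  # empty input gives an empty result, never a default value
    return [float(field) for field in text.split(sep)]


def numbers_from_file(fileobj, sep=","):
    """Read the whole file and delegate, so the two cannot disagree."""
    return numbers_from_string(fileobj.read(), sep=sep)


text = "1, 2, 3.5, 4\n"
assert numbers_from_file(io.StringIO(text)) == numbers_from_string(text)
assert numbers_from_file(io.StringIO("")) == numbers_from_string("") == []
```

Reading the whole file first gives up streaming, so a real 
implementation would more likely share the tokenizing code instead; the 
point is only that both entry points go through one parser.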

Pauli Virtanen wrote:
> a) fromstring("1,2,x,4", sep=",") -> [1,2]
>    fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
>    fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
>    fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError
> 
> b) fromstring("1,2,x,4", sep=",") -> [1,2]
>    fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
>    fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
>    fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
>    fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError
> 
> c) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
>    fromstring("1,2,x,4", sep=",", count=5) -> [1,2] + SomeWarning
> 
> d) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
>    fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
>    fromstring("1,2,x,4", sep=",", default=3, count=5) -> [1,2,3,4] + SomeWarning
> 
> e) fromstring("1,2,x,4", sep=",") -> ValueError
>    fromstring("1,2,x,4", sep=",", strict=False) -> [1,2]
>    fromstring("1,2,x,4", sep=",", count=5) -> ValueError
>    fromstring("1,2,x,4", sep=",", count=5, strict=False) -> [1,2]

(c) and (d) are out, as I don't think Warnings are the right thing here 
(see my earlier rant).

I don't like (a) and (b), as I think "strict" (with a better name...) 
should be True by default. What I want is (b) with a different default, 
which would be (e) with a "default" (or, maybe "missing"). Those seem to 
have defined "strict" in two ways: both number of elements, and what to 
do with non-numerical input. I wonder if those should be merged? Also, I 
wonder if setting a "missing" should work for any non-numerical entries, 
or only empty space?


I think I'd go with:

f) fromstring("1,2,x,4", sep=",") -> [1,2]
   fromstring("1,2,x,4", sep=",", count=4) -> ValueError
   fromstring("1,2,3,4", sep=",", count=5) -> ValueError
   fromstring("1,2,3,4", sep=",", count=5, strict=False) -> [1,2,3,4]
   fromstring("1,2, ,4", sep=",", missing=3) -> [1,2,3,4]
   fromstring("1,2,x,4", sep=",", missing=3) -> ValueError


I THINK we can break it down into these distinct questions:

(1) What should be returned if there is a non-number between separators 
and there is no default value specified?
        a) ValueError
        b) a default value

(2) What should be returned for a non-number if a default value was specified?
        a) the default value
        b) if it is whitespace:
              the default
           else:
              ValueError

(3) What should be returned if EOF is reached before count is reached?
    (a) a warning
    (b) just the numbers read so far
    (c) if strict:
           an exception
        else:
           just the numbers read so far

(4) Should any non-numeric text behave the same as EOF when count is not 
specified?
      (a) yes
      (b) no

(5) what should "strict" default to?
      (a) True
      (b) False

(6) Should \n be interpreted as a sep along with the specified sep?
      (a) yes
      (b) no
(OK, I added that one as my pet desire...)

I vote:
(1) a
(2) b
(3) c
(4) b
(5) a
(6) a

Does that cover it?
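
For what it's worth, here is a rough pure-Python sketch of how those 
votes could combine. The function parse_numbers and its count, strict 
and missing keywords are hypothetical (the names come from this thread, 
not from any existing numpy signature), and it assumes sep is not itself 
whitespace. Under vote (4)(b) it raises on "1,2,x,4" even when count is 
not given.

```python
import re


def parse_numbers(text, sep=",", count=-1, strict=True, missing=None):
    """Hypothetical parser following votes (1)a (2)b (3)c (4)b (5)a (6)a."""
    if not text.strip():
        return []  # empty input: there are no numbers to return
    values = []
    # (6)a: a newline separates values just like sep does.
    for field in re.split(re.escape(sep) + r"|\n", text.strip()):
        field = field.strip()
        if field:
            # (1)a and (4)b: non-numeric text raises ValueError, it does
            # not silently end the read.
            values.append(float(field))
        elif missing is not None:
            # (2)b: only a whitespace-only field gets the default.
            values.append(float(missing))
        else:
            raise ValueError("empty field and no 'missing' value given")
    if count >= 0:
        if len(values) < count and strict:
            # (3)c with (5)a: running out of input early is an error
            # unless the caller passes strict=False.
            raise ValueError("expected %d values, got %d" % (count, len(values)))
        values = values[:count]
    return values


print(parse_numbers("1,2\n3,4"))                        # newline acts as a separator
print(parse_numbers("1,2, ,4", missing=3))              # blank field filled with 3.0
print(parse_numbers("1,2,3,4", count=5, strict=False))  # short read tolerated
```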

> and binary data implied by sep='' would be interpreted in the same
> way it would if first converted to comma-separated text.

Only with regard to fewer than count numbers being read -- I don't think 
any of the rest applies -- though I'm still for splitting binary and text 
file reading anyway.

> I'd vote for (e) if the slate was clean, but since it's not:

I think the slate is clean enough, given that the current implementation 
is buggy.

While you are digging into this code, we did have a discussion a while 
back, captured in this ticket:

http://projects.scipy.org/numpy/ticket/909

Any chance you could address any of that, too?

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

NOAA/OR&R/HAZMAT         (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

