[Numpy-discussion] fromfile() for reading text (one more time!)

Christopher Barker Chris.Barker@noaa....
Mon Jan 4 19:05:30 CST 2010


Hi folks,

I'm taking a look once again at fromfile() for reading text files. I 
often have the need to read a LOT of numbers from a text file, and it 
can actually be pretty darn slow to do it the normal Python way:

for line in file:
    data = map(float, line.strip().split())


or various other versions that are similar. It really does take longer 
to read the text, split it up, convert to a number, then put that number 
into a numpy array, than it does to simply read it straight into the array.
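
As a concrete reference point, the two approaches can be put side by side. This is only a sketch: the sample values and the temporary file are made up for illustration, and it sticks to whitespace-separated text, which is the case fromfile() already handles. It shows that both routes give the same numbers; it does not measure speed.

```python
import os
import tempfile

import numpy as np

# Made-up sample data: whitespace-separated numbers on several lines.
text = "123 65.6 789\n23 3.2 34\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

try:
    # The "normal Python way": split each line and convert item by item.
    values = []
    with open(path) as infile:
        for line in infile:
            values.extend(float(item) for item in line.split())
    loop_result = np.array(values)

    # Reading straight into an array; a space in sep matches any whitespace,
    # including the newlines between rows.
    direct_result = np.fromfile(path, sep=" ")
finally:
    os.remove(path)
```

Both results are flat arrays of six values; any row structure has to be restored afterwards with reshape().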

However, as it stands, fromfile() turns out to be next to useless for 
anything but whitespace-separated text. Full set of ideas here:

http://projects.scipy.org/numpy/ticket/909

However, for the moment, I'm digging into the code to address a 
particular problem -- reading files like this:

123, 65.6, 789
23,  3.2,  34
...

That is comma (or whatever) separated text -- pretty common stuff.

The problem with the current code is that you can't read more than one 
line at a time with fromfile:

a = np.fromfile(infile, sep=",")

will read until it doesn't find a comma, and thus only one line, as 
there is no comma after each line. As this is a really typical case, I 
think it should be supported.
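
Until something like that is supported, one possible workaround (a sketch, not part of the proposal) is to read the whole file as text, put a comma where each line break was, and parse the result with fromstring(). The sample text and the three-column reshape are assumptions taken from the example above.

```python
import numpy as np

# Sample text matching the example above (in practice: text = open(name).read()).
text = "123, 65.6, 789\n23,  3.2,  34\n"

# Put a comma where each line break was, so the whole text uses one separator.
joined = ",".join(line.strip() for line in text.splitlines() if line.strip())

a = np.fromstring(joined, sep=",")  # flat array of six values
table = a.reshape(-1, 3)            # assumes three values per line
```

This adds an extra pass over the text in Python, so it probably gives up some of the speed that makes fromfile() attractive in the first place.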

Here is the question:

The work of finding the separator is done in:

multiarray/ctors.c:  fromfile_skip_separator()

It looks like it wouldn't be too hard to add some code in there to look 
for a newline, and consider that a valid separator. However, that would 
break backward compatibility. So maybe a flag could be passed in, saying 
you wanted to support newlines. The problem is that flag would have to 
get passed all the way through to this function (and also for fromstring).
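
To make the idea concrete, here is a small Python model of the proposed behaviour. It is not the C code in ctors.c: the function name, its signature, and the newline_ok flag are all invented for illustration, and whitespace handling is left out entirely.

```python
def skip_separator(text, pos, sep=",", newline_ok=False):
    """Return the position just past a separator at text[pos:], or -1.

    Hypothetical model only: newline_ok stands in for the proposed flag.
    """
    if text.startswith(sep, pos):
        return pos + len(sep)
    if newline_ok and text.startswith("\n", pos):
        return pos + 1
    return -1


data = "789\n23"
# Position 3 is the line break after "789".
without_flag = skip_separator(data, 3)                # no comma here: reading stops
with_flag = skip_separator(data, 3, newline_ok=True)  # newline accepted as separator
```

With the flag off, the model reports no separator at the line break, which corresponds to the current behaviour of stopping after one line. With it on, reading would continue at "23".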

I also notice that it supports separators of arbitrary length, though I 
wonder how useful that is. But it also does odd things with spaces 
embedded in the separator:

", $ #" matches all of:  ",$#"   ", $#"  ",$ #"

Is it worth trying to fix that?
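
The matches listed above are consistent with each space in the separator being treated as "zero or more whitespace characters". A rough Python model of that rule, using a regular expression, is sketched below. The helper name is made up, and this is an illustration of the observed behaviour, not a description of how the C code is written.

```python
import re


def separator_pattern(sep):
    """Build a regex in which each run of spaces in sep matches any whitespace.

    Illustrative model only.
    """
    parts = [re.escape(p) for p in sep.split(" ") if p]
    return re.compile(r"\s*".join(parts))


pattern = separator_pattern(", $ #")
candidates = [",$#", ", $#", ",$ #"]
matches = [bool(pattern.fullmatch(c)) for c in candidates]
```

Under this model all three candidates match, because the spaces in the separator are optional rather than literal.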


In the longer term, it would be really nice to support comments as well, 
though that would require more of a refactoring of the code, I think 
(though maybe not -- I suppose a call to fromfile_skip_separator() could 
look for a comment character, then if it found one, skip to where the 
comment ends -- hmmm).
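
The comment-skipping step could be modelled along these lines. This is a hypothetical Python helper, not existing numpy code; the "#" comment character and the assumption that a comment runs to the end of the line are both choices made for the sketch.

```python
def skip_comment(text, pos, comment_char="#"):
    """If a comment starts at pos, return the index of the newline ending it.

    Hypothetical helper; returns pos unchanged when there is no comment.
    """
    if not text.startswith(comment_char, pos):
        return pos
    end = text.find("\n", pos)
    return len(text) if end == -1 else end


data = "789 # last value\n23"
after = skip_comment(data, 4)  # position 4 is the "#"
rest = data[after:]            # what is left to parse after the comment
```

The helper stops at the newline and does not consume it, so the separator logic would still see the line break and decide what to do with it.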

thanks for any feedback,

-Chris







-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov

