[Numpy-tickets] [NumPy] #909: fromstring() / fromfile() Enhancements

NumPy numpy-tickets@scipy....
Wed Sep 10 00:38:13 CDT 2008


#909: fromstring() / fromfile() Enhancements
-------------------------+--------------------------------------------------
 Reporter:  ChrisBarker  |       Owner:  somebody
     Type:  enhancement  |      Status:  new     
 Priority:  normal       |   Milestone:  1.3.0   
Component:  numpy.core   |     Version:  none    
 Severity:  normal       |    Keywords:          
-------------------------+--------------------------------------------------
 == Proposed Enhancements and bug fixes for `fromfile()` and `fromstring()`
 text handling: ==

  === Motivation: ===

 The goal of the `fromfile()` text file handling capability is to enable
 users to write code that can read a lot of numbers from a text file into
 an array. Python provides a lot of nifty text processing capabilities, and
 there are a number of higher level facilities for reading blocks of data
 (including `numpy.loadtxt`). These are very capable, but there really is a
 significant performance hit, at least when loading tens of thousands of
 numbers from a file.

 We don't want to write all of `loadtxt()` and friends in C. Rather, the
 goal is to allow the simple cases to be done very efficiently, and
 hopefully fancier text reading packages can build on it to add more
 features.

 Unfortunately, the current version (numpy 1.2) has a few bugs and
 limitations that keep it from being nearly as useful as it could be.

  === Possible features: ===

  *  Create `fromtextfile()` and `fromtextstring()` functions, distinct from
 `fromfile()` and `fromstring()`. It really is a different functionality.
 `fromfile()` could still call `fromtextfile()` for backward compatibility.

  *  Allow more than one separator, for example a comma or whitespace. In
 the general case, the user could perhaps specify any number of separators,
 though I doubt that would be useful in practice. At the very least,
 however, `fromtextfile()` should support reading files that look like:
 {{{
 43.5, 345.6, 123.456, 234.33
 34.5, 22.57, 2345,  2345, 252
 ...
 }}}
 That is, comma separated, but being able to read multiple lines in one
 shot.

 The easiest way to support that would probably be to always allow
 whitespace as a separator, and add the one passed in. I can't think of a
 reason not to do this, but maybe I'm not very imaginative.
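
 As a rough illustration of the "whitespace plus the separator passed in"
 rule, here is a minimal pure-Python sketch. `parse_numbers` is a
 hypothetical helper, not an existing NumPy function, and it makes one
 assumption the ticket does not settle: runs of separators are collapsed, so
 an empty field such as `1,,2` is silently skipped.

```python
import re

import numpy as np


def parse_numbers(text, sep=","):
    """Hypothetical helper: split on `sep` and on any whitespace."""
    # One delimiter is any run of whitespace and/or the given separator.
    pattern = r"(?:\s|" + re.escape(sep) + r")+"
    tokens = [t for t in re.split(pattern, text) if t]
    return np.array([float(t) for t in tokens], dtype=np.float64)


# The sample data from the ticket, read across both lines in one shot.
text = "43.5, 345.6, 123.456, 234.33\n34.5, 22.57, 2345,  2345, 252\n"
a = parse_numbers(text)
```

 A C implementation would presumably do the same thing while scanning, rather
 than building a token list first.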

  * Allow the user to specify a shape for the output array. There may be
 little point, as all this does is save a call to `reshape()`, but it may be
 another way to support the above, i.e. you could read that data with:

  `a = np.fromtextfile(infile, dtype=np.float, sep=',', shape=(-1, 4))`

  Then it would know to skip the newlines every 4 elements.
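
 For comparison, the effect of a `shape` argument can be sketched in pure
 Python as below. `from_text` is a hypothetical stand-in for the proposed
 `fromtextfile()`; it simply treats newlines as whitespace and reshapes at
 the end, which is an assumption about how the feature might behave rather
 than a description of any existing code.

```python
import numpy as np


def from_text(text, sep=",", shape=None):
    """Hypothetical stand-in for the proposed reader (not a NumPy API)."""
    # `sep` must be a non-empty string; newlines count as whitespace.
    tokens = text.replace(sep, " ").split()
    a = np.array([float(t) for t in tokens], dtype=np.float64)
    # -1 in `shape` lets reshape() infer the number of rows.
    return a if shape is None else a.reshape(shape)


text = "1.0, 2.0, 3.0, 4.0\n5.0, 6.0, 7.0, 8.0\n"
a = from_text(text, sep=",", shape=(-1, 4))
```

 Note that `reshape()` raises `ValueError` when the number of values is not
 a multiple of 4, which may or may not be the behaviour wanted here.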

  *  Allow the user to specify a comment string. The reader would then skip
 everything in the file between the comment string and a newline. The
 newline could perhaps be a universal newline (any of \r, \n or \r\n), or
 the reader could simply expect that the user has opened the file with mode
 'U' if they want that. This could also be extended to support C-style
 comments with an opening and closing character sequence, but that's a lot
 less common.

  *  Allow the user to specify a locale. It may be best to be able to
 specify a locale, rather than relying on the system one (whether '.' or ','
 is the decimal separator, for instance). (ticket #884)

  * Parsing of "Inf" and the like that doesn't depend on the system (ticket
 #510). This would be nice, but maybe too difficult -- would we need to
 write our own `scanf`?
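
 The comment, locale and "Inf" items above could look roughly like the
 following pure-Python sketch. Everything here is hypothetical: `parse_text`
 and its `comments` and `decimal` parameters are illustrative names, and a
 single `decimal` character is a much narrower notion than a full locale.

```python
import numpy as np


def parse_text(text, sep=",", comments="#", decimal="."):
    """Hypothetical reader; all parameter names are illustrative."""
    values = []
    # splitlines() recognises \r, \n and \r\n line endings.
    for line in text.splitlines():
        # Drop everything from the comment string to the end of the line.
        if comments:
            line = line.split(comments, 1)[0]
        for token in line.replace(sep, " ").split():
            # Map the chosen decimal mark onto '.', which float() expects.
            if decimal != ".":
                token = token.replace(decimal, ".")
            # Python's float() parses "Inf", "-inf" and "nan" itself,
            # independently of the platform's C library.
            values.append(float(token))
    return np.array(values, dtype=np.float64)


a = parse_text("1.5, 2.5  # first row\r\nInf, -inf\n# whole-line comment\n3.0\n")
b = parse_text("1,5; 2,5\n3,25", sep=";", decimal=",")
```

 When `decimal` is ',' the field separator obviously has to be something
 else, as in the second call.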


  === Bugs to be fixed: ===

  * `fromfile()` and `fromstring()` handle malformed data poorly: ticket
 #883

  * Any others?
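
 The details of ticket #883 are not repeated here, so the sketch below only
 assumes that the preferred behaviour for malformed data is to raise an
 error that names the offending token, rather than to return a partial
 array. `parse_strict` is a hypothetical helper, not an existing function.

```python
import numpy as np


def parse_strict(text, sep=","):
    """Hypothetical helper: raise on bad input instead of truncating."""
    values = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for token in line.replace(sep, " ").split():
            try:
                values.append(float(token))
            except ValueError:
                # Report where parsing failed so the caller can fix the file.
                raise ValueError(
                    "line %d: cannot parse %r as a number" % (lineno, token)
                ) from None
    return np.array(values, dtype=np.float64)


try:
    parse_strict("1.0, 2.0\n3.0, oops, 5.0\n")
except ValueError as exc:
    message = str(exc)
```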

-- 
Ticket URL: <http://scipy.org/scipy/numpy/ticket/909>
NumPy <http://projects.scipy.org/scipy/numpy>
The fundamental package needed for scientific computing with Python.

