[Numpy-discussion] Possible roadmap addendum: building better text file readers

Paul Anton Letnes paul.anton.letnes@gmail....
Thu Feb 23 23:46:00 CST 2012


Like others on this list, I've been a bit confused by the proliferation of numpy interfaces for reading text. Would it be an idea to create some sort of object-oriented solution for this purpose?

reader = np.FileReader('my_file.txt')
reader.loadtxt() # for backwards compat.; np.loadtxt could instantiate a reader and call this function if one wants to keep the interface
reader.very_general_and_typically_slow_reading(missing_data=True)
reader.my_files_look_like_this_plz_be_fast(fmt='%20.8e', separator=',', ncol=2)
reader.csv_read() # same as above, but with sensible defaults
reader.lazy_read() # returns a generator/iterator, so you can slice out a small part of a huge array, for instance, even when working with text (yes, inefficient)
reader.convert_line_by_line(myfunc) # line-by-line call myfunc, letting the user somehow convert easily to his/her format of choice: netcdf, hdf5, ... Not fast, but convenient
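
To make the idea above a bit more concrete, here is a rough sketch of how such a reader object could be layered on top of what numpy already has. Only np.loadtxt and np.genfromtxt are real numpy functions; the FileReader class and all of its method names are hypothetical and exist only for illustration, and the slow general-purpose method is renamed read_general here for brevity.

```python
import io

import numpy as np


class FileReader:
    """Hypothetical facade over numpy's text readers (not a numpy API)."""

    def __init__(self, source):
        # `source` is a file name or an open text file-like object.
        self.source = source

    def _rewind(self):
        # Let the same file-like object be read more than once.
        if hasattr(self.source, "seek"):
            self.source.seek(0)

    def loadtxt(self, **kwargs):
        # Backwards-compatible path: delegate straight to np.loadtxt.
        self._rewind()
        return np.loadtxt(self.source, **kwargs)

    def read_general(self, **kwargs):
        # General but typically slower path: np.genfromtxt copes with
        # missing fields (they become nan for float data).
        self._rewind()
        return np.genfromtxt(self.source, **kwargs)

    def csv_read(self, **kwargs):
        # Same as loadtxt, but with a comma as the default separator.
        kwargs.setdefault("delimiter", ",")
        return self.loadtxt(**kwargs)

    def lazy_read(self, delimiter=None):
        # Generator yielding one row at a time as a 1-D float array,
        # so a huge file never has to be held in memory at once.
        self._rewind()
        if hasattr(self.source, "read"):
            lines = self.source
            close_after = False
        else:
            lines = open(self.source)
            close_after = True
        try:
            for line in lines:
                line = line.strip()
                if line:
                    yield np.array([float(tok) for tok in line.split(delimiter)])
        finally:
            if close_after:
                lines.close()


# Usage with an in-memory "file":
reader = FileReader(io.StringIO("1.0,2.0\n3.0,4.0\n"))
table = reader.csv_read()
first_row = next(reader.lazy_read(delimiter=","))
```

Nothing here is faster than the existing functions; the only point is that all the entry points hang off one object, so there is a single place to look.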

Another option is to create a hierarchy of readers implemented as classes. Not sure if the benefits outweigh the disadvantages.
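
A hierarchy could look roughly like the following. This is only a sketch under the assumption that each reader exposes a single read() method; the class names TextReader, FastReader, TolerantReader and CSVReader are made up for this example, and the only real numpy calls are np.loadtxt and np.genfromtxt.

```python
import io
from abc import ABC, abstractmethod

import numpy as np


class TextReader(ABC):
    """Hypothetical base class: one common interface for all readers."""

    def __init__(self, source, delimiter=None):
        self.source = source        # file name or open file-like object
        self.delimiter = delimiter  # None means "split on whitespace"

    @abstractmethod
    def read(self):
        """Return the file contents as a numpy array."""


class FastReader(TextReader):
    # Strict path for clean, fully populated numeric files.
    def read(self):
        return np.loadtxt(self.source, delimiter=self.delimiter)


class TolerantReader(TextReader):
    # Slower path that accepts missing fields.
    def read(self):
        return np.genfromtxt(self.source, delimiter=self.delimiter)


class CSVReader(FastReader):
    # Same as FastReader, with a comma as the sensible default.
    def __init__(self, source):
        super().__init__(source, delimiter=",")


data = CSVReader(io.StringIO("1,2\n3,4\n")).read()
```

The upside is that each file flavour gets its own small class; the downside is more names to learn, which is exactly the clutter this is meant to reduce.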

Just a crazy idea - it would at least gather all the file reading interfaces into one place (or one object hierarchy) so folks know where to look. The whole numpy namespace is a bit cluttered, imho, and for newbies it would be beneficial to use submodules to a greater extent than today - but that's a more long-term discussion.

Paul


On 23. feb. 2012, at 21:08, Travis Oliphant wrote:

> This is actually on my short-list as well --- it just didn't make it to the list. 
> 
> In fact, we have someone starting work on it this week.  It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list. 
> 
> Integration with np.loadtxt is a high priority.  I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy.  I have no interest in making a new one if we can avoid it.  But we do need to make it faster, with less memory overhead, for simple cases like the ones Wes describes.
> 
> -Travis
> 
> 
> 
> On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
> 
>> Hi,
>> 
>> 23.02.2012 20:32, Wes McKinney kirjoitti:
>> [clip]
>>> To be clear: I'm going to do this eventually whether or not it
>>> happens in NumPy because it's an existing problem for heavy
>>> pandas users. I see no reason why the code can't emit structured
>>> arrays, too, so we might as well have a common library component
>>> that I can use in pandas and specialize to the DataFrame internal
>>> structure.
>> 
>> If you do this, one useful aim could be to design the code such that it
>> can be used in loadtxt, at least as a fast path for common cases. I'd
>> really like to avoid increasing the number of APIs for text file loading.
>> 
>> -- 
>> Pauli Virtanen
>> 
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> 


