[Numpy-discussion] genloadtxt: second serving

Ryan May rmay31@gmail....
Thu Dec 4 14:54:28 CST 2008


Pierre GM wrote:
> All,
> Here's the second round of genloadtxt. That's a tad cleaner version than 
> the previous one, where I tried to take  into account the different 
> comments and suggestions that were posted. So, tabs should be supported 
> and explicit whitespaces are not collapsed.

Looks pretty good, but there's one breakage against what I had working 
with my local copy (with mods).  When I added the filtering of names read 
from the file using usecols, I set a flag and did the filtering later for 
a reason: converters specified by name.  If usecols and converters are 
both specified by name, and the names are read from the file, we get the 
following sequence:

1) Read names
2) Convert usecols names to column numbers.
3) Filter name list using usecols. Indices of names list no longer map 
to column numbers.
4) Change converters from mapping names->funcs to mapping col#->func 
using indices from the (now filtered) names list....OOPS.
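
To make the mismatch concrete, here is a minimal sketch of the four steps 
using plain lists and dicts. The variable names are hypothetical stand-ins 
of mine, not the actual genloadtxt internals, and the last two lines assume 
the converter is applied by column number to the full split row:

```python
# Hypothetical stand-ins for the loader's internal state (not genloadtxt code).
names = ['A', 'B', 'C', 'D']                 # 1) names read from the header line
usecols = ('A', 'C', 'D')
converters = {'C': lambda s: 2 * int(s)}

# 2) Convert usecols names to column numbers.
usecols_idx = [names.index(c) for c in usecols]

# 3) Filter the names list using usecols. Its indices no longer match the file.
filtered = [names[i] for i in usecols_idx]

# 4) Remap converters by looking up indices in the filtered list: wrong column.
wrong = {filtered.index(k): f for k, f in converters.items()}
# Looking up indices in the unfiltered list gives the real column number.
right = {names.index(k): f for k, f in converters.items()}

row = 'aaaa 121 45 9.1'.split()
wrong_value = wrong[1](row[1])   # converter hits column 'B' (121)
right_value = right[2](row[2])   # converter hits column 'C' (45)
```

With the filtered list, 'C' sits at index 1, which is column 'B' in the 
file, so the converter doubles 121 instead of 45.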

It's an admittedly complex combination, but it allows reading text files 
flexibly, since you rely only on field names, not column numbers. 
Here's a test case:

     def test_autonames_usecols_and_converter(self):
         "Tests names and usecols"
         data = StringIO.StringIO('A B C D\n aaaa 121 45 9.1')
         test = loadtxt(data, usecols=('A', 'C', 'D'), names=True,
                        dtype=None, converters={'C': lambda s: 2 * int(s)})
         control = np.array(('aaaa', 90, 9.1),
             dtype=[('A', '|S4'), ('C', int), ('D', float)])
         assert_equal(test, control)

This fails with your current implementation, but works for me when I:

1) Set a flag when reading names from the header line in the file
2) Filter names from the file using usecols (if the flag is true) *after*
remapping the converters.

There may be a better approach, but this is the simplest I've come up 
with so far.
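
As a rough sketch of that ordering (the helper name `resolve_columns` and 
its arguments are hypothetical, made up for illustration, and are not part 
of genloadtxt or my local copy):

```python
def resolve_columns(header_names, usecols, converters, names_from_file=True):
    """Hypothetical helper: turn name-based usecols/converters into column numbers."""
    names = list(header_names)

    # Translate usecols entries given by name into column numbers.
    col_idx = [names.index(c) if isinstance(c, str) else c for c in usecols]

    # Remap converters first, while `names` still lines up with the file's columns.
    by_col = {(names.index(k) if isinstance(k, str) else k): func
              for k, func in converters.items()}

    # Only now filter the names, and only if they came from the header line.
    if names_from_file:
        names = [names[i] for i in col_idx]
    return names, col_idx, by_col


names, col_idx, by_col = resolve_columns(
    ['A', 'B', 'C', 'D'], ('A', 'C', 'D'), {'C': lambda s: 2 * int(s)})

# Apply the converters to one split row, keeping only the requested columns.
row = 'aaaa 121 45 9.1'.split()
values = [by_col[i](row[i]) if i in by_col else row[i] for i in col_idx]
```

Because the converters are remapped before the names are filtered, the 
converter for 'C' ends up keyed on column 2, which is where 'C' really 
lives in the file.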

> FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit 
> comparison: same input, no missing data, one with genloadtxt, one with 
> np.loadtxt and a last one with matplotlib.mlab.csv2rec.
> 
> As you'll see, genloadtxt is roughly twice as slow as np.loadtxt, but 
> twice as fast as csv2rec. One explanation for the slowness is 
> indeed the use of classes for splitting lines and converting values. 
> Instead of a basic function, we use the __call__ method of the class, 
> which itself calls another function depending on the attribute values. 
> I'd like to reduce this overhead, any suggestion is more than welcome, 
> as usual.
> 
> Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in 
> numpy.ma, with an alias recfromcsv for John, using his defaults. Unless 
> somebody comes with a brilliant optimization.

Why only in numpy.ma and not somewhere in core numpy itself (missing 
values aside)?  You have a pretty good masked-array-agnostic wrapper 
that IMO could go in numpy, though maybe not as loadtxt.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

