[Numpy-discussion] memory-efficient loadtxt

Paul Anton Letnes paul.anton.letnes@gmail....
Wed Oct 3 10:58:42 CDT 2012


On 3 Oct 2012, at 17:48, Wes McKinney wrote:

> On Monday, October 1, 2012, Chris Barker wrote:
> Paul,
> 
> Nice to see someone working on these issues, but:
> 
> I'm not sure what problem you are trying to solve -- accumulating in a
> list is pretty efficient anyway -- there's not a whole lot of overhead.
> 
> But if you do want to improve that, it may be better to change the
> accumulating method, rather than doing the double-read thing. I've
> written, and posted here, code that provides an Accumulator that uses
> numpy internally, so there's not much memory overhead. In the end, it's not
> any faster than accumulating in a list and then converting to an
> array, but it does use less memory.
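
A rough, untested sketch of that idea -- a numpy-backed accumulator that doubles its allocation as rows arrive (this is not Chris's actual posted code, just an illustration):

    import numpy as np

    class Accumulator:
        """Grow a 2-D array by doubling its allocation as rows arrive."""
        def __init__(self, ncols, dtype=float):
            self._data = np.empty((1024, ncols), dtype=dtype)
            self._n = 0

        def append(self, row):
            if self._n == self._data.shape[0]:
                # Out of room: allocate twice as much and copy the rows so far.
                bigger = np.empty((2 * self._data.shape[0], self._data.shape[1]),
                                  dtype=self._data.dtype)
                bigger[:self._n] = self._data
                self._data = bigger
            self._data[self._n] = row
            self._n += 1

        def toarray(self):
            # Copy so the (possibly much larger) internal buffer can be freed.
            return self._data[:self._n].copy()
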
> 
> I also have a Cython version that is not quite done (darn regular job
> getting in the way) that is both faster and more memory efficient.
> 
> Also, frankly, just writing the array pre-allocation and re-sizing
> code into loadtxt would not be a whole lot of code either, and would
> be both fast and memory efficient.
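
Again only a sketch, not the actual loadtxt internals: the same doubling trick written directly into a loadtxt-style reading loop, with the buffer trimmed to the rows actually read at the end (the helper name and defaults here are made up):

    import numpy as np

    def load_ascii(fname, ncols, dtype=float):
        """Hypothetical loadtxt-style loop that pre-allocates and resizes."""
        data = np.empty((1024, ncols), dtype=dtype)
        n = 0
        with open(fname) as fh:
            for line in fh:
                line = line.split('#', 1)[0].strip()   # drop comments and blanks
                if not line:
                    continue
                if n == data.shape[0]:
                    # Double the allocation when we run out of room.
                    data = np.resize(data, (2 * data.shape[0], ncols))
                data[n] = [float(x) for x in line.split()]
                n += 1
        return data[:n].copy()   # trim the unused tail
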
> 
> Let me know if you want any of my code to play with.
> 
> >  However, I got the impression that someone was
> > working on a More Advanced (TM) C-based file reader, which will
> > replace loadtxt;
> 
> yes -- I wonder what happened with that? Anyone?
> 
> -CHB
> 
> 
> 
> > this patch is intended as a useful thing to have
> > while we're waiting for that to appear.
> >
> > The patch passes all tests in the test suite, and documentation for
> > the kwarg has been added. I've modified all tests to include the
> > seekable kwarg, but that was mostly to check that all tests still pass
> > with this kwarg. I guess it's a bit too late for 1.7.0 though?
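
If I understand the approach, the seekable path boils down to something like this untested sketch -- a first pass to count the data rows, a seek back to the start, then a second pass that fills a pre-allocated array (the function name is made up, not the actual patch):

    import numpy as np

    def loadtxt_two_pass(fh, ncols, dtype=float):
        """Hypothetical two-pass read for a seekable file object."""
        # Pass 1: count the rows so we can allocate exactly once.
        start = fh.tell()
        nrows = sum(1 for line in fh if line.split('#', 1)[0].strip())
        fh.seek(start)

        # Pass 2: fill the pre-allocated array.
        data = np.empty((nrows, ncols), dtype=dtype)
        i = 0
        for line in fh:
            line = line.split('#', 1)[0].strip()
            if not line:
                continue
            data[i] = [float(x) for x in line.split()]
            i += 1
        return data
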
> >
> > Should I make a pull request? I'm happy to take any and all
> > suggestions before I do.
> >
> > Cheers
> > Paul
> 
> 
> 
> --
> 
> Christopher Barker, Ph.D.
> Oceanographer
> 
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
> 
> Chris.Barker@noaa.gov
> 
> I've been sporadically building a new, very fast C-based tokenizer/parser for pandas over the last month -- type inference, NA-handling, etc. -- and it's almost ready to ship. It's roughly an order of magnitude faster than loadtxt and uses very little temporary space. It should be easy to push upstream into NumPy to replace the innards of np.loadtxt if I can get a bit of help with the plumbing (it already yields structured arrays in addition to pandas DataFrames, so there isn't a great deal that needs doing).
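
Presumably, for NumPy users the path would look roughly like the existing pandas API below -- read_csv for the parse, then a structured array via to_records(); the exact hooks for the new parser aren't shown in this thread, so treat this as a guess:

    import pandas as pd

    df = pd.read_csv('data.csv')        # the fast C parser does the heavy lifting
    rec = df.to_records(index=False)    # plain numpy structured array
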
> 
> Blog post with CPU and memory benchmarks to follow -- I'll post a link here.
> 
> - Wes


So Chris, it looks like Wes has us beaten in every conceivable way. Hey, that's a good thing :)  I suppose the thing to do now is to make sure Wes's parser passes the loadtxt test suite?

Paul


