[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Torgil Svensson torgil.svensson@gmail....
Thu Jul 19 06:34:51 CDT 2007


Hi again,

On 7/19/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:

> If memory really is an issue, you have the nice "load_spec" version
> and can always convert the files once by iterating over the file twice
> like the attached script does.

I discovered that my script was broken and too complex. The attached
script is much cleaner and has better error messages.
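For readers without the attachment, here is a minimal sketch of the two-pass idea: one pass over the data to infer a dtype per column, a second pass to convert and build the recarray. The names (infer_dtype, load_two_pass) and the sample data are illustrative, not taken from the attached script, and for brevity the sketch keeps the parsed rows in memory rather than re-reading the file.

```python
import csv
import numpy as N

def infer_dtype(column):
    """Pick the narrowest dtype whose converter accepts every string."""
    for conv, dt in ((int, 'i8'), (float, 'f8')):
        try:
            for v in column:
                conv(v)
            return dt, conv
        except ValueError:
            continue
    # fall back to a fixed-width string column
    return 'U%d' % max(len(v) for v in column), str

def load_two_pass(lines):
    # pass 1: scan everything once to get one (dtype, converter) per column
    header = next(csv.reader([lines[0]]))
    rows = list(csv.reader(lines[1:]))
    specs = [infer_dtype(col) for col in zip(*rows)]
    dtype = [(name, dt) for name, (dt, _) in zip(header, specs)]
    # pass 2: convert row by row and let fromiter build the record array
    converted = (tuple(conv(v) for v, (_, conv) in zip(row, specs))
                 for row in rows)
    return N.fromiter(converted, dtype=dtype)

lines = ["id,price,label", "1,2.5,foo", "2,3.0,barbaz"]
arr = load_two_pass(lines)
```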

Best regards,

//Torgil


On 7/19/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
> Hi,
>
> 1. Your code is fast because you convert whole columns at once in
> numpy. The first step with the lists is also very fast (python
> implements lists as arrays). I like your version; I think it's as fast
> as it gets in pure python, and it has to keep only two copies of the
> data in memory at once (since the string versions can be garbage
> collected).
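A minimal sketch of the lists-then-columns pattern described above, with illustrative data: accumulate rows in Python lists (cheap appends), transpose to columns, and let numpy convert each whole column in a single vectorized call.

```python
import csv
import io
import numpy as N

text = "1,2.5\n3,4.5\n"                 # illustrative data, not from the thread
rows = list(csv.reader(io.StringIO(text)))  # list appends are O(1) amortized
cols = list(zip(*rows))                  # transpose: one string-tuple per column
a = N.array(cols[0]).astype('i8')        # whole-column conversion at once
b = N.array(cols[1]).astype('f8')
```

Once the numpy columns exist, the intermediate string lists can be dropped, which is why only two copies of the data need to coexist.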
>
> If memory really is an issue, you have the nice "load_spec" version
> and can always convert the files once by iterating over the file twice
> like the attached script does.
>
>
> 4. Okay, that makes sense. I was confused by the fact that your
> generated function had the same name as the builtin iter() function.
>
>
> //Torgil
>
>
> On 7/19/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
> >
> >  Hi Torgil,
> >
> >  1. I got an email from Tim about this issue:
> >
> >  "I finally got around to doing some more quantitative comparisons between
> > your code and the more complicated version that I proposed. The idea behind
> > my code was to minimize memory usage -- I figured that keeping the memory
> > usage low would make up for any inefficiencies in the conversion process
> > since it's been my experience that memory bandwidth dominates a lot of
> > numeric problems as problem sizes get reasonably large. I was mostly wrong.
> > While it's true that for very large file sizes I can get my code to
> > outperform yours, in most instances it lags behind. And the range where it
> > does better is a fairly small range right before the machine dies with a
> > memory error. So my conclusion is that the extra hoops my code goes through
> > to avoid allocating extra memory isn't worth it for you to bother with."
> >
> >  The approach in my code is simple and robust to most data issues I could
> > come up with. It will actually do an appropriate conversion if there are
> > missing values or ints and floats in the same column. It will select an
> > appropriate string length as well. It may not be the most memory-efficient
> > setup, but given Tim's comments it is a pretty decent solution for the types
> > of data I have access to.
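A hedged sketch of the kind of per-column detection described above: a column of pure ints stays integer, missing values or a mix of ints and floats promote the column to float (so gaps can become nan), and anything else falls back to a string column wide enough for the longest entry. The function name and the markers in MISSING are assumptions, not taken from Vincent's program.

```python
MISSING = ('', 'NA', '.')   # assumed missing-value markers, for illustration

def detect_column(values):
    """Return (dtype, converter) for one column of strings."""
    present = [v for v in values if v not in MISSING]
    has_missing = len(present) < len(values)
    try:
        for v in present:
            int(v)
        if not has_missing:
            return 'i8', int        # pure ints, nothing missing
    except ValueError:
        pass
    try:
        for v in present:
            float(v)
        # ints mixed with floats, or missing values: promote to float/nan
        return 'f8', lambda v: float('nan') if v in MISSING else float(v)
    except ValueError:
        pass
    # otherwise a string column, wide enough for the longest entry
    return 'U%d' % max(len(v) for v in values), str
```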
> >
> >  2. Fixed the spelling error :)
> >
> >  3. I guess that is the same thing. I am not very familiar with zip, izip,
> > map etc. just yet :) Thanks for the tip!
> >
> >  4. I named the function generated using exec "iter()". I need that
> > function to transform the data using the types provided by the user.
> >
> >  Best,
> >
> >  Vincent
> >
> >
> >  On 7/18/07 7:57 PM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
> >
> >  > Nice,
> >  >
> >  > I haven't gone through all details. That's a nice new "missing"
> >  > feature; maybe all instances where we can't find a conversion should
> >  > be "nan". A few comments:
> >  >
> >  > 1. The "load_search" functions contains all memory/performance
> >  > overhead that we wanted to avoid with the fromiter function. Does this
> >  > mean that you no longer have large text-files that change string
> >  > representation in the columns (aka "0" floats) ?
> >  >
> >  > 2. ident=" "*4
> >  > This has the same spelling error as in my first compile try .. it was
> >  > meant to be "indent"
> >  >
> >  > 3. types = list((i,j) for i, j in zip(varnm, types2))
> >  > Isn't this the same as "types = zip(varnm, types2)" ?
> >  >
> >  > 4.  return N.fromiter(iter(reader),dtype = types)
> >  > Isn't "reader" an iterator already? What does the builtin "iter()" do
> >  > in this case?
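Both questions can be checked directly; a short sketch using the names from the quoted code (the sample data is illustrative):

```python
import csv
import io

varnm = ['a', 'b']
types2 = ['i8', 'f8']
# point 3: the generator expression adds nothing over zip() itself
assert list((i, j) for i, j in zip(varnm, types2)) == list(zip(varnm, types2))

# point 4: a csv.reader is already an iterator, and calling the builtin
# iter() on an iterator returns the very same object, so it is a no-op here
reader = csv.reader(io.StringIO("1,2\n3,4\n"))
assert iter(reader) is reader
```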
> >  >
> >  > Best regards,
> >  >
> >  > //Torgil
> >  >
> >  >
> >  > On 7/18/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
> >  >>
> >  >>  I combined some of the very useful comments/code from Tim and Torgil
> >  >> and came up with the attached program to read csv files and convert
> >  >> the data into a recarray. I couldn't use all of their suggestions
> >  >> because, frankly, I didn't understand all of them :)
> >  >>
> >  >>  The program uses variable names if provided in the csv-file and can
> >  >> auto-detect data types. However, I also wanted to make it easy to
> >  >> specify data types and/or variable names if so desired. Examples are
> >  >> at the bottom of the file. Comments are very welcome.
> >  >>
> >  >>  Thanks,
> >  >>
> >  >>  Vincent
> >  >> _______________________________________________
> >  >> Numpy-discussion mailing list
> >  >> Numpy-discussion@scipy.org
> >  >>
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >  >>
> >  >>
> >  >>
> >  >
> >
> >  --
> >  Vincent R. Nijs
> >  Assistant Professor of Marketing
> >  Kellogg School of Management, Northwestern University
> >  2001 Sheridan Road, Evanston, IL 60208-2001
> >  Phone: +1-847-491-4574 Fax: +1-847-491-2498
> >  E-mail: v-nijs@kellogg.northwestern.edu
> >  Skype: vincentnijs
> >
> >
> >
> >
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix_tricky_columns.py
Type: text/x-python
Size: 2667 bytes
Desc: not available
URL: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20070719/5ebbc4ea/attachment.py


More information about the Numpy-discussion mailing list