[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Vincent Nijs v-nijs@kellogg.northwestern....
Sun Jul 8 13:15:11 CDT 2007


I am not (yet) very familiar with much of the functionality introduced in
your script Torgil (izip, imap, etc.), but I really appreciate you taking
the time to look at this!

The program stopped with the following error:

  File "load_iter.py", line 48, in <genexpr>
    convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
ValueError: invalid literal for int() with base 10: '2174.875'

A lot of the data I use can have a column that starts with a set of int's
(e.g., 0's), but the rest of that same column could be floats. I guess
finding the right conversion function is the tricky part. I was thinking
about sampling every, say, 10th observation to test which function to use.
I am not sure how that would work, however.
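
A minimal sketch of the sampling idea, assuming each column is available as
a list of strings. The names infer_converter and date_format are
hypothetical and not taken from load_iter.py. It tries int, then float, then
a date parser, and falls back to str:

```python
from datetime import datetime

def infer_converter(values, date_format="%Y-%m-%d"):
    """Return the narrowest converter that accepts every value given."""
    def to_date(s):
        # Assumed date layout; adjust date_format to match the file.
        return datetime.strptime(s, date_format)

    # Candidates ordered from most to least restrictive.
    for convert in (int, float, to_date):
        try:
            for v in values:
                convert(v)
        except ValueError:
            continue  # at least one value was rejected, try the next type
        return convert
    return str  # nothing else fits, keep the raw string

# A column of zeros with a single float at the end.
column = ["0"] * 9 + ["2174.875"]
sampled_guess = infer_converter(column[::10])  # looks at every 10th value only
full_guess = infer_converter(column)           # looks at every value
```

Here the sample contains only "0", so sampled_guess is int while full_guess
is float. A sampled guess can therefore still fail on a later row, so the
conversion step would need to catch the ValueError and promote the column to
float.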

If I ignore the option of an int (i.e., everything is a float, date, or
string) then your script is about twice as fast as mine!!

Question: If you do ignore the int's initially, once the rec array is in
memory, would there be a quick way to check if the floats could pass as
int's? This may seem like a backwards approach but it might be 'safer' if
you really want to preserve the int's.
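
One possible way to do that check, as a sketch only: it assumes the data is
already in a numpy structured array (or recarray), and the function name
downcast_integral_floats is hypothetical. A float column is treated as an
int column when every value is finite and equal to its rounded value:

```python
import numpy as np

def downcast_integral_floats(rec):
    """Copy a structured array, storing whole-number float columns as int64."""
    new_descr = []
    for name in rec.dtype.names:
        col = rec[name]
        is_whole = (
            col.dtype.kind == "f"
            and np.all(np.isfinite(col))      # NaN or inf cannot become an int
            and np.all(col == np.round(col))  # no fractional part anywhere
        )
        new_descr.append((name, np.int64 if is_whole else col.dtype))

    out = np.empty(rec.shape, dtype=new_descr)
    for name in rec.dtype.names:
        # Cast each column to the dtype chosen above.
        out[name] = rec[name].astype(out.dtype[name])
    return out

rec = np.array([(0.0, 1.5), (3.0, 2.0)],
               dtype=[("count", "f8"), ("price", "f8")])
converted = downcast_integral_floats(rec)
```

In this example "count" ends up as int64 and "price" stays float64. The
function makes a full copy, so on a 500MB-1GB dataset it temporarily needs
roughly twice the memory.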

Thanks again!

Vincent


On 7/8/07 5:52 AM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:

> Given that both your script and the mlab version preload the whole
> file before calling the numpy constructor, I'm curious how that compares in
> speed to using numpy's fromiter function on your data. Using fromiter
> should improve on memory usage (~50% ?).
> 
> The drawback is for string columns, where we no longer know the
> width of the largest item. I made it fall-back to "object" in this
> case.
> 
> Attached is a fromiter version of your script. Possible speedups could
> be done by trying different approaches to the "convert_row" function,
> for example using "zip" or "enumerate" instead of "izip".
> 
> Best Regards,
> 
> //Torgil
> 
> 
> On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>> the same data.
>> 
>> If I read the code in csv2rec correctly it converts the data as it is being
>> read using the csv module. My setup reads in the whole dataset into an
>> array of strings and then converts the columns as appropriate.
>> 
>> Best,
>> 
>> Vincent
>> 
>> 
>> On 7/6/07 8:53 PM, "John Hunter" <jdh2358@gmail.com> wrote:
>> 
>>> On 7/6/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>>>> I wrote the attached (small) program to read in a text/csv file with
>>>> different data types and convert it into a recarray without having to
>>>> pre-specify the dtypes or variable names. I am just too lazy to type in
>>>> stuff like that :) The supported types are int, float, dates, and strings.
>>>> 
>>>> It works pretty well but it is not (yet) as fast as I would like, so I was
>>>> wondering if any of the numpy experts on this list might have some suggestions
>>>> on how to speed it up. I need to read 500MB-1GB files so speed is important
>>>> for me.
>>> 
>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>> same.  You may want to compare implementations in case we can
>>> fruitfully cross pollinate them.  In the examples directory, there is an
>>> example script examples/loadrec.py
>>> _______________________________________________
>>> Numpy-discussion mailing list
>>> Numpy-discussion@scipy.org
>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>> 
>> 
>> 

-- 
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: v-nijs@kellogg.northwestern.edu
Skype: vincentnijs




