[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Vincent Nijs v-nijs@kellogg.northwestern....
Sun Jul 8 18:06:45 CDT 2007


Tim,

I do want to auto-detect. Reading numbers in as floats is probably not a
huge penalty. 
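
A minimal sketch of that float-first idea, assuming the data is already in a
recarray with float fields: after loading, look for float columns whose values
all have no fractional part. The helper name `integer_like_columns` and the
sample values are made up for illustration; this is not code from the attached
scripts.

```python
import numpy as np

def integer_like_columns(ra):
    """Return names of float fields whose values are all finite whole numbers."""
    names = []
    for name in ra.dtype.names:
        col = ra[name]
        is_float = col.dtype.kind == 'f'
        # NaN/inf cannot be stored as int, so such columns stay float.
        if is_float and np.all(np.isfinite(col)) and np.all(col == np.floor(col)):
            names.append(name)
    return names

# Hypothetical sample: col1 is whole-valued, col6 starts with 0 but holds floats.
ra = np.rec.fromrecords([(1.0, 1.12, 0.0), (2.0, 1.33, 1.23)],
                        names='col1,col4,col6')
print(integer_like_columns(ra))
```

Columns reported by such a check are only candidates for an integer type; a
float column that happens to hold whole numbers cannot be told apart from a
true integer column.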

Is there an easy way to change the type of one known column in a recarray?

I tried this:

ra.col1 = ra.col1.astype('i')

but that didn't seem to work. I assume that means you would have to create a
new array from the old one with an updated dtype list.
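
A rough sketch of what building such a new array could look like, on the
assumption that a field's storage type is fixed by the array's dtype, so
assigning into an existing field just casts the values back. The helper
`change_field_type` is a hypothetical name, not an existing numpy function.

```python
import numpy as np

def change_field_type(ra, field, new_type):
    """Return a new recarray in which `field` is stored as `new_type`."""
    # Same dtype list as the old array, except for the one field being changed.
    new_descr = [(name, new_type if name == field else ra.dtype[name])
                 for name in ra.dtype.names]
    out = np.empty(ra.shape, dtype=new_descr)
    for name in ra.dtype.names:
        out[name] = ra[name]  # values are cast to the target field's type
    return out.view(np.recarray)

# Hypothetical sample data: both columns are read in as floats.
ra = np.rec.fromrecords([(1.0, 1.12), (2.0, 1.33)], names='col1,col4')
new = change_field_type(ra, 'col1', 'i')
print(new.dtype)
```

This copies every column once, so for very large files it temporarily needs
memory for both the old and the new array.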

Thanks,

Vincent


On 7/8/07 4:51 PM, "Timothy Hochberg" <tim.hochberg@ieee.org> wrote:

> 
> 
> On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>> Torgil,
>> 
>> The function seems to work well and is slightly faster than your previous
>> version (about 1/6th faster).
>> 
>> Yes, I do have columns that start with what look like ints and then turn
>> out to be floats. Something like below (col6).
>> 
>>     data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
>>             ['1','3','1/97','1.12','2.11','0'],
>>             ['1','2','3/97',' 1.21','3.12','0'],
>>             ['2','1','2/97','1.12','2.11','0'],
>>             ['2','2','4/97','1.33','2.26',' 1.23'],
>>             ['2','2','5/97','1.73','2.42','1.26']]
>> 
>> I think what your function assumes is that the 1st element will be the
>> appropriate type. That may not hold if you have missing values or 'mixed
>> types'.
> 
> 
> Vincent,
> 
> Do you need to auto detect the column types? Things get a lot simpler if you
> have some known schema for each file; then you can simply pass that to some
> reader function. It's also more robust since there's no way in general to
> differentiate a column of integers from a column of floats with no decimal
> part. 
> 
> If you do need to auto detect, one approach would be to always read both
> int-like stuff and float-like stuff in as floats. Then after you get the array
> check over the various columns and if any have no fractional parts, make a new
> array where those columns are integers.
> 
>  -tim
> 
>> Best,
>> 
>> Vincent
>> 
>> 
>> On 7/8/07 3:31 PM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
>> 
>>> > Hi
>>> >
>>> > I stumble on these types of problems from time to time so I'm
>>> > interested in efficient solutions myself.
>>> >
>>> > Do you have a column which starts with something suitable for int on
>>> > the first row (without decimal separator) but has decimals further
>>> > down?
>>> >
>>> > This will be a little tricky to support. One solution could be to yield
>>> > StopIteration, calculate new type-conversion-functions and start over
>>> > iterating over both the old data and the rest of the iterator.
>>> >
>>> > It'd be great if you could try the load_gen_iter.py I've attached to
>>> > my response to Tim.
>>> >
>>> > Best Regards,
>>> >
>>> > //Torgil
>>> >
>>> > On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>>>> >> I am not (yet) very familiar with much of the functionality introduced in
>>>> >> your script Torgil (izip, imap, etc.), but I really appreciate you taking
>>>> >> the time to look at this!
>>>> >> 
>>>> >> The program stopped with the following error:
>>>> >>
>>>> >>   File "load_iter.py", line 48, in <genexpr>
>>>> >>     convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
>>>> >> ValueError: invalid literal for int() with base 10: '2174.875'
>>>> >>
>>>> >> A lot of the data I use can have a column with a set of ints (e.g., 0s),
>>>> >> but then the rest of that same column could be floats. I guess finding the
>>>> >> right conversion function is the tricky part. I was thinking about sampling
>>>> >> every, say, 10th observation to test which function to use. Not sure how
>>>> >> that would work however.
>>>> >>
>>>> >> If I ignore the option of an int (i.e., everything is a float, date, or
>>>> >> string) then your script is about twice as fast as mine!!
>>>> >>
>>>> >> Question: If you do ignore the ints initially, once the rec array is in
>>>> >> memory, would there be a quick way to check if the floats could pass as
>>>> >> ints? This may seem like a backwards approach but it might be 'safer' if
>>>> >> you really want to preserve the ints.
>>>> >>
>>>> >> Thanks again!
>>>> >>
>>>> >> Vincent 
>>>> >>
>>>> >>
>>>> >> On 7/8/07 5:52 AM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
>>>> >>
>>>>> >>> Given that both your script and the mlab version preload the whole
>>>>> >>> file before calling the numpy constructor, I'm curious how that compares
>>>>> >>> in speed to using numpy's fromiter function on your data. Using fromiter
>>>>> >>> should improve on memory usage (~50% ?).
>>>>> >>>
>>>>> >>> The drawback is for string columns, where we no longer know the
>>>>> >>> width of the largest item. I made it fall back to "object" in this
>>>>> >>> case.
>>>>> >>>
>>>>> >>> Attached is a fromiter version of your script. Possible speedups could
>>>>> >>> be done by trying different approaches to the "convert_row" function,
>>>>> >>> for example using "zip" or "enumerate" instead of "izip".
>>>>> >>>
>>>>> >>> Best Regards,
>>>>> >>>
>>>>> >>> //Torgil
>>>>> >>>
>>>>> >>>
>>>>> >>> On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>>>>>> >>>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>>>>>> >>>> the same data.
>>>>>> >>>>
>>>>>> >>>> If I read the code in csv2rec correctly it converts the data as it is being
>>>>>> >>>> read using the csv module. My setup reads the whole dataset into an
>>>>>> >>>> array of strings and then converts the columns as appropriate.
>>>>>> >>>>
>>>>>> >>>> Best, 
>>>>>> >>>>
>>>>>> >>>> Vincent
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> On 7/6/07 8:53 PM, "John Hunter" <jdh2358@gmail.com> wrote:
>>>>>> >>>>
>>>>>>> >>>>> On 7/6/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>>>>>>>> >>>>>> I wrote the attached (small) program to read in a text/csv file with
>>>>>>>> >>>>>> different data types and convert it into a recarray without having to
>>>>>>>> >>>>>> pre-specify the dtypes or variable names. I am just too lazy to type in
>>>>>>>> >>>>>> stuff like that :) The supported types are int, float, dates, and
>>>>>>>> >>>>>> strings.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> It works pretty well but it is not (yet) as fast as I would like, so I was
>>>>>>>> >>>>>> wondering if any of the numpy experts on this list might have some
>>>>>>>> >>>>>> suggestions on how to speed it up. I need to read 500MB-1GB files so
>>>>>>>> >>>>>> speed is important for me.
>>>>>>> >>>>> 
>>>>>>> >>>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>>>>>> >>>>> same.  You may want to compare implementations in case we can
>>>>>>> >>>>> fruitfully cross pollinate them.  In the examples directory, there is an
>>>>>>> >>>>> example script examples/loadrec.py
>>>>>>> >>>>> _______________________________________________
>>>>>>> >>>>> Numpy-discussion mailing list
>>>>>>> >>>>> Numpy-discussion@scipy.org
>>>>>>> >>>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>>>>>> >>>>>
>>>> >>
>>>> >> --
>>>> >> Vincent R. Nijs
>>>> >> Assistant Professor of Marketing
>>>> >> Kellogg School of Management, Northwestern University
>>>> >> 2001 Sheridan Road, Evanston, IL 60208-2001
>>>> >> Phone: +1-847-491-4574 Fax: +1-847-491-2498
>>>> >> E-mail: v-nijs@kellogg.northwestern.edu
>>>> >> Skype: vincentnijs
> 
> 


-- 
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: v-nijs@kellogg.northwestern.edu
Skype: vincentnijs

