[Numpy-discussion] How to implement a 'pivot table?'

Timothy Hochberg tim.hochberg@ieee....
Tue Jul 31 09:40:53 CDT 2007


On 7/30/07, Geoffrey Zhu <zyzhu2000@gmail.com> wrote:
>
> Hi Timothy,
>
> On 7/30/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
> >
> >
> > On 7/30/07, Geoffrey Zhu <zyzhu2000@gmail.com> wrote:
> > > Hi Everyone,
> > >
> > > I am wondering what is the best (and fastest) way to build a pivot
> > > table, aside from the 'brute force' way?
> >
> > What's the brute force way? It's easier to offer an improved
> > suggestion if we know what we're trying to beat.
> >
> > > I want to transform a numpy array into a pivot table. For example, if
> > > I have a numpy array like below:
> > >
> > > Region      Date    # of Units
> > > ----------  ------  ----------
> > > East        1/1     10
> > > East        1/1     20
> > > East        1/2     30
> > > West        1/1     40
> > > West        1/2     50
> > > West        1/2     60
> > >
> > > I want to transform this into the following table, where f() is a
> > > given aggregate function:
> > >
> > >                  Date
> > > Region       1/1          1/2
> > > -----------  -----------  -----------
> > > East         f(10,20)     f(30)
> > > West         f(40)        f(50,60)
> > >
> > >
> > > I can regroup them into 'sets' and do it the brute force way, but that
> > > is kind of slow to execute. Does anyone know a better way?
> >
> > I would use a Python dictionary to assemble lists of values. I would
> > key off (region, date) tuples. In outline:
> >
> > map = {}
> > dates = set()
> > regions = set()
> > for (region, date, units) in data:
> >     dates.add(date)
> >     regions.add(region)
> >     key = (region, date)
> >     if key not in map:
> >         map[key] = []
> >     map[key].append(units)
> >
> > Once you have map, regions and dates, you can trivially make a table
> > as above. The details will depend on what format you want the table
> > to have, but it should be easy to do.
> >
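
For concreteness, here is a minimal sketch of that last step -- turning
map, regions, and dates into a table. The function name, the sorted()
ordering, and the use of None for empty cells are illustrative choices,
not part of the original outline:

def make_table(map, regions, dates, f):
    # Fix an ordering for the axis labels; sorted() is just one choice.
    regions = sorted(regions)
    dates = sorted(dates)
    # One row per region, one column per date; cells with no data
    # become None.
    rows = []
    for region in regions:
        rows.append([f(map[(region, date)]) if (region, date) in map
                     else None for date in dates])
    return regions, dates, rows

With the example data and f=sum, this returns regions ['East', 'West'],
dates ['1/1', '1/2'], and rows [[30, 30], [40, 110]].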


[SNIP]

> The 'brute-force' way is basically what you suggested -- looping
> through all the records and building a two-way hash table of the data.
>
> The problem with the 'brute-force' approach is that it does not take
> advantage of numpy's facilities and can be slow. If only there were
> some built-in mechanism in numpy to handle this.


I'm curious; have you tried this and found it slow, or is this a hunch based
on the reputed slowness of Python? Algorithmically, you can't do much
better: the dictionary and set operations are (amortized) O(1), so the whole
operation is O(N), and you won't do better than that, order-wise. What
you're left with is trying to reduce constant factors.

There are various ways one might go about reducing constant factors, but
they depend on the details of the problem. For example, if the dates are
dense and you are going to parse them anyway, you could replace the hash
table with an array that you index into with the date as an integer. I'm
not sure that you are going to do a lot better than the brute-force
algorithm in the generic case, though.
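
For concreteness, a minimal sketch of that idea. The toy data, the
integer encodings, and the use of summation for f() are all assumptions
for illustration; this only works when f() can be accumulated
incrementally:

import numpy as np

# Toy data matching the example above.
records = [('East', '1/1', 10), ('East', '1/1', 20), ('East', '1/2', 30),
           ('West', '1/1', 40), ('West', '1/2', 50), ('West', '1/2', 60)]

# Assign each region and date a small integer (assumes both are dense).
region_codes = {'East': 0, 'West': 1}
date_codes = {'1/1': 0, '1/2': 1}

# Accumulate directly into a preallocated array instead of a dict of
# lists, avoiding the per-cell Python lists entirely.
table = np.zeros((len(region_codes), len(date_codes)))
for region, date, units in records:
    table[region_codes[region], date_codes[date]] += units

# table is now [[ 30.,  30.], [ 40., 110.]]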

-tim



> The other thing I am not sure about: in your map object above, should I
> append the units value, the row number, or the whole row object (such
> as data[r])?
>
> Thanks,
> Geoffrey



-- 
.  __
.   |-\
.
.  tim.hochberg@ieee.org

