[Numpy-discussion] Emulate left outer join?

Robert Kern robert.kern@gmail....
Tue Feb 9 16:02:48 CST 2010


On Tue, Feb 9, 2010 at 15:52, David Carmean <dlc@halibut.com> wrote:
>
> Hi,
>
> I've been working with numpy for less than a month, having learned about
> it after finding matplotlib.  My foundation in things like set theory is...
> weak to nonexistent, so I need a little help mapping sql-like thoughts into
> set-theory thinking :)
>
>
> Some context to help me explain:  I'm trying to store, chart, and analyze
> unix system performance data (sar/sadf output).  On a typical system I have
> about 75 fields/variables, all floats, with identical timestamps... or so
> we hope.   What I want to do in order to save memory/disk space is to stack
> the timeseries data all into three or four different arrays, and use a single
> timestamp field for each set.
>
> My problem is: I don't know that I can guarantee that the shape of all the
> individual arrays will be identical along the time axis.  I may receive
> truncated textfiles to parse, or new variables may appear and disappear from
> the set being reported/recorded.
>
> If these were in flat files or database tables, I'd do a left outer join between
> a master timestamp table and each individual variable's table.   But... I don't
> know the keywords to search for in the numpy docs/web chatter.  A thread from
> just about one year ago left the question hanging:
>
>    http://article.gmane.org/gmane.comp.python.numeric.general/27942
>
> Examples? Pointers?  Shoves toward the correct sections of the docs?

numpy.lib.recfunctions.join_by(key, r1, r2, jointype='leftouter')
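
As a rough illustration of that call (not taken from your data), here is a minimal sketch. The field names 'time', 'cpu' and 'mem' and all the values are made up; the assumption is that one array covers every timestamp and the other comes from a truncated file.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical "complete" series: one sample per timestamp.
cpu = np.array(
    [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)],
    dtype=[('time', 'i8'), ('cpu', 'f8')],
)

# Hypothetical truncated series: timestamps 2 and 4 are missing.
mem = np.array(
    [(1, 0.5), (3, 0.7)],
    dtype=[('time', 'i8'), ('mem', 'f8')],
)

# Left outer join on 'time': every row of `cpu` is kept, and 'mem'
# is masked wherever `mem` has no matching timestamp.
joined = rfn.join_by('time', cpu, mem, jointype='leftouter', usemask=True)

# Boolean mask marking the timestamps for which 'mem' was missing.
missing_mem = np.ma.getmaskarray(joined['mem'])

# Plain float array with NaN in place of the missing samples.
mem_filled = joined['mem'].filled(np.nan)
```

With usemask=True the gaps stay masked rather than being filled with a default value, which is usually what you want before plotting or averaging.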

In [23]: numpy.lib.recfunctions.join_by?
Type:             function
Base Class:       <type 'function'>
Namespace:        Interactive
File:             /Users/rkern/svn/numpy/numpy/lib/recfunctions.py
Definition:       numpy.lib.recfunctions.join_by(key, r1, r2, jointype='inner',
                  r1postfix='1', r2postfix='2', defaults=None, usemask=True,
                  asrecarray=False)
Docstring:
    Join arrays `r1` and `r2` on key `key`.

    The key should be either a string or a sequence of strings corresponding
    to the fields used to join the array.
    An exception is raised if the `key` field cannot be found in the two input
    arrays.
    Neither `r1` nor `r2` should have any duplicates along `key`: the presence
    of duplicates will make the output quite unreliable. Note that duplicates
    are not looked for by the algorithm.

    Parameters
    ----------
    key : {string, sequence}
        A string or a sequence of strings corresponding to the fields used
        for comparison.
    r1, r2 : arrays
        Structured arrays.
    jointype : {'inner', 'outer', 'leftouter'}, optional
        If 'inner', returns the elements common to both r1 and r2.
        If 'outer', returns the common elements as well as the elements of r1
        not in r2 and the elements of r2 not in r1.
        If 'leftouter', returns the common elements and the elements of r1 not
        in r2.
    r1postfix : string, optional
        String appended to the names of the fields of r1 that are present in r2
        but absent from the key.
    r2postfix : string, optional
        String appended to the names of the fields of r2 that are present in r1
        but absent from the key.
    defaults : {dictionary}, optional
        Dictionary mapping field names to the corresponding default values.
    usemask : {True, False}, optional
        Whether to return a MaskedArray (or MaskedRecords if `asrecarray==True`)
        or a ndarray.
    asrecarray : {False, True}, optional
        Whether to return a recarray (or MaskedRecords if `usemask==True`) or
        just a flexible-type ndarray.

    Notes
    -----
    * The output is sorted along the key.
    * A temporary array is formed by dropping the fields not in the key for the
      two arrays and concatenating the result. This array is then sorted, and
      the common entries selected. The output is constructed by filling the
      fields with the selected entries. Matching is not preserved if there are
      some duplicates...
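
Since the docstring says duplicates along the key are not looked for, it may be worth checking for them yourself before joining. A minimal sketch, assuming a single string key; `has_duplicate_keys` is a hypothetical helper, not part of numpy:

```python
import numpy as np

def has_duplicate_keys(arr, key):
    """Return True if any value of field `key` occurs more than once in `arr`.

    Hypothetical helper; only handles a single field name.
    """
    values = arr[key]
    # np.unique drops repeats, so a smaller size means duplicates exist.
    return np.unique(values).size != values.size

# Made-up sample with a repeated timestamp (2 appears twice).
samples = np.array(
    [(1, 0.5), (2, 0.6), (2, 0.7)],
    dtype=[('time', 'i8'), ('value', 'f8')],
)

dup_found = has_duplicate_keys(samples, 'time')
```

If this returns True for either input, deduplicate before calling join_by.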


For some reason, numpy.lib.recfunctions isn't in the documentation
editor. I'm not sure why.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

