[SciPy-user] "clustergrams"/hierarchical clustering heat maps

Damian Eads eads@soe.ucsc....
Sun Feb 15 18:22:49 CST 2009


I would like to propose a design for the heatmap function interface. A
heatmap involves two hierarchical clusterings of the same data. Let's
start with the dendrogram function; some of its argument names are
derived from m*tlab, so it is already a tested design, in a sense.

def dendrogram(Z, p=30, truncate_mode=None, color_threshold=None,
               get_leaves=True, orientation='top', labels=None,
               count_sort=False, distance_sort=False, show_leaf_counts=True,
               no_plot=False, no_labels=False, color_list=None,
               leaf_font_size=None, leaf_rotation=None, leaf_label_func=None,
               no_leaves=False, show_contracted=False,
               link_color_func=None)

(I know others have disagreed with the use of capital variable names
in the hierarchy module. I carried them over from m*tlab when I first
wrote it for backwards compatibility purposes. Also, MATLAB often uses
them to denote matrices, as opposed to vectors. I think denoting this
distinction with capitalization is somewhat helpful, but I'm sure
others disagree. I don't want to get into a flame war about this. I
want to talk about heat maps.)

Since Z is typically used to denote a clustering in m*tlab, and there
are two clusterings, we will need two clusterings Z1 and Z2. Z1 will
be along the observation dimension and Z2 along the attribute
dimension.
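
As a rough sketch of what the two clusterings could look like (this is
not part of the proposed interface): assuming the data is an n-by-m
array X with one row per observation, Z1 can be computed by calling
linkage on X and Z2 by calling it on the transpose. The array X, its
random values, and the choice of 'average' linkage are assumptions made
only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: 6 observations (rows) with 4 attributes (columns).
rng = np.random.default_rng(0)
X = rng.random((6, 4))

# Z1 clusters the observations, i.e. the rows of X.
Z1 = linkage(X, method='average')

# Z2 clusters the attributes, i.e. the columns of X (the rows of X.T).
Z2 = linkage(X.T, method='average')

# A linkage matrix over k items has k - 1 rows and 4 columns.
print(Z1.shape)
print(Z2.shape)
```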

The function heatmap will take in parameters for both dendrograms,
which will be suffixed x and y. It is assumed the first dendrogram
will be plotted along the y-axis, either on the left or the right, and
the second one along the x-axis.

def heatmap(Zx, Zy, p1=30, p2=30, color_threshold=None,
            get_leaves=True, orientation='left-down',
            labels1=None, labels2=None,
            count_sortx=False, count_sorty=False,
            distance_sortx=False, distance_sorty=False,
            no_plot=False, no_labelsx=False, no_labelsy=False,
            color_listx=None, color_listy=None,
            leaf_font_sizex=None, leaf_font_sizey=None,
            leaf_rotationx=None, leaf_rotationy=None,
            leaf_label_funcx=None, leaf_label_funcy=None,
            no_leavesx=False, no_leavesy=False,
            link_color_funcx=None, link_color_funcy=None)

The orientation parameter can be any of 'left-down', 'right-down',
'left-up', 'right-up'; the first direction in the string specifies
whether to plot the first dendrogram to the left or the right of the
heat map, and the second specifies whether to plot the second
dendrogram below or above it. All contraction-related parameters have
been removed since they don't really make sense for heatmaps.
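
To make the orientation convention concrete, here is a minimal sketch
of how such a string might be validated and split. The helper name
parse_orientation is hypothetical and is not part of scipy or of the
proposal above; it only illustrates the "side-vertical" naming scheme.

```python
def parse_orientation(orientation):
    """Split an orientation string such as 'left-down' into two parts.

    Hypothetical helper: returns (side, vertical), where side is 'left'
    or 'right' (placement of the first dendrogram) and vertical is 'up'
    or 'down' (placement of the second dendrogram).
    """
    valid = ('left-down', 'right-down', 'left-up', 'right-up')
    if orientation not in valid:
        raise ValueError("orientation must be one of %s" % (valid,))
    side, vertical = orientation.split('-')
    return side, vertical


print(parse_orientation('right-up'))
```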

All data structures returned are 2-element tuples. For example,
instead of returning a single list of leaf node ids as in
`dendrogram`, two are returned, one per clustering (Lx, Ly).
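
As an illustration of the pair of leaf lists, the sketch below obtains
one list per clustering from the existing dendrogram function (with
no_plot=True, so nothing is drawn) and uses them to reorder the rows
and columns of the data. The data, the variable names, and the
'average' linkage are assumptions; heatmap itself does not exist yet.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 5 observations (rows) with 3 attributes (columns).
rng = np.random.default_rng(1)
X = rng.random((5, 3))

Z_obs = linkage(X, method='average')     # clustering of observations
Z_attr = linkage(X.T, method='average')  # clustering of attributes

# no_plot=True computes the dendrogram layout without drawing anything;
# the 'leaves' entry lists the leaf node ids in plotting order.
L_obs = dendrogram(Z_obs, no_plot=True)['leaves']
L_attr = dendrogram(Z_attr, no_plot=True)['leaves']

# The proposed heatmap would return the two leaf lists as one tuple.
leaves = (L_obs, L_attr)

# Reorder so that row i and column j of the image correspond to the
# i'th observation leaf and the j'th attribute leaf.
X_sorted = X[np.ix_(L_obs, L_attr)]
```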

I don't really use heatmaps myself, but I'd be willing to write such a
function, so I'd appreciate it if the end users who want this feature
could give me some feedback on their needs.

Thank you.

Cheers,

Damian

On Sat, Feb 14, 2009 at 10:18 PM, Damian Eads <eads@soe.ucsc.edu> wrote:
> Hi David,
>
> Sorry. I did not see your message until now. Several people have
> already inquired about heatmaps. I've been meaning to eventually
> implement support for them but since I don't work with microarray data
> and I'm in the midst of trying to get a paper out, it has fallen onto
> the back burner. As a first step, I'd need to implement support for
> missing attributes since this seems to be common with microarray data.
>
> As far as I know, a heatmap illustrates clustering along two axes:
> observation vectors and attributes. For example, suppose we're
> clustering patients by their genes. There is one observation vector
> for each patient, and one vector element per gene. Clustering
> observation vectors is the typical case, which is used to identify
> groups of similar patients. Clustering attributes (across observation
> vectors) is less typical but would be used to identify groups of
> similar genes.
>
> The heatmap just illustrates the vectors; the color is the intensity.
> When clustering along a single dimension (observation vectors), no
> sorting is necessary, and a dendrogram is drawn along the vertical
> axis. The i'th row is just the observation vector corresponding to the
> i'th leaf node. No sorting along the attribute dimension is needed.
> Along two dimensions, there is a dendrogram along the horizontal axis.
> Now the attributes must be reordered so that the j'th column
> corresponds to the j'th leaf node.
>
> This is my first time describing heat maps so I apologize if this
> description is terse. Does it make some sense?
>
> As far as how someone implements this, it seems like it'd be pretty
> simple. There is a helper function called _plot_dendrogram that takes
> in a collection of raw dendrogram lines to be rendered on the plot.
> First, plot the heatmap (sorting the attributes so that the columns
> correspond to the ids of the leaf nodes); this can be done with
> imshow. Second, for the first dendrogram, call _plot_dendrogram but
> provide it with a shifting parameter so that the dendrogram lines are
> rendered to the left of the image. Third, call _plot_dendrogram again,
> provide a shifting parameter, but instead shift the lines downward for
> the attribute clustering dendrogram.
>
> I want to get to this soon but no promises. Sorry.
>
> Cheers,
>
> Damian
>
>
> On Mon, Feb 2, 2009 at 11:12 PM, David Warde-Farley <dwf@cs.toronto.edu> wrote:
>> Hi all,
>>
>> I was recently asked to cluster some data and I know from experience
>> that people use these heat maps to look for patterns in multivariate
>> data, often with a dendrogram off to the side. This involves sorting
>> the rows and columns in a certain fashion, the details of which are
>> somewhat fuzzy to me (and, truthfully, I'm happy with it staying that
>> way for now).
>>
>> I notice that dendrogram plotting is available in
>> scipy.cluster.hierarchy, and was wondering if something for
>> producing the associated sorted heat maps is available anywhere
>> (within SciPy or otherwise).
>>
>> Many thanks,
>>
>> David
>>
>
>
>
> --
> -----------------------------------------------------
> Damian Eads                             Ph.D. Student
> Jack Baskin School of Engineering, UCSC        E2-489
> 1156 High Street                 Machine Learning Lab
> Santa Cruz, CA 95064    http://www.soe.ucsc.edu/~eads
>




