Hierarchical co-clustering: off-line and incremental approaches

Clustering data is challenging for two main reasons. First, the dimensionality of the data is often very high, which makes cluster interpretation hard; moreover, with high-dimensional data the classic metrics fail to identify the real similarities between objects. Second, the observed phenomena evolve, so datasets accumulate over time. In this paper we propose solutions to both problems. To tackle high dimensionality, we apply a co-clustering approach to the dataset that stores the occurrence of features in the observed objects. Co-clustering computes a partition of the objects and a partition of the features simultaneously. The novelty of our co-clustering solution is that it arranges the clusters hierarchically, producing two hierarchies: one on the objects and one on the features. The two hierarchies are coupled: the clusters at a certain level in one hierarchy are associated with the clusters at the same level of the other hierarchy, and together they form the co-clusters. Each cluster of one hierarchy thus provides insights on the clusters of the other. A further novelty of the proposed solution is that the number of clusters is not bounded a priori. Nevertheless, the produced hierarchies are still compact, and therefore more readable, because our method allows multiple splits of a cluster at the lower level. As regards the second challenge, the accumulating nature of the data makes datasets intractably large over time. In this case, an incremental solution relieves the issue because it partitions the problem. In this paper we introduce an incremental version of our hierarchical co-clustering algorithm. It starts from an intermediate solution computed on the previous version of the data and updates the co-clustering results considering only the newly added block of data.
This solution speeds up the computation with respect to the original approach, which would recompute the result on the overall dataset. In addition, the incremental algorithm yields approximately the same answer as the original version while saving much computational load. We validate the incremental approach on several high-dimensional datasets and perform an accurate comparison with both the original version of our algorithm and state-of-the-art competitors. The obtained results open the way to a novel usage of co-clustering algorithms in which it is advantageous to partition the data into several blocks and process them incrementally, thus "incorporating" data gradually into an on-going co-clustering solution.


Introduction
Clustering is a popular data mining technique that partitions data into groups (called clusters) in such a way that objects inside a group are similar to each other, while objects belonging to different groups are dissimilar [15]. When data are represented in a high-dimensional space, traditional clustering algorithms fail to find an optimal partitioning because of the problem known as the curse of dimensionality. Some distance metrics have been proposed to deal with high-dimensional data (e.g., cosine similarity), and feature selection tries to solve the problem by reducing the number of features [26]. However, novel approaches have emerged in recent years. One of the most appealing approaches is co-clustering [7,16,25], whose solution provides simultaneously a clustering of the objects and a clustering of the features. Co-clustering algorithms are powerful because they exploit similarity measures on the clusters in one dimension of the problem in order to cluster the other dimension: that is, clusters of objects are evaluated by means of the clusters on the features and vice versa. In this way objects are clustered on the basis of a reduced space (the clusters of features) and not on the original features, whose high number is the source of the problem.
One of the classical aims of clustering is to provide a description of the data by means of an abstraction process. In many applications, the end-user studies natural phenomena through the relative proximity relationships existing among the analyzed objects. For instance, he/she compares animals by means of their relative similarity in terms of common features w.r.t. the same referential example. Many hierarchical algorithms have the advantage of producing a dendrogram, which stores the history of the merge (or split) operations between clusters. As a result they produce a hierarchy of clusters, and the relative position of clusters in this hierarchy is meaningful because it implicitly tells the user about the relative similarity between the cluster elements. This hierarchy is often immediately understandable: it constitutes a helpful conceptual tool to understand the inner relationships existing among objects in the domain; it provides a visual representation of the clustering result and explains it. Furthermore, it provides a ready-to-use tool to organize the conceptual domain, to browse and search objects, to discover their common features or differences, etc. It is especially valuable when one cluster hierarchy, built on one dimension of the problem (the objects), gives insights to study the other dimension (the features) and provides information to produce the feature hierarchy. In this paper we propose a co-clustering algorithm for co-occurrence data that simultaneously produces a hierarchical organization in both of the problem dimensions: the objects and the features. In many applications both hierarchies are extremely useful and sought after: in text mining, for instance, documents are organized in categories grouping related documents. The resulting object hierarchy is useful because it gives a meaningful structure to the collection of documents. On the other hand, keywords are organized in groups of synonyms or words with related meaning, and this hierarchy provides a semantic network with meaningful insights into the relationships between keywords. In bioinformatics and other applications, a similar discussion applies: genes or proteins are grouped into clusters sharing a similar behavior, while biological experiments are grouped by the functionalities they involve.
In this paper, we also tackle the problem of managing an evolving collection of objects. As a matter of fact, in common applications, repositories of documents are not static: they evolve over time. New documents may be added to the collection at any moment and may need to be inserted in the hierarchical structure, following the co-clustering schema. The higher the number of documents added to the collection, the more obsolete the clustering structure becomes. The insertion of new documents in the clustering structure is not without consequence for the two hierarchies, which might even need to be completely restructured in order to suitably host the new documents. In this paper we deal with this insertion process, which updates the co-clustering schema and the hierarchies. As re-executing the whole co-clustering process is computationally expensive, a solution may consist in executing this task less frequently than needed. However, if the frequency of object arrival is high, the hierarchies may become inconsistent with the new data and may lead to erroneous data analysis. Thus, in this paper, we propose an incremental approach which updates existing hierarchical co-clustering structures. We show that the results obtained by an incremental processing of portions of the dataset are substantially similar to those obtained by processing the whole dataset at once. Moreover, the incremental approach is significantly less time-consuming than the complete hierarchical co-clustering process.
The key contributions of this paper are the following. We present HiCC, a co-clustering algorithm for co-occurrence data, i.e., datasets in which the objects are represented by the occurrence of features. HiCC simultaneously produces two hierarchies of clusters: one on the objects and one on the features. HiCC employs a cluster association measure that is unusual in co-clustering: Goodman-Kruskal τ. It has been considered a good choice in a comparative study on several evaluation functions for co-clustering [29,28]. As we have shown in [20] and will show here in the experimental section, these hierarchies are meaningful to the end-user and are also valid under the viewpoint of several objective measures. In fact, HiCC is able to produce compact hierarchies because it produces n-ary splits in the hierarchy instead of the usual binary splits. This compactness improves the readability of the hierarchy. One of the interesting novelties of HiCC is that it is parameter-less: the user is not asked to provide the number of clusters for the two dimensions, which is not easy to set and usually requires a pre-processing stage. Last but not least, in this paper we formulate an incremental version of our algorithm which requires less computational time but guarantees almost the same results as the standard version.
The paper is organized as follows: Section 2 presents an example that motivates the choice of Goodman-Kruskal's association measure for co-clustering. We present the core of our approach in Section 3, and its incremental version in Section 4. The related experimental validation is detailed in Section 5. Section 6 analyzes the literature related to co-clustering and to hierarchical models. Finally, Section 7 concludes and presents some future perspectives of our work.

An introduction to Goodman-Kruskal τ
Goodman-Kruskal τ [13] was originally proposed as a measure of association between two categorical variables: it is a measure of the proportional reduction in the prediction error of a dependent variable given information on an independent variable. Figure 1 shows a contingency table that cross-classifies two categorical variables, Salary and Job. Let us take one of the two variables (e.g., Salary) as the dependent variable, whose values should be predicted from the values of the other variable (Job), considered as the independent variable. τ_{Salary|Job} determines the predictive power of the variable Job for the prediction of the variable Salary. The predictive power of Job is computed as a function of the error in the classification of the value of Salary.
The prediction error is first computed when we do not have any knowledge of the values of Job; we denote this error by E_{Salary}. The reduction of this error allowed by Job is obtained by subtracting from E_{Salary} the error in the prediction of Salary that we make when we know the value of Job (the independent variable) in any database observation. We denote this latter error by E_{Salary|Job}. The proportional reduction in the prediction error of Salary given Job, here called τ_{Salary|Job}, is computed by:

$$\tau_{Salary|Job} = \frac{E_{Salary} - E_{Salary|Job}}{E_{Salary}}$$

E_{Salary} and E_{Salary|Job} are computed by a predictor which uses information from the cross-classification frequencies (d_{ij}) and tries to reduce the prediction error as much as possible. In the prediction, it also preserves the dependent variable distribution (the relative frequencies of the predicted categories of Salary) in the following way: when no knowledge is given on Job, the i-th value of the row variable Salary is predicted with relative frequency T_{Salary=i}/T, where T_{Salary=i} denotes the total frequency of observations with the i-th value of Salary and T denotes the total number of observations. On the contrary, when Job is known for an example, the value of Salary is predicted with relative frequency d_{ij}/T_{Job=j}, where T_{Job=j} denotes the total number of observations with the value of Job in the j-th column. Therefore, E_{Salary} and E_{Salary|Job} are determined by:

$$E_{Salary} = \sum_i \frac{(T - T_{Salary=i})\; T_{Salary=i}}{T} \qquad\qquad E_{Salary|Job} = \sum_j \sum_i \frac{(T_{Job=j} - d_{ij})\; d_{ij}}{T_{Job=j}}$$

Analyzing the formulation of τ, we observe that it satisfies many desirable properties of a measure of association. For instance, it is invariant under row and column permutations. Secondly, it takes values in [0,1] (it is 0 iff the two variables are independent). Finally, it has an operational meaning: given an observation, it is the relative reduction in the prediction error of the observation's dependent variable given the knowledge of the observation's independent variable.
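As a concrete check of these formulas, the following short Python sketch computes τ of the row (dependent) variable given the column (independent) variable directly from a cross-classification table. It is an illustrative implementation based on the definitions above, not the authors' code:

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau of the row (dependent) variable given the
    column (independent) variable, from a cross-classification table d_ij.
    Illustrative sketch of the formulas above."""
    T = sum(sum(row) for row in table)
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(table[i]) for i in range(n_rows)]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Prediction error with no knowledge of the independent variable
    e_dep = sum((T - ri) * ri / T for ri in row_tot)
    # Prediction error when the independent variable's value is known
    e_cond = sum((col_tot[j] - table[i][j]) * table[i][j] / col_tot[j]
                 for i in range(n_rows) for j in range(n_cols)
                 if col_tot[j] > 0)
    return (e_dep - e_cond) / e_dep
```

On a table where the two variables are independent the measure is 0, while on a diagonal table (perfect association) it is 1, matching the properties listed above.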
We intend to use τ as a measure of validation of a co-clustering solution that produces an association between the clusters coming from two partitionings: one on the values of the independent variable, and the other on the values of the dependent variable. We start our discussion on co-clustering by considering the contingency table shown in Figure 2.
The contingency table of Figure 2 represents a co-clustering of the original dataset of Figure 1 (Fig. 2: a co-clustering solution). Symbol RC_i represents the i-th cluster on the rows, while CC_j represents the j-th cluster on the columns. T_{RC_i} is the total count of the rows in cluster RC_i, while T_{CC_j} is the total count of the columns in cluster CC_j; T is the global total. In this example we suppose that RC_1 has aggregated the rows of the original table with Salary in {Low, Medium}, while RC_2 has aggregated the rows with Salary = High. Similarly, CC_1 has aggregated the columns with Job in {Clerk, Teacher}, while CC_2 contains the columns with Job in {Manager, Journalist}. Co-cluster (RC_i, CC_j) is represented by the value t_{ij} stored in the cell at the intersection of the i-th row cluster and the j-th column cluster. It is computed by applying an aggregating function to the columns and rows of the original table.
The aggregation that produces co-cluster (RC_i, CC_j) is the following:

$$t_{ij} = \sum_{x \in RC_i} \sum_{y \in CC_j} d_{xy} \qquad (1)$$

where x ranges over the possible values of the variable Salary and y ranges over the possible values of the variable Job. In our proposal the aggregation process is guided by the τ measure. This measure allows us to evaluate, at each step of the algorithm, the association between the two partitions (one over the rows and one over the columns). As previously shown, the τ measure is asymmetric. To take into account the predictive power of both partitions, the process alternates the optimization of the prediction of the row clusters given the column clusters (τ_{RC|CC}) and vice versa (τ_{CC|RC}).
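Equation (1) amounts to summing the original cells inside each block of the co-clustering. A minimal sketch, with cluster memberships given as lists of row/column indices (names are illustrative):

```python
def contingency_table(D, row_clusters, col_clusters):
    """Aggregate the data matrix D into the co-cluster contingency table:
    t_ij is the sum of d_xy over rows x in RC_i and columns y in CC_j."""
    return [[sum(D[x][y] for x in rc for y in cc) for cc in col_clusters]
            for rc in row_clusters]
```

Note that the aggregation preserves the grand total: summing all t_{ij} gives T, the total of the original table.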

Hierarchical Co-Clustering
In this section we present our method, named HiCC (Hierarchical Co-Clustering by n-ary split).Let us introduce the notation.

Notations
Before introducing our approach, we specify the notation used in this paper. Let X = {x_1, ..., x_m} denote a set of m objects (rows) and Y = {y_1, ..., y_n} denote a set of n features (columns). Let D denote an m × n data matrix built over X and Y, each element of D representing a frequency, a count, or binary (presence/absence) information.
Given the above-described matrix D defined on the set of objects X and the set of features Y, the goal of our hierarchical co-clustering algorithm is to find a hierarchy of row clusters R over X and a hierarchy of column clusters C over Y. Supposing that R has K levels and that C has L levels, R and C are defined as R = {R_1, ..., R_K} and C = {C_1, ..., C_L} (where R_1 and C_1 are the clusterings at the roots of the respective hierarchies).
Each R_k ∈ R is a set of disjoint row clusters covering all the rows in X. Formally, R_k = {r_{k1}, ..., r_{k|R_k|}}, where |R_k| is the total number of clusters in R_k, r_{ki} ⊆ X, ∪_i r_{ki} = X, and ∀i, j s.t. i ≠ j, r_{ki} ∩ r_{kj} = ∅. Similarly, C_l = {c_{l1}, ..., c_{l|C_l|}}, and the analogous conditions hold for C_l too. R defines a hierarchy on the row clusters: each R_k must also refine the level above it, i.e., for each r_{(k+1)i} ∈ R_{k+1} there exists r_{kj} ∈ R_k such that r_{(k+1)i} ⊆ r_{kj}. Our approach first computes the first level of the hierarchy (R_1 and C_1). Then, it builds R_2 by optimizing τ_{R_2|C_1}, keeping C_1 fixed. In general, at a generic hierarchy level h, algorithm HiCC alternates the optimization of τ_{R_h|C_{h-1}} and τ_{C_h|R_h}, finding respectively the appropriate row clusters in R_h and the column clusters in C_h, constrained by the clusters most recently formed in the other dimension (respectively, the partitions C_{h-1} and R_h). To compute each level of the hierarchy, we adopt an iterative algorithmic approach inspired by τCoClust [28], whose goal is to find a partition of rows R and a partition of columns C such that Goodman-Kruskal's τ_{C|R} and τ_{R|C} are optimized [29,30]. This choice is motivated by the fact that this algorithm produces good-quality partitions with an arbitrary, non-predefined number of clusters on rows and columns. Before presenting our algorithm in detail, we briefly introduce τCoClust.

Co-clustering with Goodman-Kruskal's τ
Algorithm τCoClust [28] can be formulated as a bi-objective combinatorial optimization problem [22] which aims at optimizing two objective functions based on Goodman-Kruskal's τ measure. The goal of τCoClust is to find a partition C over the feature set Y and a partition R over the object set X such that

$$\max_{p \in P} \; \tau(p) \qquad (2)$$

where P is the discrete set of candidate partitions and τ is a function from P to Z, where Z = [0,1]^2 is the set of vectors z = (z_1, z_2) in the objective space, with z_1 = τ_{R|C} and z_2 = τ_{C|R}.
To compare different candidate partitions, the algorithm implements a Pareto-dominance approach. The goal is to identify optimal partitions P_{opt} ⊂ P, where optimality means that no solution in P \ P_{opt} is superior to any solution of P_{opt} on every objective function. This set of solutions is known as the Pareto optimal set, or non-dominated set. Recently, an extension of this algorithm to multi-view clustering has been shown to converge in a finite number of steps to a Pareto-optimal solution [21]. These concepts are formally defined below.
Definition 1 (Pareto-dominance) An objective vector z ∈ Z is said to dominate an objective vector z′ ∈ Z iff z_1 ≥ z′_1 and z_2 > z′_2, or vice versa. This relation is denoted z ≻ z′ hereafter.
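Definition 1 translates directly into a two-objective comparison. A small sketch:

```python
def dominates(z, z_prime):
    """True iff objective vector z = (tau_R|C, tau_C|R) Pareto-dominates
    z_prime: z is at least as good on one objective and strictly better
    on the other (Definition 1)."""
    (z1, z2), (w1, w2) = z, z_prime
    return (z1 >= w1 and z2 > w2) or (z1 > w1 and z2 >= w2)
```

A vector is non-dominated (Definition 2) precisely when no other candidate vector dominates it; two vectors that trade off the objectives against each other are incomparable.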
Definition 2 (Non-dominated objective vector and Pareto optimal solution) An objective vector z ∈ Z is said to be non-dominated iff there does not exist another objective vector z ′ ∈ Z such that z ′ ≻ z.
A solution p ∈ P is said to be Pareto optimal iff its mapping in the objective space (τ (p)) results in a non-dominated vector.
Thus, the τ CoClust co-clustering problem is to seek for a Pareto optimal solution of equation (2).
τCoClust is shown as Algorithm 1. This algorithm takes as input the original dataset D and the number of iterations N_{iter}. As a first step, it initializes R and C to the discrete partitions over rows and columns, respectively. As a second step, the function ContingencyTable is applied over the dataset D using R and C. This function applies equation (1) over D to obtain the initial contingency table w.r.t. the two given partitions. Then the algorithm alternates the optimization of Goodman-Kruskal's τ over both dimensions. [30] proposes a heuristic with the aim of finding a local optimum of the two coefficients τ_{C|R} and τ_{R|C}. Given two co-clustering matrices T and T′ obtained from the same original data matrix D, the differential of τ_{R|C} between T and T′ is given by:

$$\Delta\tau_{R|C} = \tau_{R|C}(T) - \tau_{R|C}(T')$$

Analogously, Δτ_{C|R} is defined for τ_{C|R}. The algorithm that finds the partitions on the rows or on the columns and that optimizes τ_{R|C} and τ_{C|R} is unique: it is optimizePartition, shown as Algorithm 2. When the partition on the rows is computed, R is modified and C is fixed; when the partition on the columns is computed, it is the reverse. Thus, for the sake of brevity, we introduce two parameters of the algorithm: U for the partition to be modified and V for the fixed partition; when U = R, V = C and vice versa. At the end, among all the possible clusters, it chooses the cluster u_{e_min} that minimizes the difference Δτ_{U|V} under the constraints u_e ≠ u_b and u_e ∈ U ∪ {∅}. The loop from line 4 to line 9 applies the function Δτ_{U|V}(o, u_b, u_e, V) that incorporates object o into cluster u_e and computes the improvement in the objective function τ_{U|V}. In [30] the computation of Δτ_{U|V} is performed in an efficient way, without computing the complete contingency table T′ but only considering the effect of incorporating o into each u_e.
Finally, at line 11 it updates the contingency table T and the row/column cluster partition. To perform this step, it first moves the object o from the cluster u_b (deleting u_b if o was its only member) to the cluster u_{e_min} (creating it if it contains only o). Then it updates the contingency table T accordingly.
Thanks to this strategy, the number of clusters may grow or decrease at each step, and their number depends only on the effective optimization of τ_{U|V}. This makes the approach substantially different from the one described in [7] and from other state-of-the-art co-clustering approaches, where the number of co-clusters is fixed as a user-defined parameter.
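The swap step just described can be illustrated as follows. For clarity, this sketch recomputes τ from scratch for every candidate move instead of using the efficient Δτ computation of [30]; all names are illustrative:

```python
def tau_rows(T):
    """Goodman-Kruskal tau of the row clusters given the column clusters."""
    tot = sum(map(sum, T))
    row_tot = [sum(r) for r in T]
    col_tot = [sum(r[j] for r in T) for j in range(len(T[0]))]
    e_dep = sum((tot - ri) * ri / tot for ri in row_tot)
    e_cond = sum((col_tot[j] - T[i][j]) * T[i][j] / col_tot[j]
                 for i in range(len(T)) for j in range(len(T[0]))
                 if col_tot[j] > 0)
    return (e_dep - e_cond) / e_dep if e_dep > 0 else 0.0

def best_move(D, row_clusters, col_clusters, o):
    """Try moving row object o from its current cluster u_b to every other
    cluster u_e, or to a new singleton cluster, and keep the reassignment
    that maximizes tau_{R|C} (a naive version of the swap step)."""
    def table(rcs):
        return [[sum(D[x][y] for x in rc for y in cc) for cc in col_clusters]
                for rc in rcs if rc]          # drop emptied clusters
    b = next(i for i, rc in enumerate(row_clusters) if o in rc)
    best_tau, best_part = tau_rows(table(row_clusters)), row_clusters
    for e in range(len(row_clusters) + 1):    # last index = fresh singleton
        if e == b:
            continue
        cand = [[x for x in rc if x != o] for rc in row_clusters] + [[]]
        cand[e].append(o)
        cand = [rc for rc in cand if rc]
        t = tau_rows(table(cand))
        if t > best_tau:
            best_tau, best_part = t, cand
    return best_tau, best_part
```

In the toy example used in the test below, moving an object out of a mixed cluster into a fresh singleton is the best move, which also shows how the number of clusters can grow during the search.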

Description of Algorithm HiCC
We can now present the details of our hierarchical co-clustering approach. The initialization procedure is presented in Algorithm 3, while the whole procedure for the construction of the hierarchies (named buildHier) is presented in Algorithm 4. HiCC adopts a divisive strategy. At line 1, HiCC initializes the partitions by calling function τCoClust (see Algorithm 1); thus it builds the first level of both hierarchies. Then it starts building the hierarchies by calling function buildHier. At line 1, buildHier initializes the indices that represent the levels of the two hierarchies. From line 3 to line 15 it builds level k+1 of the hierarchy on the rows; then, from line 16 to line 28, it builds level l+1 of the hierarchy on the columns. We explain in detail only the construction of the hierarchy on the rows, since the construction of the hierarchy on the columns is analogous. At line 4, each row cluster in the partition at level k of the hierarchy is considered. At line 5, each row cluster r_{ki} ∈ R_k is split into a new set of row clusters R′_i using the RandomSplit function. This function first sets the cardinality of R′_i randomly; then it randomly assigns each row in r_{ki} to a cluster of R′_i. Subsequently, it initializes a new contingency table (here denoted by T_{ki}) related to the portion of the contingency table relative to cluster r_{ki} and to the set C_l of column clusters (we keep all the column clusters found at the current level l). Without changing the column partition, the algorithm optimizes τ_{R′_i|C_l} using the optimizePartition function, which returns a new, optimized R′_i. After all r_{ki} have been processed, the algorithm adds the new level of the row hierarchy to R (line 14). Column clusters are then processed in an analogous way, using the row cluster assignment just returned. In this way the two hierarchies grow until the TERMINATION condition is reached. The TERMINATION condition is satisfied when all leaves of the two hierarchies contain only one element. Obviously, a cluster may be split into singletons at a higher level than other clusters. At the end, HiCC returns both the hierarchies over the rows (R) and the columns (C).
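The RandomSplit step described above can be sketched as follows. This is an illustrative version; the actual distribution used by HiCC to pick the split cardinality is not specified here:

```python
import random

def random_split(cluster, rng=random):
    """Split `cluster` into a random number of sub-clusters: first pick the
    cardinality of the split at random, then assign each member to one of
    the sub-clusters at random; emptied sub-clusters are dropped."""
    members = list(cluster)
    if len(members) <= 1:
        return [members]
    k = rng.randint(2, len(members))      # random cardinality of the split
    subs = [[] for _ in range(k)]
    for x in members:
        subs[rng.randrange(k)].append(x)
    return [s for s in subs if s]
```

The subsequent call to optimizePartition then repairs this random split by moving rows between the sub-clusters until τ_{R′_i|C_l} cannot be improved further.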

Local convergence of τ CoClust
As shown in [30,28], the local search strategy employed to update the partitions sometimes leads to some degradation of τ_{R|C} or τ_{C|R}. This is due to the fact that an improvement on one partition may decrease the quality of the other one. However, in [21] it was shown that the solution converges to a Pareto-local optimal solution in a finite number of steps. Considering iterated local search algorithms for multi-objective optimization requires extending the usual definition of local optimum to that context [27]. Let N : P → 2^P be a neighborhood structure that associates to every solution p ∈ P a set of neighboring solutions N(p) ⊆ P. A Pareto local optimum with respect to N is defined as follows.
Definition 3 (Pareto local optimum) A solution p ∈ P is a Pareto local optimum with respect to the set of solutions N(p) that constitutes the neighborhood of p if and only if there is no q ∈ N(p) such that τ(q) ≻ τ(p) (see equation (2)).
Let p = R ∪ C be an element of P. In algorithm τCoClust (see Algorithm 1), the neighborhood N of p is defined as the set of all candidate solutions obtained by moving an element x from cluster r_b to cluster r_e in R, or by moving an element y from cluster c_b to cluster c_e in C. As proved in [21], the following theorem holds: starting from any initial solution, the algorithm reaches a Pareto local optimum with respect to N in a finite number of steps. The complete proof (omitted here) is presented in [21].
In our hierarchical approach (see Algorithm 4), instead of updating one cluster at a time while alternating the dimensions, a whole partition of one of the dimensions is updated, one cluster at a time, while keeping the other partition fixed. Each update is accepted only if the objective function increases. Thus, the algorithm ensures that the objective function increases at each iteration.

Complexity Discussion
The complexity of HiCC is influenced by three factors. The first one is the number of iterations, which influences the convergence of the algorithm; a deep study of the number of iterations is presented in [28]. The second factor is the number of row/column clusters in the optimizePartition function. This number influences the swap step, which tries to optimize the internal objective function by moving one object from a cluster identified by b to another cluster identified by e. The number of possible distinct clusters that e can identify influences the speed of the algorithm, but this number varies during the optimization. The third factor is the depth of the two hierarchies, in particular of the deeper one. This value influences the number of times the main loop of Algorithm 4 is repeated. If we denote by N the number of iterations, by c the mean number of row/column clusters inside the optimization procedure, and by v the mean branching factor, the average complexity of a split of r_{ki} and c_{lj} is O(N × c). A split is performed for each node of the two hierarchies except those at the bottom level. The number of nodes in a tree with branching factor v is $\sum_{i=0}^{levels} v^i$, where levels is the number of levels of the tree. Expanding this summation we obtain $\frac{1 - v^{\,levels+1}}{1 - v}$.
From the previous considerations we can estimate the complexity of our approach. The worst case occurs when every split is binary and at each level c = n, where n is the number of rows/columns. In this case, the complexity is O(N × (n − 1) × n), (n − 1) being the number of internal nodes in the hierarchy. In conclusion, in the worst case the overall complexity is O(N n^2). In the experimental section we show that our algorithm often splits into more than two clusters, thus reducing the number of internal nodes. Moreover, the assumption c = n is very pessimistic. In general, our algorithm runs in time linear in the number of iterations and subquadratic in the number of objects/features. Notice also that in this work the number of iterations is constant, but it could be adapted to the size of the set of objects and then reduced at each split.
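The closed form of the geometric sum used in the node-count estimate can be checked directly (a trivial sketch):

```python
def tree_nodes(v, levels):
    """Nodes in a complete tree with branching factor v and `levels` levels
    below the root: sum_{i=0}^{levels} v**i = (v**(levels+1) - 1) / (v - 1)."""
    return (v ** (levels + 1) - 1) // (v - 1)
```

For the binary worst case, a hierarchy over n leaves has n − 1 internal nodes, which is what yields the O(N n^2) bound above.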

Incremental hierarchical co-clustering
In this section we tackle the problem of dealing with evolving collections of data, i.e., datasets that grow over time. As an example, consider a dataset of documents published on a newsgroup and described by the occurrence of a set of words. A newsgroup is dynamic and is updated frequently. A pre-computed model which takes into account only the first portion of posts (documents) may become obsolete after a certain amount of new documents has been added to the database. Thus, the hierarchical co-clustering model needs to be updated in order to take the new data into account. This operation may be performed by re-launching the hierarchical co-clustering process on the whole updated dataset. However, this may require a high computational time and make the analyst's experience frustrating, since he/she has to wait for the new co-clustering to be computed.
To overcome this problem, we introduce an incremental version of the HiCC algorithm that starts from a previously computed hierarchical co-clustering and updates it. From the results of this paper, we will see that this revised version of HiCC, which we name HiCC_incr, needs fewer computational resources than the traditional version re-launched on the complete dataset. As shown by the experiments in Section 5, computing only the hierarchical structure (the step at line 4 of Algorithm 3) on the whole dataset, once the initial partitions are available, requires significantly less time than also computing the initial (first-level) splitting (which corresponds to the execution of τCoClust at line 2).
As a consequence, we focus our incremental optimization on this first step and leave the remaining part of the algorithm unvaried. This is also motivated by the fact that our algorithm is divisive: variations at the first co-clustering level may make the existing hierarchy inconsistent.
Another important consideration concerns the way our algorithm handles the new incoming objects. New objects may be processed one by one or in batches. The choice depends mainly on the application and the data, and on how well an obsolete hierarchical co-clustering model is tolerated by the analyst's needs.

Incremental version of τ CoClust
In the remainder of this section, we consider a pre-computed hierarchical co-clustering on a first portion of the dataset, named D, and an update portion Δ_D of arbitrary size. We assume that D is built over the object set X and the feature set Y, and Δ_D is built over the object set Δ_X and the feature set Δ_Y. In the applications we consider here, X ∩ Δ_X = ∅, while Y ∩ Δ_Y is not necessarily empty. For instance, our framework can be applied to document repositories that are periodically fed by adding new documents, while the existing ones are not updated. However, new objects may contain new features as well as features that were already present in the initial portion of the dataset. For instance, part of the terms contained in new documents may appear in old ones as well, but the new documents will probably also contain a certain number of new words. We believe that this assumption is quite general and allows us to cover many existing application schemas typical of online evolving data repositories.
We formulate our incremental approach as follows. Starting from a previous co-clustering schema (R, C) and the update Δ_D, we generate a new result (Δ_R, Δ_C) that incorporates the information carried by Δ_D. The overall incremental hierarchical co-clustering approach is described by Algorithm 6. It takes as input the original dataset D, the update dataset Δ_D, and the first level of the co-clustering over D (given by R and C). It provides a hierarchical co-clustering (R, C) on D ∪ Δ_D. The algorithm is similar to the non-incremental version (see Algorithm 4), except that, instead of partitioning the whole dataset from scratch, it employs the pre-computed co-clustering (R, C) by calling τCoClust_incr (Algorithm 5). Algorithm 5 describes the procedure that extends τCoClust (i.e., Algorithm 1) to the incremental setting.
The new procedure takes as input the two dataset portions D and Δ_D, the previous co-clustering result over D, i.e., (R, C), and a number of iterations. At line 1 of Algorithm 5, the row set Δ_X of Δ_D is initialized by assigning each row to a singleton cluster. The columns in Δ_Y ∩ Y are assigned to the corresponding clusters of C, while each column in Δ_Y \ Y is assigned to a singleton cluster. Finally, the columns in Y \ Δ_Y are zero-filled and keep their assignment to the corresponding clusters of C. We call the resulting partitionings Δ_R and Δ_C, and they are merged into R and C (lines 2 and 3). Given the so-computed R and C, the algorithm initializes the contingency table T over D = D ∪ Δ_D (line 5). In other words, after these preliminary steps, the contingency table T contains as many row clusters as R ∪ Δ_R and as many column clusters as C ∪ Δ_C. Finally (lines 6-10), the algorithm alternately optimizes R and C, following the updating criteria given in Section 3 and described by function optimizePartition (see Algorithm 2). At the end, the original R and C partitions have been rewritten and constitute the final optimized output (line 11). In the next section, we provide a comprehensive experimental analysis showing that the incremental version provides good hierarchical co-clustering results while requiring significantly less computational time.
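The initialization step just described (new rows as singletons; a new column joining an existing cluster if the feature was already seen, otherwise becoming a singleton) can be sketched as follows. The function name and the `col_of` mapping are illustrative assumptions, not the paper's code:

```python
def incremental_init(R, C, new_rows, new_cols, col_of):
    """Extend a previous co-clustering (R, C) with the rows/columns of an
    update block.  Every new row starts in its own singleton cluster; a new
    column joins the cluster of C it already belongs to (col_of returns its
    cluster index, or None for a never-seen feature, which then becomes a
    singleton)."""
    R_new = [list(r) for r in R] + [[x] for x in new_rows]
    C_new = [list(c) for c in C]
    for y in new_cols:
        j = col_of(y)
        if j is None:
            C_new.append([y])              # brand-new feature: singleton
        elif y not in C_new[j]:
            C_new[j].append(y)             # shared feature: keep old cluster
    return R_new, C_new
```

The resulting partitions are then used to build the contingency table over D ∪ Δ_D, after which the usual alternating optimization takes over.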

Experimental validation
In this section we report on several experiments performed on real, high-dimensional, multi-class datasets. We compare our approach with Information-Theoretic Co-Clustering (ITCC) [7], a well-known co-clustering algorithm which minimizes the loss in mutual information. To evaluate our results, we use several objective performance parameters that measure the quality of clustering. Besides the precision of the hierarchical co-clustering, we also analyze the hierarchies returned by our approach for both rows and columns. This analysis allows us to emphasize the utility of the hierarchical structure w.r.t. standard flat co-clustering approaches. Furthermore, we provide an exhaustive set of experiments to point out the differences in computational time and quality of results between the off-line approach and the incremental one. All experiments are performed on a PC with a 2.6GHz Opteron processor and 4GB RAM, running Linux.
The section is organized as follows. We first introduce the datasets and the quality indices adopted in our experimental study. Then, we show and analyze some examples of co-cluster hierarchies that HiCC is able to provide and compare them with the partitions found by ITCC. The stability of HiCC is then assessed both in terms of average depth of the hierarchies and of their content. Finally, the incremental version of our algorithm is evaluated by considering it in both flat and hierarchical co-clustering settings. The stability of the results is studied in this case as well.

Datasets for the evaluation of HiCC and its results
To evaluate our results we use some of the datasets described in [10]. In particular we use:
- oh0, oh15: two samples from the OHSUMED dataset. OHSUMED is a clinically-oriented MEDLINE subset of abstracts or titles from 270 medical journals over a five-year period (1987-1991).
All datasets have more than five classes, which is usually a hard context for text categorization. The characteristics of the datasets are shown in Table 1.

External Evaluation Measures
We evaluate the algorithm performance using two external validation indices. We denote by C = {C_1, ..., C_J} the partition built by the clustering algorithm on the objects at a particular level, and by P = {P_1, ..., P_I} the partition inferred by the original classification. J and I are respectively the number of clusters (|C|) and the number of classes (|P|). We denote by n the total number of objects.
The first index is the Normalized Mutual Information (NMI). NMI provides an evaluation that is impartial with respect to the number of clusters [33]. It measures how much the clustering results share information with the true class assignment. NMI is computed as the average mutual information between every pair of clusters and classes:

NMI(C, P) = \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} x_{ij} \log \frac{n \, x_{ij}}{x_i x_j}}{\sqrt{\left( \sum_{j=1}^{J} x_j \log \frac{x_j}{n} \right) \left( \sum_{i=1}^{I} x_i \log \frac{x_i}{n} \right)}}

where x_{ij} is the cardinality of the set of objects that occur both in cluster C_j and in class P_i; x_j is the number of objects in cluster C_j; x_i is the number of objects in class P_i. Its values range between 0 and 1.
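For concreteness, the index can be computed directly from the cluster-class contingency counts x_ij. The sketch below follows the definition above (natural logarithms, entropies as normalizer); it is an illustration, not the authors' code:

```python
import math

def nmi(x):
    """Normalized Mutual Information from the I x J contingency matrix x,
    where x[i][j] counts objects in class P_i and cluster C_j."""
    n = sum(sum(row) for row in x)
    xi = [sum(row) for row in x]            # objects per class P_i
    xj = [sum(col) for col in zip(*x)]      # objects per cluster C_j
    # mutual information between the two partitions
    mi = sum(x[i][j] * math.log(n * x[i][j] / (xi[i] * xj[j]))
             for i in range(len(xi)) for j in range(len(xj)) if x[i][j] > 0)
    hi = -sum(c * math.log(c / n) for c in xi if c > 0)  # class entropy
    hj = -sum(c * math.log(c / n) for c in xj if c > 0)  # cluster entropy
    return mi / math.sqrt(hi * hj) if hi > 0 and hj > 0 else 0.0
```

A perfect clustering (one cluster per class) yields NMI = 1, while a clustering carrying no class information yields NMI = 0.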
The second measure is the adjusted Rand index [19]. Let a be the number of object pairs belonging to the same cluster in C and to the same class in P. This metric captures the deviation of a from its expected value, i.e., the hypothetical value of a obtained when C and P are two random, independent partitions. The expected value of a, denoted by E[a], is computed as follows:

E[a] = \frac{\pi(C) \cdot \pi(P)}{n(n-1)/2}

where π(C) and π(P) denote respectively the number of object pairs from the same clusters in C and from the same classes in P. The maximum value for a is known to be:

\max(a) = \frac{1}{2} \left( \pi(C) + \pi(P) \right)

The agreement between C and P can then be estimated by the adjusted Rand index as follows:

AR(C, P) = \frac{a - E[a]}{\max(a) - E[a]}

Notice that this index can take negative values, and that AR(C, P) = 1 indicates identical partitions.
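The definitions above translate directly into code. The following sketch (an illustration under the definitions just given, not the authors' implementation) computes the index from two label assignments by enumerating object pairs:

```python
from itertools import combinations

def adjusted_rand(labels_c, labels_p):
    """Adjusted Rand index between cluster labels and class labels."""
    n = len(labels_c)
    pairs = list(combinations(range(n), 2))
    # a: pairs placed together both in C and in P
    a = sum(1 for i, j in pairs
            if labels_c[i] == labels_c[j] and labels_p[i] == labels_p[j])
    pi_c = sum(1 for i, j in pairs if labels_c[i] == labels_c[j])
    pi_p = sum(1 for i, j in pairs if labels_p[i] == labels_p[j])
    expected = pi_c * pi_p / len(pairs)   # E[a] under independence
    max_a = (pi_c + pi_p) / 2             # maximum attainable a
    return (a - expected) / (max_a - expected) if max_a != expected else 1.0
```

The pairwise enumeration is O(n^2) and is meant only to mirror the formulas; practical implementations work from the contingency table instead.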
To obtain a quality measure of HiCC, for each level i of the hierarchy on the rows we select the corresponding level of the hierarchy on the columns. These levels define a pair of partitions: the first partition comes from the row cluster hierarchy, and the second one from the column cluster hierarchy. On the pair of partitions from level i, an evaluation function EF_i is computed on the basis of Goodman-Kruskal τ_S [14], a symmetrical version of τ [13] whose aim is to quantify the agreement between two partitions. In order to compute an overall measure for the co-clustering we compute the following weighted mean:

Goodness = \frac{\sum_i \alpha_i \, EF_i}{\sum_i \alpha_i}    (3)

where α_i is the weight associated to the i-th level of the hierarchy, and allows us to specify the significance assigned to the i-th level w.r.t. the other levels of the hierarchy. Of course, EF_i might refer to any other validation index presented beforehand.
Equation (3) is a general formula for evaluating the goodness of a run of our method. In this work we set α_i = 1/i, thus giving the heaviest weight to the top level and the lowest weight to the last level. This choice is motivated by the observation that, in a hierarchical solution, the clusters at one level depend on the clusters at the previous level. If we start with a good clustering, then the next levels are more likely to produce good results too.
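Under the weighting α_i = 1/i, the Goodness measure of Equation (3) reduces to a few lines. The helper below is a hypothetical illustration of the computation:

```python
def goodness(ef_values):
    """Weighted mean of per-level evaluation scores EF_i with weights
    alpha_i = 1/i; ef_values[0] is the score of the top level."""
    weights = [1.0 / (i + 1) for i in range(len(ef_values))]
    return sum(w * e for w, e in zip(weights, ef_values)) / sum(weights)
```

As intended, a good top level dominates the measure: goodness([1.0, 0.0]) is larger than goodness([0.0, 1.0]).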

Inspecting hierarchies
In this section we analyze in a qualitative way two hierarchies built by our algorithm. In particular, we analyze the row hierarchy for the oh0 data and the column hierarchy for the re1 data. We chose these two datasets because oh0 has an understandable class description and re1 uses a vocabulary (feature space) with common words. In Table 2 we report the first three levels of the row hierarchy produced by HiCC. To obtain this hierarchy we assigned a label to each cluster, choosing the label of the majority class in the cluster. Each column in Table 2 represents a level of the hierarchy and each cell of the table represents a single cluster. We notice that the produced hierarchy is meaningful to a domain expert. In fact, the first branch is homogeneous, since each node contains a majority of elements coming from the Aluminum class.
The Enzyme-Activation cluster at the first level is split into two clusters: the first one keeps the same label; the second is Cell-movement, a topic more related to Enzyme-Activation than to the other cluster, labelled Staphylococcal-Infections.
Then Cell-movement is split into two clusters. One of these is Leucine, a fundamental amino acid of protein cells. In the last branch of the hierarchy we notice that the Uremia cluster is a child of Staphylococcal-Infections (uremia is a renal failure with bacterial infection as one of its causes).
In Table 3 we report instead the first two levels of the column hierarchy produced by HiCC. In order to assign a label to the clusters, we computed for each cluster the mutual information between each word and the cluster; then we ranked the set of words for each cluster, as described in [32]. Finally, we selected the top-10 words for each cluster and used them in Table 3 to describe the clusters. In this way we identified the words whose meaning was almost exclusively connected to each cluster. Again, we can notice that the generated column hierarchy is meaningful. At the first level of the hierarchy each of the six clusters is about a well-distinct topic: the first one is about coffee and cocoa; the second one is about agriculture; the third and the fourth ones are about the Gulf war and oil economy; finally, the fifth and the sixth clusters are about syndicate and work, and Aegean politics. The second level introduces a further specialization of the top-level clusters. For instance, the first cluster is split into coffee and cocoa production and agricultural economics. The fourth one is split into oil industry and finance.

HiCC vs Partitional Co-Clustering
Here we evaluate HiCC performance w.r.t. ITCC [7]. HiCC is non-deterministic, like other well-known clustering algorithms such as K-means or ITCC itself: at each run we may obtain similar, but not identical, hierarchies. For this reason we run HiCC 50 times over each dataset. We set the number of iterations for the first level equal to 10 times the maximum between the number of instances and the number of attributes. From the second level of the hierarchy on (where the optimization is independent for each dimension), we run the process until these two conditions are satisfied: a) the total number of iterations is more than 50,000 and b) the number of iterations which end with no changes in the cluster structure is smaller than the number of objects/attributes. These two conditions are standard criteria adopted in partitional clustering algorithms. From the set of row/column co-hierarchies obtained in the different runs, we choose the ones that best optimize an internal evaluation function.
To compare each level of our hierarchies with ITCC results, we need to fix a number of row/column clusters to set the ITCC parameters. We recall that ITCC is flat and does not produce hierarchies. For this reason we plan our experiments in the following way. Since HiCC is not deterministic, each run may produce partitions of different cardinality at each level; we therefore need to select one specific run of HiCC. Using the Goodness function with τ_S as evaluation function, we choose for each dataset the hierarchy with the highest Goodness value. This hierarchy is a representative solution whose selection is not biased by the external index (based on the classes) that will be used for the final comparison.
From this hierarchy, we obtain a set of pairs (#numberRowCluster, #numberColCluster), where each pair specifies the number of clusters in a level of the hierarchy. For each of these combinations, we run ITCC 50 times with (#numberRowCluster, #numberColCluster) as parameters, and average the obtained results for each level.
In Table 4 we show the experimental results. To obtain a single index value for each dataset we compute the previously proposed Goodness, using as evaluation function (EF_i) each of the two external validation indices. These two indices are computed between the partition of the objects given by the clusters and the partition of the objects given by the classes. From the results we can see that our approach is competitive w.r.t. ITCC. Notice, however, that for a given number of desired co-clusters ITCC tries to optimize its objective function globally, and each time it starts from scratch. Thus, one would expect ITCC to provide better results for each pair of cluster numbers.
On the contrary, the HiCC solution at level i of the hierarchy is constrained by the clusters found at level i − 1 of the same hierarchy. Thus one would expect this behaviour to propagate errors level by level. Instead, we can notice that ITCC is not more accurate than our algorithm. This phenomenon has been recently observed in [6], where a hierarchical model selection is used to help improve results in prediction tasks. To clarify this point we report the complete behavior of the two algorithms on an example. In Table 5 we report the value of the two indices for each level of the hierarchy obtained by HiCC for re1. For the sake of brevity, we show here only one example, but in all the other experiments we observed the same trend. In the same table we also show the values obtained by ITCC using the same numbers of clusters (#numberRowCluster, #numberColCluster) discovered by HiCC. We also report the standard deviation for ITCC, since for each pair of cluster numbers we ran it 50 times. We can see that HiCC outperforms ITCC, especially at the higher levels (first, second and third) of the row hierarchy. We notice also that the NMI index always increases monotonically in HiCC, but not in ITCC. This experiment shows that, as we explore deeper levels of the HiCC hierarchies, the confusion inside each cluster decreases.

Stability of the induced Hierarchies
The intrinsic nature of HiCC is non-deterministic. As such, two instances of the algorithm processing the same dataset may provide two different results. Here, we measure the stability of our approach in terms of quality and depth of the hierarchies, and of their structure. In Table 6 we report the average Goodness of τ_S for each dataset, and the related standard deviation, computed over one, five and all hierarchy levels respectively. We observe that the standard deviation is very low w.r.t. the average Goodness. From this empirical evaluation we can conclude that the quality of the hierarchical co-clustering is quite stable. In Table 7 we show the mean depth of the row hierarchy and of the column hierarchy for each dataset. We observe that the standard deviation is low; this points out that our algorithm is stable from this point of view as well. Notice that HiCC generates hierarchies which are not deep, if the number of levels is compared with the cardinality of the object and attribute sets. Shorter hierarchies are preferable to hierarchies obtained only by binary splits, since they allow a compact representation of the data and improve the exploration of the results, because they are easy to browse from a user's point of view.
In order to evaluate the stability of the produced hierarchies from a different viewpoint, we adapted the strategy for summarizing and indexing hierarchies presented in [34,24], namely Concept Propagation/Concept Vector (CP/CV). CP/CV assumes that each cluster node of the hierarchy is projected into a concept space and represented by a vector with as many dimensions as the cardinality of the concept space. In our context the concept space is given by the class labels. The CP/CV approach combines the concept space representation with the structure of the hierarchy. It adopts a process that repeatedly propagates information from the parent nodes to their children and vice versa, thus taking into account the tree structure together with its content. The authors suggest that, after a sufficient number of propagation steps, a good summary of the whole concept hierarchy is obtained in the root vector. They show that, using only the root node vector as a candidate summary, they are able to perform a high-quality indexing of the hierarchy. We adopt this technique to summarize the content of the hierarchies and quantify their stability in terms of the class concept space. We represent each node of the hierarchy by a vector having as many components as the number of classes. Each vector contains the distribution of the object classes within the related node. Thanks to the CP/CV method, we obtain for each result of HiCC one vector summarizing the whole hierarchy. To quantify the stability of our approach, we employ a simple strategy: we compute the distance between every pair of summary vectors obtained by different runs of HiCC, and then compute the average and the standard deviation of these distances. Here, we use the cosine distance:

d_{cos}(X, Y) = 1 - \frac{X \cdot Y}{\|X\| \, \|Y\|}

where X and Y are two vectors and ‖·‖ denotes the Euclidean norm of a vector.
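The distance used above to compare two summary (root) vectors can be sketched in a few lines:

```python
import math

def cosine_distance(x, y):
    """Cosine distance between two summary vectors: 1 minus the cosine of
    the angle between them (0 = same direction, 1 = orthogonal vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)
```

Two hierarchies whose root vectors point in the same direction (i.e., the same class distribution up to scale) are at distance 0.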
For each dataset, we run the algorithm 50 times. The average and standard deviation of the cosine distances are computed over these 50 instances. We report the results in Table 8.
We observe that the average distance is very low. Even though the standard deviations are as high as the averages, the order of magnitude is still very low. This means that the stochastic optimization process employed by our approach does not affect the stability of the final result.

Evaluation of τCoClust_incr
In this section we evaluate the behavior of τCoClust_incr (see Algorithm 5 in Section 4). Before introducing this experimental study, we further motivate the incremental version by providing a brief report on the time performance of HiCC. In Table 9 we show the average time (with standard deviation) for the two components of the algorithm. We observe that the step which requires most of the computational time is the first one (τCoClust): in general, the time employed by the first step is one order of magnitude greater than the time employed by the second step (buildHier). This observation motivated the incremental version presented in this work (see Section 4 for the details). We now show that the incremental approach helps to decrease the computational time spent by τCoClust. This improvement speeds up the whole procedure and enables the use of HiCC in an incremental/on-line scenario.
To this purpose, using all the datasets described beforehand, we simulate a simple incremental scenario. We divide each dataset into two blocks. The first block is supplied to τCoClust to obtain a first bi-partition; τCoClust_incr then starts from this result and updates the bi-partition using the second block. The goal of this first group of experiments is to show that the results provided by τCoClust_incr are not very dissimilar from those provided by the off-line version (see Algorithm 1 in Section 3). We vary the size of the first block from 5% to 90% of the entire dataset (with a step of 5%). To evaluate the results we set up two types of comparison: a) using the partition induced by the class variable as reference; b) using the partition computed by the off-line version as reference. We average the results over 50 runs of both algorithms. Here, we use the same set of features for both blocks: if some features are not represented in the first block, they simply contain zeros for all the rows. This choice does not influence the final results, since columns containing only zeros have no impact on the computation of the objective function.
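The sweep over initial block sizes can be sketched as follows. `block_splits` is a hypothetical helper; rows are assumed to share one feature space, so features unseen in the first block simply yield all-zero columns there, as noted above:

```python
def block_splits(matrix, step=0.05, lo=0.05, hi=0.90):
    """Yield (fraction, first_block, second_block) for initial-block sizes
    from 5% to 90% of the dataset in 5% steps. The matrix is a list of
    rows over a shared feature space."""
    n = len(matrix)
    frac = lo
    while frac <= hi + 1e-9:          # tolerance for float accumulation
        cut = max(1, int(round(n * frac)))
        yield frac, matrix[:cut], matrix[cut:]
        frac += step
```

Each pair (first_block, second_block) would then be fed to the off-line algorithm and to its incremental counterpart, respectively.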
In Figure 3 we plot the time performance of the incremental version. The X axis represents the size of the first block, as a percentage. We report the sum of the time spent processing the first block and the time spent processing the second block.
The vertical segments (box plots) represent the average time spent to process the different blocks. The horizontal line represents the average time spent by the off-line version of the algorithm. We observe that, in general, the time required by the incremental version is always smaller than the time spent by the off-line version. Notice that all the curves share a similar general trend. For very small sizes of the initial block, the time employed by τCoClust_incr is similar to that taken by τCoClust: the rationale is that the initial sample is not very representative of the entire dataset, and thus the arrival of a new big block triggers a whole re-computation of the co-clustering. Similarly, large sizes of the initial block require almost the same time needed by the computation over the entire dataset, so the computational time of the two versions is, again, similar. In the middle range, for initial block sizes of 20% to 40%, the incremental algorithm takes from 2 up to 10 times less than the off-line algorithm to complete the co-clustering. The only exception is tr21. This may be partly due to the fact that this dataset contains very few documents (about 300), but a huge number of features (about 8000).
We now consider the quality of the results provided by τCoClust_incr and compare them with the results provided by τCoClust. The first comparison takes into consideration the NMI and ARI indices computed w.r.t. the class variable (see Figures 4 and 5). We observe that the results are always close to those provided by the off-line version (the horizontal line in the plots). With an initial block size of 20% to 30%, the performance is already reasonably good and the computational time is kept at its lowest level (see Figure 3). However, these experiments only show that the agreement between the discovered partitions and the class partitions is maintained.
To evaluate the distortion introduced by the incremental partitioning, we measure the NMI and ARI indices w.r.t. the original partitioning. In Figures 6 and 7 we plot the average values of the two indices together with the standard deviation. The average values are computed by pairing each of the 50 runs of the incremental algorithm with each of the 50 runs of the off-line version. The horizontal line represents the average of the index values computed by the off-line version only. First, notice that all the index values are significantly high. In general, the NMI and ARI computed for the incremental results are close to the average results obtained by the off-line version alone. This confirms that the distortion introduced by the incremental version is comparable to the natural instability of the algorithm, due to its stochastic optimization approach. This empirical evidence confirms that the strategy can be adopted also to manage standard datasets, since it speeds up the entire co-clustering process.
Finally, we analyze the incremental algorithm from another point of view. Due to the stochastic nature of our optimization solution, we cannot guarantee that similar clusters belonging to two different blocks will always be merged in the final partition. As a negative side effect, this could lead to incorrect interpretations of the results. In practice, however, this condition never occurs. In Table 10, we report the number of clusters found by our algorithm for different sizes of the initial block. In almost all the datasets, these values are stable and not correlated with the percentage of the block size. We notice significant variations only for very low percentages (10%), which may be explained by the poor representativity of the initial sample. In re1 the variation (still low) seems stronger than in the other datasets, but it is within the typical tolerance range for this particular dataset. As a side observation, the standard deviation of the values is in general quite low: this means that, given a dataset, our approach always converges to very similar solutions, in spite of its stochasticity.

Evaluation of HiCC_incr
We showed that the results of the co-clustering algorithm on an initial portion of the data can be used as the starting point for an incremental approach, and that this strategy can more generally be applied in an incremental framework without loss of quality. We now focus on the overall hierarchical co-clustering procedure, and show that the resulting hierarchies are close to those obtained by the off-line version. To quantify the quality of the produced hierarchies we use the CP/CV method (described in Section 5.5). Here, we compare the hierarchies generated by HiCC_incr with the hierarchies generated by HiCC. To perform this analysis, we split the original dataset into multiple blocks (varying the number of blocks from 5 to 20 per dataset). Each dataset is handled as follows: 1) the first block is processed by HiCC; 2) the following block is processed by HiCC_incr using the result on the first block as starting point; 3) the process is iterated incrementally for each of the remaining blocks. To obtain statistically relevant results we launch each algorithm 50 times. Afterwards, we compute the cosine distance between the root vectors representing the hierarchies obtained by HiCC and the root vectors of the hierarchies obtained by HiCC_incr. In Figure 8 we report the average and the standard deviation of the cosine distance over the 50 runs. We can notice that all the distances are below 0.1 and that, in two cases, they are really very low (below 0.01; recall that the cosine distance takes values between 0 and 1). This means that the incremental version of our algorithm does not substantially change the returned hierarchies.

the link strength between the two variables. In both algorithms, a local optimization method is used to optimize the measure by alternately changing one partition while the other is fixed. The main difference between these two approaches is that the τ measure is independent of the number of co-clusters and thus τCoClust can
automatically determine the number of co-clusters. Another co-clustering formulation was presented in [4]: the authors propose two different residue measures and introduce a co-clustering algorithm which optimizes the sum-squared residues function. Contrary to the above-mentioned approaches, the data addressed by this work is not limited to co-occurrence data. However, both residue measures require the number of desired clusters to be specified as a parameter. In [3] the authors propose a fully automatic cross-association framework. The proposed algorithm consists in optimizing an entropy-based objective function. Like τCoClust, their approach is parameter-free and determines automatically the number of row clusters and column clusters. However, it is strongly limited by the fact that it can manage only binary matrices, while our proposed technique is designed to deal with both binary and counting/frequency data. In this paper, among other innovations, we have also considered a possible extension of τCoClust to generate hierarchies.
Recently, Banerjee et al. have proposed in [2] a co-clustering setting based on matrix approximation. The approximation error is measured using a large class of loss functions called Bregman divergences. They introduce a meta-algorithm whose special cases include the algorithms from [7] and [4]. Another recent and significant theoretical result has been presented in [1], where the authors show that the co-clustering problem is NP-hard and propose a constant-factor approximation algorithm for any norm-based objective function.
[6] presents SCOAL, a different co-clustering approach aimed at learning a regression model dedicated to collaborative filtering applications. In this work a hierarchical clustering is associated to the co-clustering method; however, it relies on a divisive hierarchical clustering to choose an adequate number of clusters for rows and columns. In our work, instead, we force the algorithm to produce the whole hierarchy, until each leaf cluster is a singleton. Our final goal, in fact, is to provide a complete taxonomy that can be browsed by any expert. Any common model selection approach [12,35] can be applied as post-processing to cut or reduce the depth of the hierarchy for visualization purposes. Another significant difference between the two approaches is that our hierarchy is n-ary, while M-SCOAL always splits each cluster into two sub-clusters.
To the best of our knowledge, our approach is the first one that performs a simultaneous hierarchical co-clustering on both dimensions and that returns two coupled hierarchies. However, in the recent literature several approaches have been proposed that can be related to our work, even though they do not produce the same type of results. In [18] a hierarchical co-clustering for queries and URLs of a search engine log is introduced. This method first constructs a bipartite graph for queries and visited URLs; then all queries and related URLs are projected into a reduced dimensional space by applying singular value decomposition; finally, all connected components are iteratively clustered using k-means to construct a hierarchical categorization. In [5], the authors propose a hierarchical, model-based co-clustering framework that views a binary dataset as a joint probability distribution over row and column variables. Their approach starts by clustering the tuples in a dataset, where each cluster is characterized by a different probability distribution. Then, the conditional distribution of attributes over tuples is exploited to discover natural co-clusters in the data. This method does not construct any coupled hierarchy; moreover, co-clusters are identified in a separate step, only after the set of tuples has been partitioned. In [17], the proposed method constructs two hierarchies on gene expression data, but they are not generated simultaneously. In our approach, the levels of the two hierarchies are alternately generated, so that each level of both hierarchies identifies a strongly related set of co-clusters of the matrix.
Incremental clustering is an old and well-studied problem in data mining [8]; however, only few works address the problem of incremental co-clustering. In [11] the authors extend the co-clustering approach described in [2] in a straightforward way: they add each new object to a temporary cluster and then update the clustering statistics by assigning the new objects to existing clusters. The assumption here is that the number of co-clusters is always fixed, while our approach can determine a different number of clusters. This work has been recently improved by [23], in which the authors suggest estimating the clusters of the new incoming objects as soon as they arrive. They also propose an ensemble method for combining multiple local co-clustering results (with different numbers of clusters). Using this setting, however, the difference in terms of computational time between the off-line and the on-line method is not that marked. Both works have been applied to the area of collaborative filtering methods for recommender systems, where the matrix encodes users that rate some items. In this case, the incremental version may also be useful when existing users add new ratings for existing or new items. However, neither of these two methods deals with hierarchies. An attempt at hierarchical clustering for text documents is [31], where the authors extend COBWEB [9] to deal with the probability distribution of word occurrences in text documents. The authors first show that COBWEB is not suitable for this kind of data, and propose a variant which takes into account the word occurrence distribution. This algorithm, however, does not build hierarchies over the feature space.

Conclusion
The quality of flat clustering solutions often degrades in high-dimensional data. In this paper we have proposed both a hierarchical co-clustering approach and its extension to an incremental setting. HiCC is a novel hierarchical algorithm which builds two coupled hierarchies, one on the objects and one on the features, thus providing insights on both of them. The hierarchies are of high quality: we have validated them by objective evaluation measures, such as NMI and the Adjusted Rand Index, on many high-dimensional datasets. In addition, HiCC has other benefits: it is parameter-less; it does not require a pre-specified number of clusters; and it produces compact hierarchies, because it makes n-ary splits with n automatically determined. We have empirically demonstrated that the concept hierarchies produced by our approach are stable w.r.t. the space given by the original class labels. As a second important result, we have extended HiCC to an incremental setting. We have shown that the incremental variant produces good hierarchical co-clusterings, similar to those provided by the off-line approach, but requires significantly less computational time. This observation opens the way to a novel usage of our algorithms, in an incremental setting, in which the data are partitioned into several blocks and processed incrementally. Here, we conducted many experiments on text data; however, our algorithm could be applied to other kinds of data as well. For instance, in gene expression data analysis, biologists usually employ hierarchical clustering techniques to explore data in both dimensions, but they usually cluster genes and samples separately, often leading to uncorrelated results. In the future, we will investigate new applications of our hierarchical co-clustering techniques in life science domains, such as genomics, proteomics and phylogenetics. Furthermore, we will study alternative measures that can be applied directly to data other than co-occurrence tables.
Algorithm 2 adopts a stochastic optimization technique for τ_{U|V}. Starting from a given set of clusters, it obtains another set of clusters that improves the objective function τ_{U|V}. As a first step, the algorithm randomly chooses a cluster u_b from the set of initial clusters U. Then it randomly picks an object o belonging to u_b, and tries to move o from the original cluster u_b to another cluster u_e, including the possibility of forming a new cluster containing only o.
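A greedy variant of this move step can be sketched as follows. This is illustrative only: the objective is pluggable, and the actual algorithm follows the acceptance criteria of Section 3 rather than the best-improvement rule used here.

```python
import random

def try_random_move(clusters, objective, rng=random):
    """One stochastic move: pick a random cluster u_b and a random object o
    in it, then move o to the destination (an existing cluster or a fresh
    singleton) that most improves the objective; keep the original
    clustering if no move improves it."""
    src = rng.randrange(len(clusters))
    if not clusters[src]:
        return clusters
    o = rng.choice(clusters[src])
    best, best_score = None, objective(clusters)
    # candidate destinations: every other cluster, plus a new singleton
    candidates = [k for k in range(len(clusters)) if k != src] + [len(clusters)]
    for dst in candidates:
        trial = [c[:] for c in clusters] + ([[]] if dst == len(clusters) else [])
        trial[src].remove(o)
        trial[dst].append(o)
        trial = [c for c in trial if c]          # drop emptied clusters
        score = objective(trial)
        if score > best_score:
            best, best_score = trial, score
    return best if best is not None else clusters
```

Iterating such moves until a stopping criterion holds yields the local search behaviour described above.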

Theorem 1
The iterated local search algorithm τCoClust terminates in a finite number of steps and outputs a Pareto local optimum with respect to N(p).
In Figure 1 we present a contingency table storing the distribution of the values of two categorical variables in a set of observations. The two categorical variables of the example are Job and Salary, whose values have been categorized into salary levels. In this table, d_ij denotes the frequency of observations having the i-th value of the row variable (Salary) and the j-th value of the column variable (Job).
Fig. 1 A contingency table of two categorical variables.
r_ki ⊆ r_k'j; two similar conditions must hold for C too. Moreover, for any pair of levels k ∈ [1, K] and l ∈ [1, L] determining, respectively, a partition R_k = {r_k1, ..., r_k|R_k|} of the rows of D and a partition C_l = {c_l1, ..., c_l|C_l|} of the columns of D, we define a contingency table T^kl, i.e., a |R_k| × |C_l| matrix such that each element t^kl_ij ∈ T^kl (with i ∈ 1...|R_k| and j ∈ 1...|C_l|) is computed as specified in Section 2.
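The contingency table T^kl can be materialized from the two partitions in a few lines. A sketch, under the assumption that D is a dense matrix (list of lists) and that clusters are lists of row/column indices:

```python
def contingency_table(D, row_part, col_part):
    """Build the |R_k| x |C_l| contingency table: entry t_ij aggregates
    the values of D whose row falls in row cluster i and whose column
    falls in column cluster j."""
    T = [[0] * len(col_part) for _ in row_part]
    for i, rows in enumerate(row_part):
        for j, cols in enumerate(col_part):
            T[i][j] = sum(D[r][c] for r in rows for c in cols)
    return T
```

For instance, merging both rows of a 2×2 matrix into one row cluster collapses the table to a single row holding the column sums.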

Table 1
Datasets characteristics.
- tr11, tr21: two samples from the TREC dataset. These data come from the Text REtrieval Conference archive.
- re0, re1: two samples from the Reuters-21578 dataset. This dataset is widely used as a test collection for text categorization research.

Table 2
Row hierarchy of oh15. Each column represents a level of the hierarchy and each cell of the table represents a single cluster.

Table 3
Column hierarchy of re1

Table 4
Comparison between ITCC and HiCC with the same number of clusters for each level

Table 5
Complete view of performance for the top 8 levels of re1

Table 6
Average Goodness on the basis of τ S

Table 7
Mean depth of hierarchies

Table 8
Average and Std. Dev. of the cosine distance, used to evaluate the stability of the induced hierarchies

Table 9
Original Time Performance for each algorithmic component

Table 10
Mean number of clusters for various initial block sizes