DATA CLUSTERING A SEMINAR REPORT
science projects buddy Active In SP Posts: 278 Joined: Dec 2010 
19122010, 06:40 PM
SEMINAR REPORT
ON DATA CLUSTERING Submitted by NITIN PAUL Semester 7 Computer Science &Engineering COLLEGE OF ENGINEERING TRIVANDRUM 2010 ABSTRACT Data clustering also called Cluster Analysis, is defined as the unsupervised classification of data into various clusters. Webster defines cluster analysis as a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons ofmultiple characteristics. Clustering is a method of unsupervised learning under various Machine Learning Techniques.This topic has found application in numerous fields namely machine learning, data mining, pattern recognition, image analysis and bioinformatics.The topic is closely related to other similar techniques such as automatic classification, numerical taxonomy, botryology and typological analysis. Clustering is a very vast field and has found application in many different areas, yet there exists marked difference between it’s applications in the areas, and in general a unified clustering approach is hard to come by.Ideas, approaches and huerestics developed in a certain field is hard to migrate, or apply to another field. Hence we look at the various fields and the similarities between them. The seminar and presentation deals with cluster analysis in general, the paper explains the classification of the field,and the various approaches to cluster the data.The paper first defines the various steps involved in a clustering alogrithm, wherein the data is first standardized,and represented, then pattern are found and all the ground work for the actual clustering is done. The main two fields of clustering the paper focusses on is Heirarchical and Partitioning clustering. We also talk about other forms of clustering like the Nearest Neighbour, Fuzzy and Clustering using Artificial Nueral Networks.The paper also explains the various steps done after clustering which validate and strengthen the clustering algorithm. Even though clustering has found to be highly successfull we look at the challenges faced, wherein we find few of them to be really hard, for eg. we find out that there is no best clustering algorithm. The paper also shows a few application level problems reported in the other papers as a proof of significance of the topic. DATA CLUSTERING.docx (Size: 1.2 MB / Downloads: 235) I. INTRODUCTION A. Clustering Clustering is the process of organising data into meaningful groups, and these groups are called clusters.This is not something new to computer science, but has existed since long back as classification and taxonomy. Clustering groups(clusters) objects according to their percieved, or intrinsic similarity in their characteristics, Even though it is generally a field of unsupervised learning, knowledge about the type and source of the data has been found to be useful in both selecting the clustering algorithm, and for better clustering.This type of clustering generally aims to give tags to the clusters and has found use in fields like content mining. Clustering can be seen as a generalisation of classification. In classification we have the knowledge about both the object, the characteristics we are looking for and the classifications available. So classification is more similar to just finding“Where to put the new object in”. Clustering on the otherhand analyses the data and finds out the characteristics in it, either based on responses (supervised) or more generally without any responses (unsupervised). Clustering can also be seen as the reduction in the number of bits required to convey information about a member,so like so that we can say “Go past thethird tree and take a right“ instead of saying “go past the large green thing with red berries then past the large green thing with thorns and then ....and take a right”. We can see that there is much less information in the first definition but usually that is all that is required and not only is the extra information unnecessary it could also cause confusion. Hence we can see that clustering is a form of data abstraction. The most general definition is that given N items, can we divide the N items into k groups based on the measure of similiarity between the items, such that items in a group can be called ’similar’. For all purposes in the paper, we will be assuming that the N items are all points in a Mdimensional space, for the applications as defined below each of the axeses willbe taking a special meaning. Please note that this doesn’t mean that clusters are just the set of points that are closest together. We could have clusters which are lines, curves or even complex shapes, and spirals. In actual applications each of the axis would be recording a different quantifiable data, Eg : We could use color clusters where X  Axis is the red color, YAxis the Blue color and Zaxis the Green Color, and hence the points that can be seen together represent close enough colors. Lossy compression, clusters the group of points among which normal eye cannot distinguish and refers to all of them by a single color (usually the centroid of the colors). Data analysis is of two types mainly (i)exploratory or descriptive, meaning given a largeset of data he wants to find the general characterisitcs and models present in the data. (ii) confirmatory or inferential,where he will be given the characteristics be looking more into proving this item. We can see that clustering serves a formative process in the first model, where clustering itself will help form the high level model and the hypothesis, and in the second case even though not as strong as the first we can check the inclusion of new elements by checking if it would be included in the cluster or not Data clustering in 1,2 and 3 dimensions can be easily done by humans but as the number of dimesnions increases we find oursleves more and more in need of computers , and there exists problems that can be 8 – 10 dimensional. In machine learning there are problems which exists in infinte dimensions and so on, where the only approach would becluster analysis. The final note is that clustering just on the basis of physical closeness is not advisable. The figure from [Jain, 2010] given below shows the actual way clustering should happen, but note that if we go by physical closeness alone we will get erroneous results. Figure 1  The input and desired clustering B. Why Cluster ? We have had a brief overview of what clustering is, herewe answer the question of the use of clustering. We cansee that the basic defintion itself gives multiple uses for clustering, since it has been derived from the generalisation of classification, the most obvious use is classification. To name a few other important topics would be pattern analysis, grouping, decisionmaking, and machinelearning situations, including data mining, document retrieval, image segmentation, and pattern classification.Google Scholar search itself records over 2 million papers on the topic, and most of them being in implementation the relavance or need for the study of the topic is more than justified. Here we identify a few of the important areas. 1) Classification : Clustering helps in the classification of the data given, this helps us in easily predicting the information about a new entity based on the cluster it belongs to. A good clustering algorithm helps you to predict theproperties and characterisitcs of a object once we find that itbelongs to a particular groups. As [McKay,2003] says once we know that something is in the cluster of a tree, then it is much less likely to bite, and it is more useful if investigated by a botanist. 2) Lossy Compression: As explained before, clustering reduces the number of bits required to address a particularpoint, for eg. referring to the previous example of images,we can group pixels of similar color and closely situated to be a single cluster. We cannot perfectly retrieve the source information in the case of Lossy Compression but in case of objects such as images and music, we can restrict ourselves to clustering points in space which the humans can’t easily distinguish between, thus giving us roughly the same quality but at a much smaller file size. This also reduces the complexity of the information, thus enabling better recognition by computer. 3) Data Analysis: The internet is a vast pool of data, IDC GotoMarket services report that by 2011 the internet will have about 1.2 zettabytes of data, that is 1.2 terraterrabytes of data. Most of the data online but is not perfectly ordered and structed and hence any algorithm that intends, to search, catergorise or retrieve information from this would require to categorise them and for the same purpose we would need various clustering algorithms especially ones based on Natural Language Processing and Content Clustering.The clustering would also serve another purpose that is to assimilate the data, where algorithms hope to achieve the ability to understand the information about a topic and present it. Infact search engines like Clusty use clustering on data so that users are given a more ordered way to view the search results. 4) Image Processing: Image processing and Handwriting Recognition tools often use clustering for various purposes like Edge detection, and for comparisons. It is almost impossible to compare all the points in one particular image to the master image, instead we create clusters out of the source image and then compare it with the clusters of the master image giving us useful information about the level of similarity between the actual and the master image. It reduces the number of master images required, increases the speed and reduces the complexity. [Lee et.al,2008] has the wonderful example of the two pictures given below where the first one is a match and has more than 300 matching cluster points while the second one is not a match and hence has only 67 matching points. Clustering also helps avoid the effects due to resolution, noise and image quality in the image processing. Figure 2  The difference in matching cluster in similar and dissimilar images. The following are just a few of the applications, there are many more uses and going into all of them would not be advisable for the paper and hence we continue on to how clustering is done. II. PROCESS [Jain and Dubes, 1988] gives a nice outline for the various steps involved in any clustering algorithm . In general the clustering algorithm can be seen as a pipe model with feedback, that is the output of each phase is send as input for the next phase, and if and when required the more clustered data is send back, so that new patterns can be analysed. The various steps involved according to [Jain and Dubes, 1988] are 1) Pattern representation  This is the first step and defines what the pattern is, it can also incluse feature extraction and selection. The inclusion depends on the type of application and the data we know about the distribution.Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the Data Clustering practitioner. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering. 2) Definition of Pattern Proximity  This is define appropriate to the data domain, and this is what will define how close the values are and whether they need to be clustered seperately or together, for e.g. all patterns along a line might be clustered together and a spiral cluster could be made of many simple curve. This takes two patterns as the function and considers the output of the distance function between them, we could have a variety of distance function defined like _ The Euclidean distance. _ The Manhattan distance _ The maximum norm (or infinity norm) _ The Mahalanobis distance. _ The angle between two vectors (Generally used for higher dimensional data) _ The Hamming distance (Usually used to see errors and text analysis) In addition to all this we could also define specific distance functions based on our input. 3) Clustering/Grouping  This is where the clustering is finally done this takes input of the patterns and the definition of pattern proximity based on these patterns and produces the output as distinct clusters. Even now the data is just a set of points and clusters, for some uses this will be the last step. The theoretical process of clustering ends here, but most models go on to define abstraction and finally get information out of them. 4) Data Abstraction  This is an optional step, but is usually done in almost all the algorithms. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is humanoriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid [Diday and Simon 1976]. Hence, here the cluster is now defined in terms of a represenational data, and unwanted individual characterisitcs of the points are thrown away, this will cause just the wanted information to remain, and hence less storage space and complexity. 5) Assessment  In few cases there is the study of cluster tendency, wherein the input data are examined to see if there is any merit to a cluster analysis prior to one being performed. Generally but this has not found much use . Cluster validity is a post clustering analysis where assessments are objective, so that it can be automated and allows no ambiguity are performed to determine the input data are examined to see if there is any merit to a cluster analys is prior to one being performed. Generally but this has not found much use. Cluster validity is a post clustering analysis where assessments are objective, so that it can be automated and allows no ambiguity are performed to determine whether the output is meaningful. This is usually done in Machine Learning or similar situation where theclusters are tested for their accuracy by various meanssuch as Getting Feedback from the deployment (Search Results), or processsing with a different image file in image processing. The validation could be external comparison to a known structure, internal – analysisng the structure based on the required properties, and checking inherent appropriateness. Relative  where two clusterings from different algorithms are checked and compared. III. KNOWN PROBLEMS In spite of the prevalence of such a large number of clustering algorithms, and its success in a number of different applicationdomains, clustering remains a difficult problem. This can be attributed to the inherent vagueness in the definition of a cluster, and the difficulty in defining an appropriate similarity measure and objective function. The following fundamental challenges associated with clustering were highlighted in [Jain & Dubes, 1988], which are relevant even to this date. 1) What is a cluster?  Till date we have not been able to define a objective answer for this question, clusters are inherently an ambigous structure, and answering this required the knowledge of the data we are studying and hence it cannot be generalized. 2) Should the data be normalized?  Changes made to the data, could very well make significant differences, to the way in which the clustering is done and the output itself. 3) Does the data contain any outliers?  If we answer this question incorrectly, we would be meddling with the accuracy of the answer. If it contains outliers and we try to include them then we would unneccesarily be creating new clusters, or expanding existing clusters or even offseting the centroidal values in algorithms such as KMeans clustering. 4) How do we define the pairwise similarity?  It could well be that a pattern needs to be split and the parts can go to each of the other patterns it is associated with and hence finding the correct pairs and structures are really hard. In addition to these questions we have few even more significant questions all of which needs to be explained in more detail. A. How to represent the data Figure 3  The difference in clustering with respect data representation 3(a) as rectangular coordinates and 3(b) as eigen functions. Data representation is one of the most important factors that influence the performance of the clustering algorithm. If the representation (choice of features) is good, the clusters are likely to be compact and isolated and even a simple clustering algorithm such as Kmeans will find them. Unfortunately, there is no universally good representation; the choice of representation must be guided by the domain knowledge. For e.g : conisder a cluster of points in a very long straight line, here a 2  means algorithm will try to find the trisector or two of the quadsectors, but if we are to represent these points using the vector between any two of them, then we can see that all of them gets concentrated about a single point, and almost all clustering algorithms will give the result. Hence data representation can make a significant difference. Just like this a circle if defined in there of it’s eigen vector we can see that a circle can easily be considered as a single cluster, but without the transformation a 2Means algorithm will be just bisecting the point along a single line. The following figure in [Jain, 2010] gives a very good idea of the problem faced. B. Purpose of grouping This problem arises when we change the grouping mechanism involved, since the grouping defines the placement of an object on the plane, different grouping mechanisms will give different answers and many of them can be skewed such that proper clustering is not possible to be done. This causes the clustering to change with the changes in weightage given to different features in the data. C. Number of Clusters Automatically determining the number of clusters has been one of the most difficult problems in data clustering. Usually, clustering algorithms are run with different values of K; the best value of K is then chosen based on a criterion function. Figueiredo and Jain used the minimum message length (MML) criteria in conjunction with the Gaussian mixture model (GMM) to estimate K. Their approach starts with a large number of clusters, and gradually merges the clusters if this leads to a decrease in the MML criterion. Gap statistics [Tibshirani et al. , 2001] is another commonly used approach for deciding the number of clusters. The key assumption is that when dividing data into an optimal number of clusters, the resulting partition is most resilient to the random perturbation. In spite of these objectivecriteria, it is not easy to decide which value of K leads to more meaningful clusters. D. Cluster Validity Clustering algorithms tend to find clusters in the data irrespective of whether or not any clusters are present. Cluster validity refers to formal procedures that evaluate the results of cluster analysis in a quantitative and objective fashion [Jain & Dubes, 1988]. In fact, even before a clustering algorithm is applied to the data, the user should determine if the data even has a clustering tendency. Cluster validity indices can be defined based on three different criteria: internal, relative, and external [Jain & Dubes, 1988]. Indices based on internal criteria assess the fit between the structure imposed by the clustering algorithm (clustering) and the data using the data alone. Indices based on relative criteria compare multiple structures (generated by different algorithms) and decide which of them is better in some sense. External indices measure the performance by matching cluster structure to known information about the model. External validation is something that can only be used whenever to check a new algorithm or so on. Since we need a already clustered data set to compare with, in most cases external validation makes no sense. The notion of cluster stability is appealing as an internal stability measure. Cluster stability is measured as the amount of variation in the clustering solution over different subsamples drawn from the input data. Different measures of variation can be used to obtain different stability measures. The performance of this classifier on the testing subset(s) indicates the stability of the clustering algorithm. In model based algorithms (e.g., centroid based representation of clusters in Kmeans, or Gaussian Mixture Models), the distance between the models found for different subsamples can be used to measure the stability. It can also be defined as the generalization ability of a clustering algorithm (in PACBayesian sense). They argue that since many algorithms can be shown to be asymptotically stable, the rate at which the asymptotic stability is reached with respect to the number of samples is a more useful measure of cluster stability. E. Why clustering algorithm to use ? Different clustering algorithms often result in entirely different partitions even on the same data. An interesting question is to identify algorithms that generate similar partitions irrespective of the data. In other words, can we cluster the clustering algorithms? Jain [Jain et al. , 2004] clustered 35 different clustering algorithms into 5 groups based on their partitions on 12 different datasets. The similarity between the clustering algorithms is measured as the averaged similarity between the partitions obtained on the 12 datasets. Clustering algorithms can also be compared at the theoretical level based on their objective functions. In order to perform such a comparison, a distinction should be made between a clustering method and a clustering algorithm [Jain & Dubes, 1988]. A clustering method is a general strategy employed to solve a clustering problem. A clustering algorithm, on the other hand, is simply an instance of a method. For instance, minimizing the squared error is a clustering method, and there are many different clustering algorithms, including Kmeans, that implement the minimum squared error method. Some equivalence relationships even between different clustering methods have been shown. We can finally say that there is no best clustering algorithm. Each clustering algorithm imposes a structure on the data either explicitly or implicitly. When there is a good match between the model and the data, good partitions are obtained. Since the structure of the data is not known before, one needs to try competing and diverse approaches to determine an appropriate algorithm for the clustering task at hand. This idea of no best clustering algorithm is partially captured by the impossibility theorem [Kleinberg, 2002], which states that no single clustering algorithm simultaneously satisfies the three basic axioms of data clustering, i.e., scale invariance ,consistency, and richness. IV. NOTATIONS USED  A pattern (or feature vector, observation, or datum) X is a single data item used by the clustering algorithm. It typically consists of a vector of d measurements: X = fx1; x2; :::xdg  The individual scalar components xi of a pattern X are called features (or attributes).  D is the dimensionality of the pattern or of the pattern space.  A pattern set is denoted _ = fX1; :::Xng. The ith pattern in _ is denoted Xi = fxi;1; :::; xi;dg. In many cases a pattern set to be clustered is viewed as an n*d pattern matrix. A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect thedifferent pattern generation processes represented in the pattern set. Hard clustering techniques assign a class label li to each patterns Xi, identifying its class. The set of all labels for a pattern set _ is L = f?l1; :::; lng with li 2 f1; ???; kg where k is the number of clusters. Fuzzy clustering procedures assign to each input pattern Xi a fractional degree of membership fij in each output cluster j. A distance measure (a specialization of a proximity measure) is a metric (or quasimetric) on the feature space used to quantify the similarity of patterns. V. PREPROCESSING There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user?s role in the pattern representation process is to gather facts and conjectures about the data,optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. Because of the difficulties surrounding pattern representation, it is conveniently assumed that thepattern representation is available prior to clustering. Nonetheless, a careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results. A good pattern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true structure is difficult or impossible to discern .A pattern can measure either a physical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conventionally as multidimensional vectors,where each dimension is a single feature. These features can be either quantitative or qualitative. For example, if weight and color are the two features used, then? (20, black) is the representation of a black object with 20 units of weight. The features can be subdivided into the following types 1) Quantitative features a) continuous values (e.g., weight) b) discrete values (e.g., the number of computers) c) interval values (e.g., the dura tion of an event). 2) Qualitative features a) nominal or unordered (e.g., color) b) ordinal (e.g., military rank or qualitative evaluations of temperature) Quantitative features can be measured on a ratio scale (with a meaningful reference value, such as temperature), or on nominal or ordinal scales. One can also use structured features which are represented as trees, where the parent node represents a generalization of its child nodes. For example, a parent node vehicle may be a generalization of children labeled cars, buses, trucks,and motorcycles Further,the node cars could be a generalization of cars of the type Toyota, Ford, Benz, etc. A generalized representation of patterns, called symbolic objects was proposed in Diday . Symbolic objects are defined by a logical conjunction of events. These events link values and features in which the features can take one or more values and all the objects need not be defined on the same set of features.It is often valuable to isolate only the most descriptive and discriminatory features in the input set, and utilize those features exclusively in subsequent analysis. Feature selection techniques identify a subset of the existing features for subsequent use, while feature extraction techniques compute new features from the original set. In either case, the goal is to improve classification performance and/or computational efficiency. Feature selection is a wellexplored topic in statistical pattern recognition ; however, in a clustering context (i.e., lacking class labels for patterns), the feature selection process is of necessity ad hoc, and might involve a trialanderror process where various subsets of features are selected,the resulting patterns clustered, and the output evaluated using a validity index. In contrast, some of the popular feature extraction processes do not depend on labeled data and can be used directly. Reduction of the number of features has an additional benefit, namely the ability to produce output that can be visually inspected by a human. VI. SIMILARITY Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the wellknown distance measures used for patterns whose features are all continuous.The most popular metric for continuous features is the Euclidean distance. Patterns can also be represented using string or tree structures. Strings are used in syntactic clustering, but since syntactic methods are inferior in every aspect. Therefore, we do not consider syntactic methods. It is possible to make any two arbitrary patterns equally similar by encoding them with a sufficiently large number of features. As a consequence, any two arbitrary patterns are equally similar, unless we use some additional domain information. VII. CLUSTER TECHNIQUES Different approaches to clustering datacan be described as follows _ Hierarchical – Single Link – Complete Link _ Partitional – Square Error (includes KMeans) – Graph Theoritic – Mixture Resolving (includes Expectation Maximization – Mode Seeking is based on the discussion in Jain and Dubes [1988]). At the top level, there is a distinction between hierarchical and partitional approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce only one). The taxonomy has to be supplemented by a discussion of crosscutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy. _ Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative ap proach begins with each pattern in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met. _ Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm considers features sequentially to divide the given collection of patterns. This is illustrated in following figure 7. Figure 7 Here, the collection is divided into two groups using feature x1; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature x2, as depicted by the broken lines H1 and H2 . The major problem with this algorithm is that it generates 2D clusters where d is thedimensionality of the patterns. For large values of d (d _ 100 is typical in information retrieval applications) the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters. _ Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership. _ Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings. _ Incremental vs. nonincremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of data structures used in the algorithm’s operations. The the specification of an algorithm for clustering usually leaves considerable flexibilty in implementation. A. Hierarchical Clustering The operation of a hierarchical clustering algorithm is illustrated using the twodimensional data set in Figure 9. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 8 (obtained from the singlelink algorithm [Jain and Dubes 1988]) is shown in Figure 9. The dendrogram can be broken at different levels to yield different clusterings of the data. Figure 8 Figure 9 Most hierarchical clustering algorithms are variants of the singlelink, completelink, and minimumvariance algorithms. Of these, the singlelink and completelink algorithms are most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters. In the singlelink method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the completelink algorithm,the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. In either case, two clusters are merged to form a larger cluster based on minimum distance criteria. The completelink algorithm produces tightly bound or compact clusters. The singlelink algorithm, by contrast, suffers from a chaining effect [Nagy 1968]. It has a tendency to produce clusters that are straggly or elongated. There are two clusters in Figures 10 and 11 separated by a ?bridge? of noisy patterns. The singlelink algorithm produces the clusters shown in Figure 10,whereas the completelink algorithm obtains the clustering shown in Figure 11. The clusters obtained by the completelink algorithm are more compact than those obtained by the singlelink algorithm; the cluster labeled 1 obtained using the singlelink algorithm is elongated because of the noisy patterns labeled . The singlelink algorithm is more versatile than the completelink algorithm, otherwise. However, from a pragmatic viewpoint, it has been observed that the complete link algorithm produces more useful hierarchies in many applications than the singlelink algorithm . Figure 11 & 12  11 shows the single link result and 12 shows the complete link result 1) Agglomerative SingleLink Clustering Algorithm: 1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order. 2) Step through the sorted list of distances, forming for each distinct dissimilarity value Dk a graph on the patterns where pairs of patterns closer than Dk are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step. 3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by simply connected components in the corresponding graph. 2) Agglomerative CompleteLink Clustering Algorithm: 1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order. 2) Step through the sorted list of distances, forming for each distinct dissimilarity value Dk a graph on the patterns where pairs of patterns closer than Dk are connected by a graph edge. If all the patterns are members of a completely connected graph, stop. 3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by completely connected components in the corresponding graph. Hierarchical algorithms are more versatile than partitional algorithms. For example, the singlelink clustering algorithm works well on data sets containing nonisotropic clusters including wellseparated, chainlike, and concentric clusters, whereas a typical partitional algorithm such as the k –means algorithm works well only on data sets having isotropic clusters. On the other hand, the time and space complexities of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms that exploit the good features of both categories. 3) Hierarchical Agglomerative Clustering Algorithm: 1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster. 2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation. 3) If all patterns are in one cluster,stop. Otherwise, go to step 2. Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed. Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters. B. Partitional Algorithms A partitional clustering algorithm obtains a single partition of the data instead of a clustering structure, such as the dendrogram produced by a hierarchical technique. Partitional methods have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitional algorithm is the choice of the number of desired output clusters. A seminal paper [Dubes 1987] provides guidance on this key design decision. The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all of the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering. 1) Squared Error Algorithms: The most intuitive and frequently used criterion function in partitional clustering techniques is the squared error criterion, which tends to work well with isolated and compact clusters. The k means is the simplest and most commonly used algorithm employing a squared error criterion.It starts with a random initial partition and keeps reassigning the patterns to clusters based on the similarity between the pattern and the cluster centers until a convergence criterion is met (e.g.,there is no reassignment of any pattern from one cluster to another, or the squared error ceases to decrease significantly after some number of iterations).The k means algorithm is popular because it is easy to implement, and its time complexity is O(n), where n is the number of patterns. A major problem with this algorithm is that it is sensitive to the selection of theinitial partition and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen. Figure 12 Here we can see two configurations each of which is the result of a different starting permutation. Figure 12 shows seven twodimensional patterns. If we start with patterns A, B,and C as the initial means around which the three clusters are built, then we end up with the partition A, B,C, D, E, F, G shown by ellipses. The suared error criterion value is much larger for this partition than for the best partition A, B, C, D, E, F, G shown by rectangles, which yields the global minimum value of the squared error criterion function for a clustering containing three clusters. The correct threecluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means. Squared Error Clustering Method _ Select an initial partition of the patterns with a fixed number of clusters and cluster centers. _ Assign each pattern to its closest cluster center andcompute the new cluster centers as the centroids of theclusters. Repeat this step until convergence is achieved,i.e., until the cluster membership is stable. _ Merge and split clusters based on some heuristicinformation,optionally repeating step 2. k Means Clustering Algorithm _ Choose k cluster centers to coincide with k randomlychosen patterns or k randomly defined points inside thehypervolume containing the pattern set. _ Assign each pattern to the closest cluster center. _ Recompute the cluster centers using the current cluster memberships. _ If a convergence criterion is not met,go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error. Several variants of the k means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value. Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a prespecified threshold, and two clusters are merged when the distance between their centroids is below another prespecified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The wellknown ISODATA algorithm employs this technique of merging and splitting clusters. If ISODATA is given the ?ellipse? partitioning shown in Figure 12 as an initial partitioning, it will produce the optimal three cluster partitioning. ISODATA will first merge the clusters A and B,C into one cluster because the distance between their centroids is small and then split the cluster D,E,F,G, which has a large variance, into two clusters D,E and F,G.Another variation of the k means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed and describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximumlikelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters. C. Graph Theoretic Clustering. The bestknown graphtheoretic divisive clustering algorithm is based on construction of the minimal spanning tree (MST) of the data , and then deleting the MST edges with the largest lengths to generate clusters. Figure 13 Clustering using Minimum Spanning Tree Figure 13 depicts the MST obtained from nine twodimensional points. By breaking the link labeled CD with a length of 6 units (the edge with the maximum Euclidean length), two clusters (A, B, Cand D, E, F, G, H, I) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units. The hierarchical approaches are also related to graphtheoretic clustering.Singlelink clusters are subgraphs of the minimum spanning tree of the data which are also the connected components. Completelink clusters are maximal complete subgraphs, and are related to the node colorability of graphs. The maximal complete subgraph was considered the strictest definition of a cluster. D. MixtureResolving and ModeSeeking Algorithms The mixture resolving approach to cluster analysis has been addressed in a number of ways. The underlying assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each and (perhaps) their number. Most of the work in this area has assumed that the individual components of the mixture density are Gaussian, and in this case the parameters of the individual Gaussians are to be estimated by the procedure. Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities [Jain and Dubes 1988]. More recently, the Expectation Maximization (EM) algorithm (a general purpose maximum likelihood algorithm for missingdata problems) has been applied to the problem of parameter estimation. In the EM framework, the parameters of the component densities are unknown, as are the mixing parameters, and these are estimated from the patterns. The EM procedure begins with an initial estimate of the parameter vector and iteratively rescores the patterns against the mixture density produced by the parameter vector. The rescored patterns are then used to update the parameter estimates. In a clustering context, the scores of the patterns (which essentially measure their likelihood of being drawn from particular components of the mixture) can be viewed as hints at the class of the pattern. Those patterns, placed (by their scores) in a particular component, would therefore be viewed as belonging to the same cluster. Nonparametric techniques for densitybased clustering have also been developed [Jain and Dubes 1988]. Inspired by the Parzen window approach to nonparametric density estimation, the corresponding clustering procedure searches for bins with large counts in a multidimensional histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate. E. Nearest Neighbour Clustering Since proximity plays a key role in our intuitive notion of a cluster, nearestneighbor distances can serve as the basis of clustering procedures. An iterative procedure assigns each unlabeled pattern to the cluster of its nearest labeled neighbour pattern, provided the distance to that labeled neighbor is below a threshold. The process continues until all patterns are labelled or no additional labelings occur. The mutual neighbourhood value (described earlier in the context of distance computation) can also be used to grow clusters from near neighbors. F. Fuzzy Clustering Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function. The output of such algorithms is a clustering,but not a partition. We give a high level partitional fuzzy clustering algorithm below. 1) Select an initial fuzzy partition of the N objects into K clusters by selecting the N _K membership matrix U. An element uij of this matrix represents the grade of membership of object Xi in cluster Cj.Typically, uij 2 [0; 1]. 2) Using U, find the value of a fuzzy criterion function. 3) Repeat step 2 until entries in U do not change significantly. In fuzzy clustering, each cluster is a fuzzy set of all the patterns. Figure 14 illustrates the idea. The rectangles enclose two ?hard? clusters in the data H1=1,2,3,4,5 and H2=6,7,8,9.A fuzzy clustering algorithm might pro duce the two fuzzy clusters F1 and F2 depicted by ellipses. The patterns will have membership values in [0,1] for each cluster. For example, fuzzy cluster F1 could be compactly described as (1,0.9),(2,0.8),(3,0.7),(4,0.6),(5,0.55),(6,0.2),(7,0.2), (8,0.0),(9,0.0) and F2 similarly. The ordered pair (i; _i) in each cluster represent the ith pattern and its membership value to the cluster _i . Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership value. The most popular fuzzy clustering algorithm is the fuzzy c means (FCM) algorithm. Even though it is better than the hard k –means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of membership functions is the most important problem in fuzzy clustering; different choices include those based on similarity decomposition and centroids of clusters. G. Artifical Neural Networks Artificial neural networks (ANNs) are motivated by biological neural networks. ANNs have been used extensively over the past three decades for both classification and clustering . Some of the features of the ANNs that are important in pattern clustering are _ ANNs process numerical vectors and so require patterns to be represented using quantitative features only. _ ANNs are inherently parallel and distributed processing architectures. _ ANNs may learn their interconnection weights adaptively. More specifically, they can act as pattern normalizers and feature selectors by appropriate selection of weights. Competitive (or winner?take?all) neural networks [Jain and Mao 1996] are often used to cluster input data. In competitive learning, similar patterns are grouped by the network and represented by a single unit (neuron). This grouping is done automatically based on data correlations. Wellknown examples of ANNs used for clustering include Kohonen?s learning vector quantization (LVQ) and selforganizing map (SOM) [Kohonen 1984], and adaptive resonance theory models [Carpenter and Grossberg 1990]. The architectures of these ANNs are simple: they are single layered. Patterns are presented at the input and are associated with the output nodes. The weights between the input nodes and the output nodes areiteratively changed (this is called learning) until a termination criterion is satisfied. Competitive learning has been found to exist in biological neural networks. However, the learning or weight update procedures are quite similar to those in some classical clustering approaches.The learning algorithm in ART models is similar to the leader clustering algorithm. The SOM gives an intuitively appealing twodimensional map of the multidimensional data set, and it has been successfully used for vector quantization and speech recognition [Kohonen 1984]. However, like its sequential counterpart, the SOM generates a suboptimal partition if the initial weights are not chosen properly. Further, its convergence is controlled by various parameters such as the learning rate and a neighborhood of the winning node in which learning takes place. It is possible that a particular input pattern can fire different output units at different iterations; this brings up the stability issue of learning systems. The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations. This problem is closely associated with the problem of plasticity, which is the ability of the algorithm to adapt to new data. For stability, the learning rate should be decreased to zero as iterations progress and this affects the plasticity. The ART models are supposed to be stable and plastic. However, ART nets are orderdependent; that is, different partitions are obtained for different orders in which the data is presented to the net. Also, the size and number of clusters generated by an ART net depend on the value chosen for the vigilance threshold, which is used to decide whether a pattern is to be assigned to one of the existing clusters or start a new cluster. Further, both SOM and ART are suitable for detecting only hyperspherical clusters. All these ANNs use a fixed number of output nodes which limit the number of clusters that can be produced. VIII. APPLICATION EXAMPLES A. Information Retrival Information retrieval (IR) is concerned with automatic storage and retrieval ofdocuments. Many university libraries use IR systems to provide access to books, journals, and other documents. Libraries use the Library of Congress Classification (LCC) scheme for efficient storage and retrieval of books. The LCC scheme consists of classes labeled A to Z [LC Classification Outline 1990] which are used to characterize books belonging to different subjects. For example, label Q corresponds to books in the area of science, and the subclass QA is assigned to mathematics. Labels QA76 to QA768 are used for classifying books related to computers and other areas of computer science. There are several problems associated with the classification of books using the LCC scheme. Some of these are listed below: 1) When a user is searching for books in a library which deal with a topic of interest to him, the LCC number alone may not be able to retrieve all the relevant books. This is because the classification number assigned to the books or the subject categories that are typically entered in the database do not contain sufficient information regarding all the topics covered in a book. To illustrate this point, let us consider the book Algorithms for Clustering Data by Jain and Dubes [1988]. Its LCC number is ?QA 278.J35?. In this LCC number, QA 278 corresponds to the topic ?cluster analysis?, J corresponds to the first author?s name and 35 is the serial number assigned by the Library of Congress. The subject categories for this book provided by the publisher (which are typically entered in a database to facilitate search) are cluster analysis, data processing and algorithms. There is a chapter in this book [Jain and Dubes 1988] that deals with computer vision, image processing, and image segmentation. So a user looking for literature on computer vision and, in particular, image segmentation will not be able to access this book by searching the database with the help of either the LCC number or the subject categories provided inthe database. The LCC number for computer vision books is TA 1632 which is very different from the number QA 278.J35 assigned to this book. 2) There is an inherent problem in assigning LCC numbers to books in a rapidly developing area. For example, let us consider the area of neural networks. Initially, category ?QP?in LCC scheme was used to label books and conference proceedings in this area. For example, Proceedings of the International Joint Conference on Neural Networks [IJCNN?91] was assigned the number ?QP 363.3?. But most of the recent books on neural networks are given a number using the category label ?QA?; Proceedings of the IJCNN?92 [IJCNN?92] is assigned the number ?QA 76.87?. Multiple labels for books dealing with the same topic will force them to be placed on different stacks in a library. Hence, there is a need to update the classification labels from time to time in an emerging discipline. 3) Assigning a number to a new book is a difficult problem. A book may deal with topics corresponding to two or more LCC numbers, and therefore, assigning a unique number to such a book is difficult. B. Pattern Recognition Each book is represented as a generalized list of these strings using the ACM CR classification tree. For the sake of brevity in representation, the fourthlevel nodes in the ACM CR classification tree are labeled using numerals 1 to 9 and characters A to Z. For example, the children nodes of I.5.1 (models) are labeled I.5.1.1 to I.5.1.6. Here, I.5.1.1 corresponds to the node labeled deterministic, and I.5.1.6 stands for the node labeled structural. In a similar fashion, all the fourthlevel nodes in the tree can be labeled as necessary. From now on, the dots in between successive symbols will be omitted to simplify the representation. For example, I.5.1.1 will be denoted as I511. We illustrate this process of representation with the help of the book by Jain and Dubes [1988]. There are five chapters in this book. For simplicity of processing, we consider only the information in the chapter contents. There is a single entry in the table of contents for chapter 1, ?Introduction,? and so we do not extract any keywords from this. Chapter 2, labeled ?Data Representation,? has section titles that correspond to the labels of the nodes in the ACM CR classification tree. Based on the above analysis, Chapter 2 of Jain and Dubes [1988] can be characterized by the weighted disjunction ((I522 ? I532 ? I515)(1,4)). The weights(1,4) denote that it is one of the four chapters which plays a role in the representation of the book. C. Image Segmentation Image segmentation is a fundamental component in many computer vision.applications, and can be addressed as a clustering problem. The segmentation of the image(s)presented to an image analysis system is critically dependent on the scene to be sensed, the imaging geometry, configuration, and sensor used to transduce the scene into a digital image, and ultimately the desired output (goal) of the system. The applicability of clustering meth odology to the image segmentation problem was recognized over four decades ago, and the paradigms underlying the initial pioneering efforts are still in use today. A recurring theme is to define feature vectors at every image location (pixel) composed of both functions of image intensity and functions of the pixel location itself. D. Data Mining Data mining has often been performed on transaction and relational databases which have welldefined fields which can be used as features, but there has been recent research on large unstructured databases such as the World Wide Web .There have been attempts to classify Web documents using words or functions of words as features. However, relatively small sets of labeled training samples and very large dimensionality limit the ulti mate success of automatic Web document categorization based on words as features. Rather than grouping documents in a word feature space, Wulfekuhler and Punch [1997] cluster the words from a small collection of World Wide Web documents in the document space. The sample data set consisted of 85 documents from the manufacturing domain in 4 different userdefined categories (labor, legal, government, and design).These 85 documents contained 5190 distinct word stems after common words (the, and, of) were removed. Since the words are certainly not uncorrelated,they should fall into clusters where words used in a consistent way across the document set have similar values of frequency in each document.K means clustering was used to group the 5190 words into 10 groups. One surprising result was that an average of 92% of the words fell into a single cluster, which could then be discarded for data mining purposes. The smallest clusters contained terms which to a human seem semantically related. Terms which are used in ordinary contexts, or unique terms which do not occur often across the training document set will tend to cluster into the arge 4000 member group. This takes care of spelling errors, proper names which are infrequent, and terms which are used in the same manner throughout the entire document set. Terms used in specific contexts (such as file in the context of filing a patent, rather than a computer file) will appear in the documents consistently with other terms appropriate to that context (patent, invent) and thus will tend to cluster together. Among the groups of words, unique contexts stand out from the crowd. After discarding the largest cluster, the smaller set of features can be used to construct queries for seeking out other relevant documents on the Web using standard Web searching tools.Searching the Web with terms taken from the word clusters allows discovery of finer grained topics (e.g., family medical leave) within the broadly defined categories (e.g., labor). IX. CONCLUSION In this paper we studied an introduction to cluster algorithms, and found the ways in which we could apply the same. We studies a general overview of each of the important methods. We also looked into the currently existing problems in the field and noticed that there is still a lot of work that need to be done in the field and was amazed by the depth of the field, finally we also looked into a few applications of the clustering algorithms. BIBLIOGRAPHY 1.[Jain, 2010] JAIN ANIL K.,Data Clustering: 50 Years Beyond KMeans [Jain & Dubes, 1988] JAIN, ANIL K., & DUBES, RICHARD C. 1988. Algorithms for clustering data. Prentice Hall. 2.[Jain et al. , 2004 ] JAIN, A. K., TOPCHY, A., LAW, M. H. C., & BUHMANN, J. M. 2004.Landscape of clustering algorithms. Pages 260?263 of: Proceedings of the international conference on pattern recognition, vol. 1. 3.[Law et al. , 2005 ] LAW, MARTIN, TOPCHY, ALEXANDER, & JAIN, A. K. 2005. Modelbased clustering with probabilistic constraints. Pages 641?645 of: Proc. Siam conference on data mining 4.[Tukey, 1977 ] TUKEY, JOHN WILDER. 1977. Exploratory data analysis. AddisonWesley. 5.[Lee et al. , 2008 ] LEE, JUNGEUN, JAIN, ANIL K., & JIN, RONG. 2008. Scars, marks and tattoos (SMT): Soft biometric for suspect and victim identification. In: Proceedings of the biometric symposium. 6.[Tibshirani et al. , 2001 ]TIBSHIRANI, R., WALTHER, G., , & HASTIE, T. 2001.Estimating the number of clusters in a data set via the gap statistic. Journal of the royalstatistical society, B, 411?423. 7.[Jain & Flynn, 1996] JAIN, ANIL K., & FLYNN, P. 1996. Advances in image understanding. IEEE Computer Society Press. Chap. Image segmentation using clustering, pages 65 to 83. Use Search at http://topicideas.net/search.php wisely To Get Information About Project Topic and Seminar ideas with report/source code along pdf and ppt presenaion



praveenakumari.n Active In SP Posts: 1 Joined: Feb 2011 
09022011, 03:23 PM
pls send the ppt of this report..[Data clustering]



