
lect-5-13

Chapter 3 Basic Data Mining Techniques
3.3 The K-Means Algorithm (for cluster analysis)

1. What is Cluster Analysis (Clustering)?
- Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
- High-quality clustering produces high intra-class similarity and low inter-class similarity.
- Cluster analysis: grouping a set of data objects into clusters.
- Clustering is unsupervised learning (unsupervised classification): there are no predefined classes. It is a form of learning by observation rather than learning by examples.
- Typical applications: as a stand-alone tool to gain insight into the data distribution, or as a preprocessing step for other algorithms.

Examples of Clustering Applications (I): Customer Segmentation
Different customers consume the same kind of product or service in different ways; by studying these differences, a company can design different marketing mixes and capture the greatest consumer surplus. That is the main purpose of customer segmentation. Three families of customer classification methods are in common use:
- Empirical description: decision makers partition customers into classes based on experience.
- Traditional statistics: customer classes are derived from simple statistics over customer attributes.
- Non-traditional statistical methods, i.e., clustering: methods based on artificial intelligence techniques.

Examples of Clustering Applications (II)
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Insurance: identify groups of motor insurance policy holders with a high average claim cost.
- City planning: identify groups of houses according to house type, value, and geographical location.

[Four figure-only "Example" slides; images not reproduced.]

2. The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as initial cluster centers.
3. Assign each remaining instance to its closest cluster center.
4. Calculate a new center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.

The K-Means Clustering Algorithm: Example
[Three figure-only slides stepping through an example; images not reproduced.]
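The five steps above are short enough to sketch directly. A minimal NumPy version, assuming Euclidean distance and mean centers; the test data and the empty-cluster guard are my own illustrative additions:

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Steps 1-5 of the k-means algorithm above (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose K points as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest cluster center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its members
        # (keeping the old center if a cluster happens to be empty).
        new_centers = np.array([points[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        # Step 5: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical 2-D data, just to exercise the sketch.
pts = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5]])
print(k_means(pts, k=2))
```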
3. General Considerations
- Requires real-valued data.
- The number of clusters present in the data must be selected by the user.
- Works best when the clusters in the data are of approximately equal size.
- Attribute significance cannot be determined.
- Lacks explanation capabilities.

4. Types of Data in Cluster Analysis
Distance is normally used to measure the similarity or dissimilarity between two data objects. The following subsections cover:
- 4.1 Interval-valued variables
- 4.2 Binary variables
- 4.3 Nominal variables
- 4.4 Ordinal variables
- 4.5 Variables of mixed types

4.1 Interval-Valued Variables
Some popular measures are instances of the Minkowski distance:
$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects and q is a positive integer.
- If q = 1, d is the Manhattan distance.
- If q = 2, d is the Euclidean distance: $d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$.
Requirements for a distance function:
- $d(i,j) \ge 0$
- $d(i,i) = 0$
- $d(i,j) = d(j,i)$
- $d(i,j) \le d(i,k) + d(k,j)$

4.1 Interval-Valued Variables (cont.): Standardizing Data
- Find the mean: $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
- Calculate the mean absolute deviation: $s_f = \frac{1}{n}(|x_{1f}-m_f| + |x_{2f}-m_f| + \cdots + |x_{nf}-m_f|)$
- Calculate the standardized measurement (z-score): $z_{if} = \dfrac{x_{if}-m_f}{s_f}$

4.2 Binary Variables
A contingency table for binary data, where for objects i and j: a counts the variables that are 1 for both, b those that are 1 for i and 0 for j, c those that are 0 for i and 1 for j, and d those that are 0 for both:

                Object j
                 1    0
  Object i   1   a    b
             0   c    d

- Simple matching coefficient (if the binary variable is symmetric): $d(i,j) = \frac{b+c}{a+b+c+d}$
- Jaccard coefficient (if the binary variable is asymmetric): $d(i,j) = \frac{b+c}{a+b+c}$

4.2 Binary Variables (cont.): Example
- Gender is a symmetric attribute; the remaining attributes are asymmetric binary attributes.
- Let the values Y and P be set to 1, and the value N be set to 0.
[The record table for this example is not reproduced.]

4.3 Nominal Variables
A nominal variable is a generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.
Method: simple matching (symmetric): $d(i,j) = \frac{p-m}{p}$, where m is the number of matches and p is the total number of variables.

4.4 Ordinal Variables
Variables for which order is important, e.g., rank. They can be treated like interval-scaled variables:
- Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$.
- Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable with $z_{if} = \frac{r_{if}-1}{M_f-1}$.
- Compute the dissimilarity using the methods for interval-scaled variables.

4.5 Variables of Mixed Types
A database may contain all types of variables: symmetric binary, asymmetric binary, nominal, ordinal, and interval-valued. One may use a weighted formula to combine their effects:
$d(i,j) = \dfrac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
- If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and 1 otherwise.
- If f is interval-based: use the normalized distance.
- If f is ordinal: compute the ranks $r_{if}$ and the values $z_{if}$, and treat $z_{if}$ as interval-scaled.

4.5 Variables of Mixed Types (cont.)
The indicator $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$.
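The Minkowski family from 4.1 reduces to a one-liner. A small sketch (the function name and test points are mine), with the q = 1 and q = 2 special cases checked by hand:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional data objects (q >= 1)."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** q) ** (1.0 / q))

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606
```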
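The standardization recipe from 4.1 (cont.) in code; note that it divides by the mean absolute deviation $s_f$, not the standard deviation. The sample matrix is hypothetical:

```python
import numpy as np

def standardize(X):
    """Column-wise z-scores z_if = (x_if - m_f) / s_f, where s_f is the
    mean absolute deviation (not the standard deviation)."""
    m = X.mean(axis=0)              # m_f: per-variable mean
    s = np.abs(X - m).mean(axis=0)  # s_f: mean absolute deviation
    return (X - m) / s

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(standardize(X))  # each column becomes [-1.5, 0.0, 1.5]
```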
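Both coefficients from 4.2 fall out of the same contingency counts. A sketch; the two records are a hypothetical Y/P to 1, N to 0 encoding in the spirit of the example above:

```python
def binary_dissim(i, j, symmetric=True):
    """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity
    for two equal-length 0/1 vectors, via the contingency counts a, b, c, d."""
    a = sum(u == 1 and v == 1 for u, v in zip(i, j))  # 1/1 agreements
    b = sum(u == 1 and v == 0 for u, v in zip(i, j))
    c = sum(u == 0 and v == 1 for u, v in zip(i, j))
    d = sum(u == 0 and v == 0 for u, v in zip(i, j))  # 0/0 agreements
    return (b + c) / (a + b + c + d) if symmetric else (b + c) / (a + b + c)

rec_i = [1, 0, 1, 0, 0, 0]  # hypothetical asymmetric test results
rec_j = [1, 0, 1, 0, 1, 0]
print(binary_dissim(rec_i, rec_j, symmetric=False))  # Jaccard: (0+1)/(2+0+1) ≈ 0.33
```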
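The weighted formula from 4.5 ties the preceding variable types together. A sketch under stated assumptions: the type tags, parameter names, and the three-variable example are mine, and ordinal values are assumed to arrive already converted to ranks:

```python
def mixed_dissim(x, y, types, ranges=None, levels=None):
    """Weighted mixed-type dissimilarity d(i,j) from 4.5.
    types[f] is one of 'sym_bin', 'asym_bin', 'nominal', 'interval', 'ordinal';
    ranges[f] is max - min for interval variables, levels[f] is M_f for ordinal;
    None marks a missing value."""
    num = den = 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        # delta_ij^(f) = 0 when a value is missing, or when f is asymmetric
        # binary and both values are 0; skip such variables entirely.
        if xf is None or yf is None:
            continue
        if types[f] == 'asym_bin' and xf == 0 and yf == 0:
            continue
        if types[f] in ('sym_bin', 'asym_bin', 'nominal'):
            d = 0.0 if xf == yf else 1.0
        elif types[f] == 'interval':
            d = abs(xf - yf) / ranges[f]            # normalized distance
        else:                                       # ordinal: map ranks to [0, 1]
            d = abs((xf - 1) / (levels[f] - 1) - (yf - 1) / (levels[f] - 1))
        num += d
        den += 1.0
    return num / den

# Hypothetical object pair: one nominal, one interval (range 10), one ordinal (M_f = 3).
print(mixed_dissim(['red', 4.0, 2], ['blue', 9.0, 3],
                   ['nominal', 'interval', 'ordinal'],
                   ranges={1: 10.0}, levels={2: 3}))  # (1 + 0.5 + 0.5) / 3 ≈ 0.67
```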
5. More About Clustering Algorithms: K-Means and K-Medoids
- Partitioning method: construct a partition of n objects into a set of k clusters.
- Similarity function: usually a distance.
- k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
- k-medoids, or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

Comments on the K-Means Method
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
Weaknesses:
- Applicable only when a mean is defined; what about categorical data?
- The number of clusters, k, must be specified in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.

The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters.
- PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering.

PAM (Partitioning Around Medoids)
PAM (Kaufman and Rousseeuw, 1987) uses real objects to represent the clusters (a code sketch follows the homework):
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih.
3. For each pair of i and h, if TC_ih < 0, replace i with h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
PAM works effectively for small data sets but does not scale well to large data sets.

6. Agglomerative Clustering (10.4)
1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the two chosen clusters into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as the final result.
(A code sketch also follows the homework.)

Agglomerative Clustering: An Example
[Figure-only slides; images not reproduced.]

Summary: Requirements for Clustering Algorithms
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noisy data
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Challenges and Further Research
Considerable progress has been made in scalable clustering methods:
- Partitioning: k-means, k-medoids, CLARANS
- Hierarchical: BIRCH, CURE
- Density-based: DBSCAN, CLIQUE, OPTICS
- Grid-based: STING, WaveCluster
- Model-based: Autoclass, Denclue, Cobweb
Nevertheless, current clustering techniques do not address all of these requirements adequately.

Homework
1. Perform the third iteration of the k-means algorithm for the example given in the section "An Example Using K-Means". What are the new cluster centers?
2. Suppose the data mining task is to cluster the following 8 points (with (x, y) representing location) into 3 clusters: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is the Manhattan distance. Suppose we initially assign A1, B1, and C1 as the centers of the three clusters, respectively. Use the k-means algorithm to show only: (a) the three cluster centers after the first round of execution; (b) the final three clusters.
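The PAM sketch promised above: a greedy version of the swap loop from Section 5, assuming a precomputed distance matrix. It accepts any swap whose total cost change TC_ih is negative, i.e., any swap that lowers the total distance of the clustering:

```python
import numpy as np

def pam(D, k, seed=0):
    """PAM over an n x n distance matrix D: keep swapping a selected medoid
    i for a non-selected object h while the swap cost TC_ih is negative."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))  # arbitrary start

    def cost(ms):
        # Total distance of all objects to their nearest medoid.
        return D[:, ms].min(axis=1).sum()

    best, improved = cost(medoids), True
    while improved:
        improved = False
        for mi in range(k):                 # selected object i = medoids[mi]
            for h in range(n):              # candidate non-selected object h
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = h
                c = cost(trial)
                if c < best:                # TC_ih < 0: accept the swap
                    best, medoids, improved = c, trial, True
    return medoids, D[:, medoids].argmin(axis=1)

pts = np.array([[0, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
print(pam(D, k=2))  # two medoid indices plus a cluster label per object
```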
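And the agglomerative sketch for Section 6. The slides do not fix a between-cluster similarity, so single-link (closest-pair) distance is my assumption here:

```python
import numpy as np

def agglomerative(points):
    """Naive agglomerative clustering: start from singletons and repeatedly
    merge the two most similar clusters (single-link distance, an assumption),
    recording the clustering after every merge so one can be chosen later."""
    clusters = [[i] for i in range(len(points))]          # step 1: singletons
    D = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    history = []
    while len(clusters) > 1:                              # step 2: merge loop
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)                    # 2b: merge the pair
        history.append([list(c) for c in clusters])       # step 3 picks from these
    return history

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.5, 5.0]])
for clustering in agglomerative(pts):
    print(clustering)
```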
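Finally, a quick way to check answers to the second homework problem. One assumption: points are assigned with the Manhattan distance as specified, while new centers are taken as coordinate means, which is the usual reading of this exercise:

```python
import numpy as np

pts = {'A1': (2, 10), 'A2': (2, 5), 'A3': (8, 4), 'B1': (5, 8),
       'B2': (7, 5), 'B3': (6, 4), 'C1': (1, 2), 'C2': (4, 9)}
X = np.array(list(pts.values()), dtype=float)
centers = np.array([pts['A1'], pts['B1'], pts['C1']], dtype=float)  # initial centers

for rnd in range(1, 20):
    # Assign every point to its nearest center under Manhattan distance.
    d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)
    new = np.array([X[labels == c].mean(axis=0) for c in range(3)])
    if rnd == 1:
        print('centers after round 1:', new)  # part (a)
    if np.allclose(new, centers):
        break
    centers = new

for name, lab in zip(pts, labels):            # final clusters, part (b)
    print(name, '-> cluster', lab + 1)
```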