Technology Sharing

Cluster Analysis Methods (III)

2024-07-12



V. Quality Evaluation of Clustering

Cluster analysis decomposes a data set into several subsets; each subset is called a cluster, and the collection of all these subsets is called a clustering of the object set. A good clustering algorithm should produce high-quality clusters, that is, clusters whose overall within-cluster similarity is highest and whose overall between-cluster similarity is lowest. Since many clustering algorithms, such as the $k$-means algorithm, require the user to specify the number of clusters $k$ in advance, we first discuss a simple method for estimating $k$.

1. Estimation of the number of clusters

Many clustering algorithms, such as the $k$-means algorithm and even the DIANA algorithm, require the number of clusters $k$ to be specified in advance, and the value of $k$ greatly affects the quality of the clustering. However, determining $k$ in advance is not an easy task. We can first consider two extreme cases.
(1) Treat the entire data set $S$ as one cluster, i.e. $k=1$. This seems simple and convenient, but the resulting cluster analysis is of no value.
(2) Treat each object of the data set $S$ as its own cluster, i.e. $k=|S|=n$, which produces the most fine-grained clustering. Each cluster then has no intra-cluster difference, so the intra-cluster similarity is maximal; however, such a clustering cannot provide any useful general description of $S$.
It can be seen that the number of clusters $k$ should at least satisfy $2\le k\le n-1$, but it is still unclear which value of $k$ is most appropriate.
It is generally accepted that the value of $k$ can be estimated from the shape and scale of the data distribution and from the clustering resolution required by the user, and scholars have proposed many different estimation methods, such as the elbow method, cross-validation, and information-theoretic methods.
A simple and commonly used empirical rule suggests that for a data set of $n$ objects, the number of clusters can be taken as $k\approx\sqrt{n/2}$; under this choice each cluster is expected, on average, to contain about $\sqrt{2n}$ objects. On this basis, some authors have proposed the further restriction $k<\sqrt{n}$.
For example, if $n=8$, then $k=2$ is an appropriate number of clusters, with each cluster containing 4 points on average, and the additional empirical bound gives $k<\sqrt{8}\approx 2.83$. These two empirical rules for $k$ also suggest, from another angle, that $k=2$ is the most appropriate number of clusters in Example 10-5.
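To make the rule of thumb concrete, here is a minimal Python sketch (the function name `estimate_k` is our own illustration, not part of the text):

```python
import math

def estimate_k(n):
    """Rule of thumb: k ~ sqrt(n / 2); each cluster then holds about sqrt(2n) objects on average."""
    k = max(2, round(math.sqrt(n / 2)))
    upper_bound = math.sqrt(n)          # additional empirical restriction: k < sqrt(n)
    return k, upper_bound

print(estimate_k(8))   # (2, 2.828...): k = 2, consistent with k < sqrt(8) ~ 2.83
```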

2. External quality evaluation

Once the number of clusters $k$ has been estimated, one or more clustering methods, for example the $k$-means algorithm, an agglomerative hierarchical algorithm, or the DBSCAN algorithm, can be applied to the data set, yielding several different clustering results. The question then is which method produces the better clustering, or how to compare the clusterings generated by different methods; this is the problem of clustering quality evaluation.
Currently there are many methods for clustering quality evaluation, but they can generally be divided into two categories: external quality evaluation and internal quality evaluation.
External quality evaluation assumes that an ideal clustering of the data set already exists (usually constructed by experts) and uses it as a common benchmark against which the clustering produced by an algorithm is compared. Two common comparison methods are clustering entropy and clustering accuracy.

1. Clustering entropy method

Assume the data set is $S=\{X_1,X_2,\dots,X_n\}$, that $T=\{T_1,T_2,\dots,T_m\}$ is the ideal benchmark clustering given by experts, and that $C=\{C_1,C_2,\dots,C_k\}$ is a clustering of $S$ produced by some algorithm. The clustering entropy of a cluster $C_i$ relative to the benchmark clustering $T$ is defined as
$$E(C_i\mid T)=-\sum_{j=1}^{m}\frac{|C_i\cap T_j|}{|C_i|}\log_2\frac{|C_i\cap T_j|}{|C_i|}\tag{10-20}$$
and the overall clustering entropy of $C$ with respect to the benchmark $T$ is defined as the weighted average of the clustering entropies of the $C_i$ with respect to $T$:
$$E(C)=\frac{1}{\sum_{i=1}^{k}|C_i|}\sum_{i=1}^{k}|C_i|\times E(C_i\mid T)\tag{10-21}$$
The clustering entropy method holds that the smaller the value of $E(C)$, the higher the quality of the clustering $C$ relative to the benchmark $T$.
It is worth noting that the denominator $\sum_{i=1}^{k}|C_i|$ in formula (10-21) is the total number of elements over all clusters and cannot simply be replaced by $n$: only when $C$ is a partition of $S$ does the denominator equal $n$, whereas for more general clustering methods, such as DBSCAN, it may be smaller than $n$.
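Formulas (10-20) and (10-21) translate directly into code. The following is a minimal Python sketch (the function name and the toy clusterings are our own illustration); clusters and benchmark classes are represented as sets of object ids, and the denominator is the sum of cluster sizes rather than n, as noted above:

```python
import math

def clustering_entropy(C, T):
    """Overall clustering entropy E(C) of clustering C against benchmark T, per (10-20)/(10-21).
    C and T are lists of sets of object ids; C need not cover every object."""
    total = sum(len(Ci) for Ci in C)          # sum of cluster sizes, not necessarily n
    E = 0.0
    for Ci in C:
        Ei = 0.0
        for Tj in T:
            p = len(Ci & Tj) / len(Ci)
            if p > 0:
                Ei -= p * math.log2(p)        # formula (10-20)
        E += (len(Ci) / total) * Ei           # weighted average, formula (10-21)
    return E

# toy benchmark with two true classes and one algorithmic clustering
T = [{1, 2, 3, 4}, {5, 6, 7, 8}]
C = [{1, 2, 3, 5}, {4, 6, 7, 8}]
print(round(clustering_entropy(C, T), 3))     # smaller values mean closer agreement with T
```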

2. Clustering accuracy

The basic idea of clustering accuracy evaluation is to take the class with the largest overlap in a cluster as the class label of that cluster; that is, for cluster $C_i$, if there exists $T_j$ such that $|C_i\cap T_j|=\max\{|C_i\cap T_1|,|C_i\cap T_2|,\cdots,|C_i\cap T_m|\}$, then the class of $C_i$ is taken to be $T_j$. Accordingly, the accuracy of cluster $C_i$ with respect to the benchmark $T$ is defined as
$$J(C_i\mid T)=\frac{\max\{|C_i\cap T_1|,|C_i\cap T_2|,\cdots,|C_i\cap T_m|\}}{|C_i|}\tag{10-22}$$
and the overall accuracy of $C$ with respect to the benchmark $T$ is defined as the weighted average of the clustering accuracies of the clusters $C_i$ with respect to $T$:
$$J(C)=\frac{1}{\sum_{i=1}^{k}|C_i|}\sum_{i=1}^{k}|C_i|\times J(C_i\mid T)\tag{10-23}$$
The clustering accuracy method holds that the larger the value of $J(C)$, the higher the quality of the clustering $C$ relative to the benchmark $T$.
In addition, $1-J(C)$ is generally called the overall error rate of $C$ with respect to the benchmark $T$. A large clustering accuracy $J(C)$, or equivalently a small overall error rate $1-J(C)$, means that the clustering algorithm does a better job of placing objects of different classes into different clusters, that is, the clustering quality is high.
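Clustering accuracy can be sketched in the same style (again our own illustrative code, reusing the toy clusterings from the entropy example):

```python
def clustering_accuracy(C, T):
    """Overall clustering accuracy J(C) per (10-22)/(10-23): each cluster is labelled
    with the benchmark class it overlaps the most."""
    total = sum(len(Ci) for Ci in C)
    J = 0.0
    for Ci in C:
        best = max(len(Ci & Tj) for Tj in T)          # size of the dominant class in C_i
        J += (len(Ci) / total) * (best / len(Ci))     # weighted average, formula (10-23)
    return J

T = [{1, 2, 3, 4}, {5, 6, 7, 8}]
C = [{1, 2, 3, 5}, {4, 6, 7, 8}]
J = clustering_accuracy(C, T)
print(J, 1 - J)    # overall accuracy 0.75, overall error rate 0.25
```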

3. Internal quality evaluation

Internal quality evaluation uses no external benchmark; it evaluates a clustering $C$ using only the data set $S$ and the inherent features and quantities of $C$ itself. The clustering effect is generally evaluated by computing the average within-cluster similarity, the average between-cluster similarity, or the overall similarity.
Internal quality evaluation is related to the clustering algorithm. Clustering validity indices are mainly used to evaluate the quality of a clustering or to determine the optimal number of clusters. The ideal clustering has the smallest intra-cluster distances and the largest inter-cluster distances, so clustering validity is generally measured by some ratio of intra-cluster distance to inter-cluster distance. Common indices of this type include the CH index, the Dunn index, the I index, and the Xie-Beni index.

1. CH index

The CH index is the abbreviation of the Calinski-Harabasz index. It first calculates the sum of the squares of the distances between each point in each cluster and its cluster center to measure the compactness within the class; then it calculates the sum of the squares of the distances between each cluster center and the center of the data set to measure the separation of the data set. The ratio of separation to compactness is the CH index.
Let $\overline{X}_i$ denote the center (mean) of cluster $C_i$, let $\overline{X}$ denote the center of the data set $S$, and let $d(\overline{X}_i,\overline{X})$ be a distance function from $\overline{X}_i$ to $\overline{X}$. The compactness of the clustering $C$ is defined as
$$\text{Trace}(A)=\sum_{i=1}^{k}\sum_{X_j\in C_i}d(X_j,\overline{X}_i)^2\tag{10-24}$$
that is, $\text{Trace}(A)$ is the sum over all clusters of the squared distances from the points of a cluster to its center. The separation of the clustering $C$ is defined as
$$\text{Trace}(B)=\sum_{i=1}^{k}|C_i|\,d(\overline{X}_i,\overline{X})^2\tag{10-25}$$
that is, $\text{Trace}(B)$ is the weighted sum of the squared distances from each cluster center of $C$ to the center of $S$.
Then, with $N=\sum_{i=1}^{k}|C_i|$, the CH index is defined as
$$V_{\text{CH}}(k)=\frac{\text{Trace}(B)/(k-1)}{\text{Trace}(A)/(N-k)}\tag{10-26}$$
Formula (10-26) is generally used in the following two situations.
(1) Evaluate which of the two algorithms produces a better clustering.
Suppose two algorithms are applied to the data set $S$ and produce two different clusterings (each containing $k$ clusters). The clustering with the larger CH value is the better one, because a larger CH value means that the clusters are internally more compact and mutually better separated.
(2) Evaluate which of the two clusterings with different numbers of clusters obtained by the same algorithm is better.
Suppose an algorithm has clustered the data set $S$ twice, obtaining clusterings with $k_1$ and $k_2$ clusters respectively. The clustering with the larger CH value is the better one, which also indicates that its number of clusters is more appropriate. Hence, by applying formula (10-26) repeatedly, the optimal number of clusters for clustering the data set $S$ can also be obtained.
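For concreteness, formulas (10-24) to (10-26) can be sketched with NumPy as follows (the function name is our own; for partition clusterings the result should agree with `sklearn.metrics.calinski_harabasz_score`):

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz index per (10-24)-(10-26); X is an (n, d) array,
    labels gives the cluster id of each row."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_center = X.mean(axis=0)
    ids = np.unique(labels)
    k, N = len(ids), len(X)
    trace_A = 0.0   # within-cluster compactness, formula (10-24)
    trace_B = 0.0   # between-cluster separation, formula (10-25)
    for c in ids:
        pts = X[labels == c]
        center = pts.mean(axis=0)
        trace_A += ((pts - center) ** 2).sum()
        trace_B += len(pts) * ((center - overall_center) ** 2).sum()
    return (trace_B / (k - 1)) / (trace_A / (N - k))   # formula (10-26)

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(ch_index(X, labels))   # larger values indicate tighter, better-separated clusters
```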

2. Dunn indicator

The Dunn index uses the minimum distance $d_s(C_i,C_j)$ between clusters $C_i$ and $C_j$ to measure the separation between clusters, and uses the largest cluster diameter among all clusters, $\max\{\varPhi(C_1),\varPhi(C_2),\dots,\varPhi(C_k)\}$, to characterize the compactness within clusters. The Dunn index is the minimum over all pairs of clusters of the ratio of the former to the latter, that is,
$$V_D(k)=\min_{i\ne j}\frac{d_s(C_i,C_j)}{\max\{\varPhi(C_1),\varPhi(C_2),\dots,\varPhi(C_k)\}}\tag{10-27}$$
The larger the Dunn value, the farther apart the clusters are, and the better the corresponding clustering. Like the CH index, the Dunn index can be used to compare the clusterings produced by different algorithms, and also to decide which of several clusterings with different numbers of clusters produced by the same algorithm is better, i.e. it can be used to find the optimal number of clusters for $S$.
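A corresponding sketch of the Dunn index (10-27), assuming Euclidean distance, single-link separation between clusters, and the maximum pairwise distance within a cluster as its diameter $\varPhi$:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Dunn index per (10-27): minimum single-link distance between two different clusters
    divided by the largest cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diameter = max(cdist(c, c).max() for c in clusters)
    min_separation = min(cdist(ci, cj).min()
                         for i, ci in enumerate(clusters)
                         for j, cj in enumerate(clusters) if i < j)
    return min_separation / max_diameter

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(dunn_index(X, labels))   # larger is better
```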

VI. Outlier Mining

Outliers are special data that deviate significantly from the majority of data in a data set. The data mining algorithms introduced above, such as classification and clustering, focus on discovering common patterns that apply to most data. Therefore, many data mining algorithms attempt to reduce or eliminate the impact of outliers and remove or ignore outliers as noise when implementing mining. However, in many practical applications, people suspect that the deviation of outliers is not caused by random factors, but may be caused by other completely different mechanisms, and they need to be mined for special analysis and utilization. For example, in application fields such as safety management and risk control, the pattern of identifying outliers is more valuable than the pattern of normal data.

1. Overview of related issues

The word "outlier" is often used interchangeably with "anomaly", and it has many aliases in different applications, such as isolated point, abnormal point, novel point, deviation point, exceptional point, noise, and abnormal data. In the Chinese literature, outlier mining also appears under similar terms such as abnormal data mining, abnormal data detection, outlier data mining, exceptional data mining, and rare event mining.

1. Generation of outliers

(1) Anomalies arising from fraud, intrusion, disease outbreaks, unusual experimental results, and the like. For example, someone whose average monthly phone bill is about 200 yuan suddenly runs up a bill of several thousand yuan in a certain month; or someone whose credit card spending is usually about 5,000 yuan per month spends more than 30,000 yuan in a certain month. Outliers of this type are usually the most interesting in data mining and are one of the key focuses of applications.
(2) Caused by inherent changes in data variables, reflecting the natural characteristics of data distribution, such as climate change, new customer purchasing patterns, gene mutations, etc. This is also one of the interesting focus areas.
(3) Data measurement and collection errors are mainly due to human error, measurement equipment failure or noise. For example, a student's grade of -100 in a course may be caused by the default value of the program setting; the salary of a company's senior management is significantly higher than that of ordinary employees, which may seem like an outlier, but it is reasonable data.

2. Outlier mining problem

Generally, the outlier mining problem can be decomposed into three sub-problems.
(1) Defining outliers
Since outliers are closely related to practical problems, clearly defining what kind of data is an outlier or abnormal data is the premise and primary task of outlier mining. Generally, it is necessary to combine the experience and knowledge of domain experts to give an appropriate description or definition of outliers.
(2) Mining outliers
After the outliers are clearly defined, the key task of outlier mining is to use an algorithm to effectively identify or mine the defined outliers. Outlier mining algorithms usually provide users with suspicious outlier data from the perspective of the laws that the data can reflect in order to attract the user's attention.
(3) Understanding outliers
The goal of outlier mining is to reasonably explain and understand the mining results and to use them to guide practical applications. Since the mechanism that generates outliers is uncertain, whether the "outliers" detected by an outlier mining algorithm actually correspond to real abnormal behavior cannot be explained by the algorithm itself; it can only be understood and interpreted by industry or domain experts.

3. Relativity of outliers

Outliers are special data in a data set that differ obviously from most of the data, but "obviously" and "most of the data" are relative notions; that is, whether an object is an outlier is relative. Therefore, several issues need to be considered when defining and mining outliers.
(1) Global or local outliers
A data object may be an outlier relative to its local neighborhood but not relative to the entire data set. For example, a student 1.9 meters tall is an outlier within a single class of mathematics majors at our school, but not within the population of the whole country, which includes professional players such as Yao Ming.
(2) Number of outliers
Although the number of outliers is unknown, the number of normal points should far exceed the number of outliers, that is, the proportion of outliers in a large data set should be low. It is generally believed that the number of outliers should be less than 5% or even less than 1%.
(3) Outlier factor of a point
It is not appropriate to report simply "yes" or "no" for whether an object is an outlier. Instead, the degree of deviation of the object, that is, its outlier factor (or outlier score), should be used to characterize how far the object deviates from the rest of the data. Objects whose outlier factor exceeds a certain threshold are then filtered out and provided to decision makers or domain experts for understanding and interpretation, and applied in practical work.

2. Distance-based methods

1. Basic concepts

Definition 10-11 Given a positive integer $k$, the $k$-nearest-neighbor distance of an object $X$ is the value $d_k(X)$ that satisfies the following conditions:
(1) apart from $X$ itself, there are at least $k$ objects $Y$ satisfying $d(X,Y)\le d_k(X)$;
(2) apart from $X$ itself, there are at most $k-1$ objects $Y$ satisfying $d(X,Y)<d_k(X)$;
where $d(X,Y)$ is some distance function between objects $X$ and $Y$.

The larger the $k$-nearest-neighbor distance of an object, the more likely the object is to be far away from most of the data objects; therefore the $k$-nearest-neighbor distance $d_k(X)$ of an object $X$ can be used as its outlier factor.

Definition 10-12 Let $D(X,k)=\{Y\mid d(X,Y)\le d_k(X)\wedge Y\ne X\}$; then $D(X,k)$ is called the $k$-nearest neighborhood of $X$.

From Definition 10-12 it can be seen that $D(X,k)$ consists of the objects $Y$ whose distance from $X$ does not exceed $d_k(X)$, with $X$ as the center. It is worth noting that $X$ does not belong to its own $k$-nearest neighborhood, i.e. $X\notin D(X,k)$. In particular, the number of objects contained in $D(X,k)$ may exceed $k$ (when several objects tie at the distance $d_k(X)$), i.e. $|D(X,k)|\ge k$.

Definition 10-13 Given a positive integer $k$, the $k$-nearest-neighbor outlier factor of an object $X$ is defined as
$$\text{OF}_1(X,k)=\frac{\sum\limits_{Y\in D(X,k)}d(X,Y)}{|D(X,k)|}\tag{10-28}$$

2. Algorithm Description

For a given data set and nearest-neighbor number $k$, the above formula can be used to compute the $k$-nearest-neighbor outlier factor of every object; the objects are then sorted by outlier factor in descending order and output. The objects with the largest outlier factors are the most likely to be outliers; generally, decision makers or domain experts still need to analyze and judge which of them are truly outliers.

Algorithm 10-8 Distance-based outlier detection algorithm
Input: data set $S$, nearest-neighbor number $k$
Output: a table of suspected outliers and their corresponding outlier factors, in descending order
(1) REPEAT
(2) take an unprocessed object $X$ from $S$;
(3) determine the $k$-nearest neighborhood $D(X,k)$ of $X$;
(4) compute the $k$-nearest-neighbor outlier factor $\text{OF}_1(X,k)$ of $X$;
(5) UNTIL every point of $S$ has been processed;
(6) sort by $\text{OF}_1(X,k)$ in descending order and output the pairs $(X,\text{OF}_1(X,k))$.
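A compact NumPy sketch of Algorithm 10-8 might look as follows (an illustration under the definitions above; the function name is our own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_distance_outlier_factors(X, k):
    """OF1(X, k) per formula (10-28): the average distance from each object to the members
    of its k-nearest neighborhood D(X, k). Ties at the k-th distance enlarge D(X, k).
    Returns (object index, OF1) pairs sorted in descending order of OF1."""
    X = np.asarray(X, dtype=float)
    dist = cdist(X, X)
    ranked = []
    for i in range(len(X)):
        d = np.delete(dist[i], i)              # distances from X_i to all other objects
        d_k = np.sort(d)[k - 1]                # k-nearest-neighbor distance d_k(X_i)
        neighbors = d[d <= d_k]                # distances to the members of D(X_i, k)
        ranked.append((i, float(neighbors.mean())))   # OF1(X_i, k)
    return sorted(ranked, key=lambda t: t[1], reverse=True)
```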

3. Calculation Example

Example 10-12 Suppose the 11 points of a two-dimensional data set $S$ are given in Table 10-10, and let $k=2$. Using the Euclidean distance, compute the outlier factors of $X_7$, $X_{10}$ and $X_{11}$ with respect to all other points.

[Table 10-10: the two-dimensional data set $S$ (image not reproduced)]
Solution: To understand the principle of the algorithm intuitively, the data objects of $S$ are plotted on the plane shown in Figure 10-27 below.

[Figure 10-27: the data objects of $S$ plotted on the plane (image not reproduced)]
The outlier factors of the specified point and other points are calculated below.

(1) Computing the outlier factor of object $X_7$
As can be seen from the figure, the point nearest to $X_7=(6,8)$ is $X_{10}=(5,7)$, with $d(X_7,X_{10})=1.41$; the other candidate nearest points are $X_{11}=(5,2)$, $X_9=(3,2)$ and $X_8=(2,4)$.
Computation gives $d(X_7,X_{11})=6.08$, $d(X_7,X_9)=6.71$ and $d(X_7,X_8)=5.66$.
Since $k=2$, we have $d_2(X_7)=5.66$, and therefore, by Definitions 10-11 and 10-12, $D(X_7,2)=\{X_{10},X_8\}$.
By formula (10-28), the outlier factor of $X_7$ is
$$\text{OF}_1(X_7,2)=\frac{\sum\limits_{Y\in D(X_7,2)}d(X_7,Y)}{|D(X_7,2)|}=\frac{d(X_7,X_{10})+d(X_7,X_8)}{2}=\frac{1.41+5.66}{2}=3.54$$
(2) Computing the outlier factor of object $X_{10}$: $\text{OF}_1(X_{10},2)=2.83$

(3) Computing the outlier factor of object $X_{11}$: $\text{OF}_1(X_{11},2)=2.5$

(4) Computing the outlier factor of object $X_5$: $\text{OF}_1(X_5,2)=1$

Similarly, the outlier factors of the remaining objects can be computed; see Table 10-11 below.

[Table 10-11: outlier factors of the objects of $S$, in descending order (image not reproduced)]
4. Outlier Factor Threshold

According to the $k$-nearest-neighbor approach, the larger the outlier factor, the more likely the object is an outlier, so a threshold must be specified to separate outliers from normal points. The simplest method is to specify the number of outliers directly, but this is too crude: it sometimes misses real outliers, or labels too many normal points as possible outliers, making it difficult for domain experts or decision makers to understand and interpret the result. The outlier-factor segmentation threshold method proceeds as follows.
(1) First arrange the outlier factors in descending order, and renumber the data objects 1, 2, ... according to this order.
(2) Taking the outlier factor $\text{OF}_1(X,k)$ as the ordinate and the sequence number as the abscissa, plot the points (sequence number, $\text{OF}_1$ value) on the plane and connect them, forming a non-increasing polyline. Find the point where the polyline changes from dropping sharply to dropping gently, and take the corresponding outlier factor as the threshold: objects with outlier factors less than or equal to this threshold are regarded as normal objects, and the others as possible outliers.
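To support step (2), the descending OF1 curve can simply be plotted and inspected, e.g. with matplotlib (a sketch; locating the elbow is still left to the analyst):

```python
import matplotlib.pyplot as plt

def plot_of_curve(ranked):
    """ranked: (object index, OF1) pairs already sorted in descending order,
    e.g. the output of knn_distance_outlier_factors above."""
    values = [of for _, of in ranked]
    plt.plot(range(1, len(values) + 1), values, marker="o")
    plt.xlabel("sequence number (descending OF1)")
    plt.ylabel("OF1")
    plt.show()
```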

Example 10-13 For the data set $S$ of Example 10-12, whose outlier factors are arranged in descending order in Table 10-11, find the outlier threshold using the outlier-factor segmentation threshold method.

Solution: First, take the (sequence number, $\text{OF}_1$ value) pairs of Table 10-11 as points in the plane, mark them, and connect them with line segments, as shown in Figure 10-28.

[Figure 10-28: polyline of (sequence number, $\text{OF}_1$) points (image not reproduced)]
Observing Figure 10-28, the polyline to the left of the fourth point (4, 1.27) drops very steeply, while to its right it drops very gently, so we choose the outlier factor 1.27 as the threshold. The outlier factors of $X_7$, $X_{10}$ and $X_{11}$ are 3.54, 2.83 and 2.5, all greater than 1.27, so these three points are the most likely outliers, and the rest are ordinary points.
Looking again at Figure 10-27, $X_7$, $X_{10}$ and $X_{11}$ are indeed far away from the dense majority of objects on the left, so treating them as the outliers of the data set $S$ is reasonable.

5. Algorithm Evaluation

The biggest advantage of the distance-based outlier detection method is its simple principle and ease of use. Its shortcomings are mainly reflected in the following aspects.
(1) There is no simple and effective method for choosing the parameter $k$, nor is there a generally accepted analysis of how sensitive the results are to the choice of $k$.
(2) The time complexity is $O(|S|^2)$, so the method lacks scalability for large data sets.
(3) Due to the use of a global outlier factor threshold, it is difficult to mine outliers in a dataset with regions of different densities.

3. Method based on relative density

The distance-based method is a global outlier detection method; it cannot handle data sets that contain regions of different density, that is, it cannot detect outliers that are only outlying within a local density region. In practice, data are not always distributed with a single density. When a data set contains several density distributions, or is a mixture of subsets of different density, a global method such as the distance-based one usually does not work well, because whether an object is an outlier depends not only on its distance to the surrounding data but also on the density of its neighborhood.

1. The concept of relative density

From the perspective of density neighborhood, outliers are objects in low-density areas. Therefore, it is necessary to introduce the concepts of local neighborhood density and relative density of objects.

Definition 10-14 (1) The $k$-nearest-neighbor local density of an object $X$ is defined as
$$\text{dsty}(X,k)=\frac{|D(X,k)|}{\sum\limits_{Y\in D(X,k)}d(X,Y)}\tag{10-29}$$
(2) The $k$-nearest-neighbor local relative density (relative density) of an object $X$ is
$$\text{rdsty}(X,k)=\frac{\sum\limits_{Y\in D(X,k)}\text{dsty}(Y,k)\,/\,|D(X,k)|}{\text{dsty}(X,k)}\tag{10-30}$$
where $D(X,k)$ is the $k$-nearest neighborhood of $X$ (given in Definition 10-12) and $|D(X,k)|$ is the number of objects it contains.

2. Algorithm Description

Taking $\text{rdsty}(X,k)$ as the outlier factor $\text{OF}_2(X,k)$, the computation proceeds in two steps:
(1) for the given neighbor number $k$, compute the $k$-nearest-neighbor local density $\text{dsty}(X,k)$ of each object $X$;
(2) compute the average density of the neighbors of $X$ and from it the $k$-nearest-neighbor local relative density $\text{rdsty}(X,k)$.
A data set usually consists of several natural clusters. The relative density of objects near the core of a cluster is close to 1, while objects at the edge of a cluster or outside any cluster have relatively large relative density. Therefore, the larger the relative density value, the more likely the object is an outlier.

Algorithm 10-9 Outlier detection algorithm based on relative density
Input: data set $S$, nearest-neighbor number $k$
Output: a table of suspected outliers and their corresponding outlier factors, in descending order
(1) REPEAT
(2) take an unprocessed object $X$ from $S$;
(3) determine the $k$-nearest neighborhood $D(X,k)$ of $X$;
(4) use $D(X,k)$ to compute the density $\text{dsty}(X,k)$ of $X$;
(5) UNTIL every point of $S$ has been processed;
(6) REPEAT
(7) take an unprocessed object $X$ from $S$;
(8) determine the relative density $\text{rdsty}(X,k)$ of $X$ and assign it to $\text{OF}_2(X,k)$;
(9) UNTIL every object of $S$ has been processed;
(10) sort by $\text{OF}_2(X,k)$ in descending order and output the pairs $(X,\text{OF}_2(X,k))$.
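A sketch of Algorithm 10-9 in the same style as the earlier one (names are our own; formula (10-30) is implemented with the neighbors' mean density in the numerator):

```python
import numpy as np
from scipy.spatial.distance import cdist

def relative_density_outlier_factors(X, k):
    """dsty(X, k) per (10-29) and OF2(X, k) = rdsty(X, k) per (10-30):
    the neighbors' mean local density divided by the object's own local density.
    Returns (object index, OF2) pairs sorted in descending order."""
    X = np.asarray(X, dtype=float)
    dist = cdist(X, X)
    n = len(X)
    neighborhoods, density = [], np.empty(n)
    for i in range(n):                               # first pass: D(X_i, k) and dsty(X_i, k)
        d = np.delete(dist[i], i)
        others = np.delete(np.arange(n), i)
        d_k = np.sort(d)[k - 1]
        mask = d <= d_k
        neighborhoods.append(others[mask])
        density[i] = mask.sum() / d[mask].sum()
    ranked = []
    for i in range(n):                               # second pass: rdsty(X_i, k) = OF2(X_i, k)
        nb = neighborhoods[i]
        ranked.append((i, float(density[nb].mean() / density[i])))
    return sorted(ranked, key=lambda t: t[1], reverse=True)
```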

Example 10-14 For the two-dimensional data set $S$ given in Example 10-12 (see Table 10-10) and $k=2$, use the Euclidean distance to compute the relative-density-based outlier factors of $X_7$, $X_{10}$ and $X_{11}$.

Solution: Since $k=2$, we need the 2-nearest-neighbor local density of every object.

(1) Find the 2-nearest neighborhood $D(X_i,2)$ of each data object.
Using the same calculation method as in Example 10-12, we obtain
$$\begin{aligned}
&D(X_1,2)=\{X_2,X_3,X_5\}, &&D(X_2,2)=\{X_1,X_6\}, &&D(X_3,2)=\{X_1,X_4\},\\
&D(X_4,2)=\{X_3,X_5\}, &&D(X_5,2)=\{X_1,X_4,X_6,X_9\}, &&D(X_6,2)=\{X_2,X_5,X_8\},\\
&D(X_7,2)=\{X_{10},X_8\}, &&D(X_8,2)=\{X_2,X_6\}, &&D(X_9,2)=\{X_5,X_4,X_6\},\\
&D(X_{10},2)=\{X_7,X_8\}, &&D(X_{11},2)=\{X_9,X_5\}
\end{aligned}$$

(2) Compute the local density $\text{dsty}(X_i,2)$ of each data object.

① Computing the density of $X_1$
Since $D(X_1,2)=\{X_2,X_3,X_5\}$, we compute $d(X_1,X_2)=1$, $d(X_1,X_3)=1$ and $d(X_1,X_5)=1$.
By formula (10-29) we get
$$\text{dsty}(X_1,2)=\frac{|D(X_1,2)|}{\sum\limits_{Y\in D(X_1,2)}d(X_1,Y)}=\frac{3}{1+1+1}=1$$

② Computing the density of $X_2$
Since $D(X_2,2)=\{X_1,X_6\}$, we compute $d(X_2,X_1)=1$ and $d(X_2,X_6)=1$.
By formula (10-29) we get
$$\text{dsty}(X_2,2)=\frac{|D(X_2,2)|}{\sum\limits_{Y\in D(X_2,2)}d(X_2,Y)}=\frac{2}{1+1}=1$$

The local density of other data objects can be calculated similarly, see Table 10-12 below.

[Table 10-12: 2-nearest-neighbor local densities $\text{dsty}(X_i,2)$ of all objects (image not reproduced)]
(3) Compute the relative density $\text{rdsty}(X_i,2)$ of each object $X_i$ and use it as the outlier factor $\text{OF}_2$.
① Computing the relative density of $X_1$
Using the density values of the objects in Table 10-12 and the relative density formula (10-30), we have
$$\text{rdsty}(X_1,2)=\frac{\sum\limits_{Y\in D(X_1,2)}\text{dsty}(Y,2)\,/\,|D(X_1,2)|}{\text{dsty}(X_1,2)}=\frac{(1+1+1)/3}{1}=1=\text{OF}_2(X_1,2)$$

② The relative densities of $X_2,X_3,\dots,X_{11}$ can be computed similarly.
For example, the relative density of $X_5$:
$$\text{rdsty}(X_5,2)=\frac{\sum\limits_{Y\in D(X_5,2)}\text{dsty}(Y,2)\,/\,|D(X_5,2)|}{\text{dsty}(X_5,2)}=\frac{(1+1+1+0.79)/4}{1}=0.95=\text{OF}_2(X_5,2)$$
The results are summarized in Table 10-13 below.

[Table 10-13: relative densities (outlier factors $\text{OF}_2$) of all objects (image not reproduced)]
Example 10-15 For the data set given in Table 10-14, use the Euclidean distance with $k=2,3,5$ to compute, for each point, the $k$-nearest-neighbor local density, the $k$-nearest-neighbor local relative density (outlier factor $\text{OF}_2$), and the outlier factor $\text{OF}_1$ based on the $k$-nearest-neighbor distance.

[Table 10-14: data set $S$ for Example 10-15 (image not reproduced)]
Solution: (1) For ease of understanding, the relative positions of the points of $S$ are marked on the two-dimensional plane (Figure 10-30).

[Figure 10-30: relative positions of the points of $S$ (image not reproduced)]
(2) Apply the distance-based Algorithm 10-8 and the relative-density-based Algorithm 10-9 to compute, for each point, the $k$-nearest-neighbor local density $\text{dsty}$, the $k$-nearest-neighbor local relative density (outlier factor $\text{OF}_2$), and the outlier factor $\text{OF}_1$ based on the $k$-nearest-neighbor distance; the results are summarized in Table 10-15.

[Table 10-15: $\text{dsty}$, $\text{OF}_2$ and $\text{OF}_1$ for $k=2,3,5$ (image not reproduced)]
(3) Simple analysis
① As can be seen from Figure 10-30, $X_{15}$ and $X_{16}$ are the two obvious outliers in $S$, and both the distance-based and the relative-density-based method mine them well.
② This example also shows that the two algorithms are not as sensitive to $k$ as one might expect, perhaps because the outliers $X_{15}$ and $X_{16}$ are clearly separated from the other objects.
③ Table 10-15 also shows that whether $k$ is 2, 3 or 5, the $\text{dsty}$ values in the region around $X_1$ are significantly lower than those in the region around $X_7$, which is consistent with the regional densities shown in Figure 10-30. However, the relative density values $\text{OF}_2$ of the two regions show almost no noticeable difference. This is determined by the nature of relative density: for uniformly distributed data points, the relative density of the core points is 1, regardless of the spacing between the points.

VII. Other Clustering Methods

1. Improved clustering algorithm

  (1) The $k$-modes algorithm was proposed to achieve fast clustering of discrete data, since the $k$-means algorithm is suitable only for numeric attributes. The $k$-modes algorithm uses simple 0-1 matching to measure the distance between two values of the same discrete attribute, which weakens the differences between values of an ordinal attribute; that is, it cannot fully reflect the true distance between two values of the same ordinal attribute, so there is still room for improvement.
  (2) The $k$-prototypes algorithm combines the advantages of the $k$-means and $k$-modes algorithms and can cluster data sets containing both discrete and numeric attributes (so-called mixed attributes). For the discrete attributes, the distance $d_1(X,Y)$ between objects $X$ and $Y$ is computed as in the $k$-modes algorithm; for the numeric attributes, the distance $d_2(X,Y)$ is computed as in the $k$-means algorithm; the weighted combination $\alpha d_1(X,Y)+(1-\alpha)d_2(X,Y)$ is then used as the distance $d(X,Y)$ between $X$ and $Y$, where $\alpha\in[0,1]$ is a weight coefficient, usually taken as $\alpha=0.5$ (a small sketch of this mixed distance is given after this list).
(3) The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm is a comprehensive hierarchical clustering method. It uses clustering features (CF) and a clustering feature tree (CF tree, similar to a B-tree) to summarize and describe a cluster $C_i$, where $\text{CF}_i=(n_i,\text{LS}_i,\text{SS}_i)$ is a triple: $n_i$ is the number of objects in the cluster, $\text{LS}_i$ is the linear sum of the objects' components, and $\text{SS}_i$ is the sum of squares of the objects' components.
(4) The CURE (Clustering Using REpresentatives) algorithm is another improvement of the $k$-means algorithm. Many clustering algorithms are good only at clustering spherical clusters, and some are sensitive to isolated points. To address both problems, CURE abandons the traditional approach of representing a cluster by a single point, whether the cluster center used by the $k$-means algorithm or the single representative object used by the $k$-medoids algorithm, and instead selects several representative objects within each cluster, so that it can adapt to non-spherical clusters and reduce the impact of noise on the clustering.
(5) The ROCK (RObust Clustering using linKs) algorithm is a clustering algorithm proposed for data sets with binary or categorical attributes.
(6) The OPTICS (Ordering Points To Identify the Clustering Structure) algorithm was proposed to reduce the sensitivity of the DBSCAN algorithm to its density parameters $(\varepsilon,\text{MinPts})$. It does not produce the resulting clusters explicitly; instead it generates an augmented cluster ordering for cluster analysis (for example, a plot with the reachability distance on the vertical axis and the output order of the sample points on the horizontal axis). This ordering represents the density-based clustering structure of every sample point, and from it the DBSCAN clustering result for any density parameters $(\varepsilon,\text{MinPts})$ can be obtained.
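The mixed distance of the $k$-prototypes algorithm in item (2) can be sketched as follows (our own illustration; the use of squared Euclidean distance for the numeric part is an assumption, and real implementations differ in details):

```python
import numpy as np

def mixed_distance(x_num, x_cat, y_num, y_cat, alpha=0.5):
    """alpha * d1 + (1 - alpha) * d2, where d1 is the 0-1 mismatch count over the
    discrete attributes (as in k-modes) and d2 is the squared Euclidean distance
    over the numeric attributes (as in k-means)."""
    d1 = sum(a != b for a, b in zip(x_cat, y_cat))
    x = np.asarray(x_num, dtype=float)
    y = np.asarray(y_num, dtype=float)
    d2 = float(((x - y) ** 2).sum())
    return alpha * d1 + (1 - alpha) * d2

# e.g. two customers with numeric (age, income) and discrete (gender, city) attributes
print(mixed_distance([30, 1.2], ["F", "Beijing"], [32, 1.0], ["M", "Beijing"]))
```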

2. Other new clustering methods

Use some new theories or techniques to design new clustering methods.

(1) Grid-based clustering method
The grid-based method quantizes the object space into a finite number of cells, forming a grid structure; the positions of the partition points in each dimension are stored in an array, and the partition lines run through the entire space. All clustering operations are performed on this grid structure (i.e. on the quantized space). The main advantage of this approach is its fast processing speed, which is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. However, the efficiency gain comes at the cost of reduced accuracy of the clustering results. Because grid clustering faces the problem of choosing the quantization scale, clusters are usually first sought starting from small cells; the cell volume is then gradually increased, and the process is repeated until satisfactory clusters are found.

(2) Model-based clustering method
Model-based methods assume a model for each cluster and look for the best fit of the data to the given model. They locate clusters by constructing a density function that reflects the spatial distribution of the samples, attempting to optimize the fit between the given data and certain data models.

(3) Clustering method based on fuzzy sets
In practice, most objects do not belong strictly to a single cluster; their degrees of membership are intermediate or uncertain, which makes soft partitioning appropriate. Fuzzy cluster analysis has the advantage of describing this intermediate nature of sample membership and can reflect the real world more objectively, so it has become one of the hot topics in cluster analysis research.
The fuzzy clustering algorithm is an unsupervised learning method based on fuzzy mathematics theory and an uncertain clustering method. Once fuzzy clustering was proposed, it received great attention from the academic community. Fuzzy clustering is a large clustering "family" and research on fuzzy clustering is also very active.

(4) Clustering method based on rough sets
Rough clustering is an uncertain clustering method based on rough set theory. From the coupling of rough sets and clustering algorithms, rough clustering methods can be divided into two categories: strong coupling rough clustering and weak coupling rough clustering.
Of course, there are far more new research directions in cluster analysis than these. For example, data stream mining and clustering algorithms, uncertain data and their clustering algorithms, quantum computing and quantum genetic clustering algorithms, etc., are all cutting-edge topics in clustering research that have emerged in recent years.

3. Other outlier mining methods

The outlier mining methods introduced above are only two representatives of outlier mining. There are many more mature outlier mining methods in practical applications. They can be classified from two perspectives: the type of technology used by the mining method or the degree of utilization of prior knowledge.

(1) Type of technology used
The main ones include statistics-based methods, distance-based methods, density-based methods, clustering-based methods, deviation-based methods, depth-based methods, wavelet transform-based methods, graph-based methods, pattern-based methods, neural network-based methods, etc.

(2) Utilization of prior knowledge
There are three common approaches, depending on the degree of information available about the normal or outlier classes:
① Unsupervised outlier detection method, that is, there is no prior knowledge such as category labels in the data set;
② Supervised outlier detection method, that is, extracting the features of outliers from the training set containing outliers and normal points;
③ Semi-supervised outlier detection method, the training data contains labeled normal data, but no information about outlier data objects.