
Cluster analysis methods (3)

2024-07-12



5. Evaluating clustering quality

Cluster analysis decomposes a data set into subsets: each subset is called a cluster, and the set of all these subsets is called a clustering of the object set. A good clustering algorithm should produce high-quality clusters, that is, clusters whose overall intra-cluster similarity is as high as possible while the overall similarity between clusters is as low as possible. Since many clustering algorithms, such as the k-means algorithm, require the user to specify the number of clusters k in advance, a simple method for estimating k is discussed below.

(1) Estimating the number of clusters

Many clustering algorithms, such as the k-means algorithm and even the DIANA algorithm, need the number of clusters k to be specified in advance, and the value of k strongly affects the quality of the clustering. However, determining the number of clusters beforehand is not an easy task. We can first consider two extreme cases.
(1) Treat the entire data set S as a single cluster, i.e., let k = 1. This looks simple and easy, but the result of such a cluster analysis is worthless.
(2) Treat every object of the data set S as its own cluster, i.e., let k = |S| = n, which yields the finest possible clustering. There is then no intra-cluster variation in any cluster, and intra-cluster similarity reaches its highest level. However, such a clustering cannot give any summary information about S.
It can be seen that the number of clusters k should at least satisfy 2 ≤ k ≤ n − 1, but which value of k is most appropriate remains unclear.
In general, the value of k can be estimated from the shape and scale of the data set's distribution together with the clustering resolution required by the user, and experts have proposed many different estimation methods, such as the elbow method, cross-validation, and information-theory-based methods.
A simple and commonly used empirical estimation method holds that for a data set with n objects, choosing the number of clusters k ≈ √(n/2) is appropriate; in expectation, each cluster then has about √(2n) objects. On this basis, some have proposed the further restriction that the number of clusters satisfy k < √n.
For example, if n = 8, then k = 2 is an appropriate number of clusters, with an average of 4 points per cluster, and the additional empirical formula gives k < √8 ≈ 2.83. Using these two pieces of information about the number of clusters, the empirical formulas suggest from one angle that k = 2 is the most appropriate number of clusters in Example 10-5.
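The empirical rule above fits in a few lines of Python. This is an illustrative sketch, not a definitive recipe: the function name `estimate_k` is ours, and we clamp the result to the admissible range 2 ≤ k ≤ n − 1 discussed earlier.

```python
import math

def estimate_k(n: int) -> int:
    """Empirical estimate of the number of clusters for n objects:
    k is taken near sqrt(n/2), clamped to 2 <= k <= n - 1."""
    k = round(math.sqrt(n / 2))
    return max(2, min(k, n - 1))

k = estimate_k(8)
print(k)                  # 2, matching the worked example above
print(k < math.sqrt(8))   # additional constraint k < sqrt(n): True
```

For n = 8 this reproduces k = 2, which also satisfies the additional constraint k < √8 ≈ 2.83.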

(2) External quality evaluation

If we have a good estimate of the number of clusters k, we can apply one or several clustering methods, for example the k-means algorithm, an agglomerative hierarchical algorithm, or the DBSCAN algorithm, to a given data set and obtain several different clustering results. The question then is which method produced the better clustering, or in other words, how to compare the clustering results produced by different methods. This is clustering quality evaluation.
Many methods are currently available for clustering quality evaluation, but they can generally be divided into two categories: external (extrinsic) quality evaluation and internal (intrinsic) quality evaluation.
External quality evaluation assumes that an ideal clustering of the data set already exists (usually constructed by experts) and uses it as a benchmark against which the clustering result of a given algorithm is compared. Two commonly used comparative measures are clustering entropy and clustering accuracy.

1. The clustering entropy method

Suppose S = {X₁, X₂, …, X_n} is a data set, T = {T₁, T₂, …, T_m} is the ideal benchmark clustering given by experts, and C = {C₁, C₂, …, C_k} is the clustering of S produced by some algorithm. Then for a cluster C_i, its clustering entropy relative to the benchmark clustering T is defined as
$$E(C_i|T)=-\sum_{j=1}^m\frac{|C_i\cap T_j|}{|C_i|}\log_2\frac{|C_i\cap T_j|}{|C_i|}\tag{10-20}$$

The overall clustering entropy of C relative to the benchmark T is defined as the weighted average of the clustering entropies of all clusters C_i relative to T, namely
$$E(C)=\frac{1}{\sum_{i=1}^k|C_i|}\sum_{i=1}^k|C_i|\times E(C_i|T)\tag{10-21}$$

The clustering entropy method holds that the smaller the value of E(C), the higher the quality of the clustering C relative to the benchmark T.
Note that the denominator ∑_{i=1}^k |C_i| on the right-hand side of formula (10-21) is the sum of the numbers of elements in the clusters and cannot in general be replaced by n. Only when C is a partition of S does the denominator equal n; for common clustering methods such as DBSCAN, some objects may be left out as noise, so the denominator may be less than n.
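Equations (10-20) and (10-21) are easy to compute when clusters and benchmark classes are represented as sets of object identifiers. A minimal sketch under that representation (the function name and data layout are our choices, not from the text); note that the denominator is the sum of cluster sizes, per the remark above:

```python
import math

def clustering_entropy(clusters, benchmark):
    """Overall clustering entropy E(C) relative to a benchmark T,
    per Eqs. (10-20)/(10-21). Both arguments are lists of sets
    of object ids; the 0*log(0) terms are skipped by convention."""
    total = sum(len(c) for c in clusters)   # sum of |C_i|, not necessarily n
    e = 0.0
    for c in clusters:
        e_i = 0.0
        for t in benchmark:
            p = len(c & t) / len(c)
            if p > 0:
                e_i -= p * math.log2(p)     # Eq. (10-20)
        e += len(c) * e_i                   # |C_i|-weighted sum, Eq. (10-21)
    return e / total

T = [{1, 2, 3}, {4, 5}]
print(clustering_entropy([{1, 2, 3}, {4, 5}], T))  # 0.0: perfect agreement
print(clustering_entropy([{1, 2, 4}, {3, 5}], T))  # > 0: mixed clusters
```

A clustering that matches the benchmark exactly has entropy 0; mixing benchmark classes inside clusters raises it.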

2. Clustering accuracy

The basic idea of clustering accuracy (precision) evaluation is to take the largest category within a cluster as the cluster's category label. That is, for a cluster C_i, if there exists a T_j such that |C_i ∩ T_j| = max{|C_i ∩ T_1|, |C_i ∩ T_2|, ⋯, |C_i ∩ T_m|}, then the category of C_i is taken to be T_j. Accordingly, the accuracy of cluster C_i relative to the benchmark T is defined as
$$J(C_i|T)=\frac{\max\{|C_i\cap T_1|,|C_i\cap T_2|,\cdots,|C_i\cap T_m|\}}{|C_i|}\tag{10-22}$$

The overall accuracy of C relative to the benchmark T is defined as the weighted average of the accuracies of all clusters C_i relative to T, namely
$$J(C)=\frac{1}{\sum_{i=1}^k|C_i|}\sum_{i=1}^k|C_i|\times J(C_i|T)\tag{10-23}$$

The clustering accuracy method holds that the larger the value of J(C), the higher the quality of the clustering C relative to the benchmark T.
In addition, 1 − J(C) is generally called the overall error rate of C relative to the benchmark T. A large clustering accuracy J(C), or equivalently a small overall error rate 1 − J(C), indicates that the clustering algorithm does a better job of assigning objects from different categories to different clusters, i.e., that the clustering accuracy is high.
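Clustering accuracy per Eqs. (10-22) and (10-23) can be sketched the same way. One simplification worth noting: since J(C_i|T) = max_j |C_i ∩ T_j| / |C_i|, the |C_i|-weighted average collapses to the total count of majority-class objects divided by the total number of clustered objects. The set-based representation is again our assumption for illustration:

```python
def clustering_accuracy(clusters, benchmark):
    """Overall clustering accuracy J(C) per Eqs. (10-22)/(10-23).
    The |C_i|-weighted average of max_j |C_i ∩ T_j| / |C_i| reduces
    to (sum of majority-class counts) / (sum of |C_i|)."""
    total = sum(len(c) for c in clusters)
    hits = sum(max(len(c & t) for t in benchmark) for c in clusters)
    return hits / total

T = [{1, 2, 3}, {4, 5}]
print(clustering_accuracy([{1, 2, 3}, {4, 5}], T))  # 1.0: perfect match
print(clustering_accuracy([{1, 2, 4}, {3, 5}], T))  # 0.6, error rate 0.4
```

In the second call, the majority counts are 2 and 1 out of 5 clustered objects, giving J(C) = 0.6 and an overall error rate of 0.4.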

(3) Internal quality evaluation

Internal quality evaluation has no known external benchmark; it uses only the data set S and the clustering C itself, evaluating the quality of C from the intrinsic characteristics and quantities of the clusters. That is, the clustering effect is generally evaluated by computing the average intra-cluster similarity, the average inter-cluster similarity, or the overall similarity.
Internal quality evaluation is tied to the clustering algorithm. Clustering validity indices are mainly used to evaluate the quality of a clustering or to determine the optimal number of clusters. The ideal clustering has the smallest intra-cluster distances and the largest inter-cluster distances, so clustering validity is generally measured by some form of ratio between intra-cluster and inter-cluster distance. Commonly used indices of this type include the CH index, the Dunn index, the I index, and the Xie-Beni index.

1. The CH index

The CH index is short for the Calinski-Harabasz index. It first computes the sum of squared distances between the points in each cluster and their cluster center to measure intra-cluster compactness, and the sum of squared distances between each cluster center and the center of the data set to measure the separation of the data set; the CH index is the ratio of separation to compactness.
Let X̄_i denote the center (mean) of cluster C_i, let X̄ denote the center of the data set S, and let d(X̄_i, X̄) be some distance function from X̄_i to X̄. Then the intra-cluster compactness of the clustering C is defined as
$$\text{Trace}(A)=\sum_{i=1}^k\sum_{X_j\in C_i}d(X_j,\overline{X}_i)^2\tag{10-24}$$

That is, Trace(A) is the sum of squared distances between the points of the clustering C and their cluster centers. The separation of the clustering C is defined as
$$\text{Trace}(B)=\sum_{i=1}^k|C_i|\,d(\overline{X}_i,\overline{X})^2\tag{10-25}$$

That is, Trace(B) is the weighted sum of squared distances from each cluster center of the clustering C to the center of the data set S.
From this, if N = ∑_{i=1}^k |C_i|, the CH index can be defined as
$$V_{\text{CH}}(k)=\frac{\text{Trace}(B)/(k-1)}{\text{Trace}(A)/(N-k)}\tag{10-26}$$

Formula (10-26) is generally used in the following two situations:
(1) Evaluating which of the clusterings obtained by two algorithms is better.
Suppose two algorithms perform cluster analysis on the data set S and obtain two different clusterings, both containing k clusters. Then the clustering corresponding to the larger CH value is better, because a larger CH value means that the clusters are internally more compact and more dispersed from one another.
(2) Evaluating which of two clusterings with different numbers of clusters, obtained by the same algorithm, is better.
Suppose an algorithm performs cluster analysis on the data set S and obtains two clusterings with k₁ and k₂ clusters respectively. Then the clustering result with the larger CH value is better, which means that its number of clusters is more appropriate. Therefore, by applying formula (10-26) repeatedly, we can also obtain the optimal number of clusters for clustering the data set S.
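Equations (10-24) through (10-26) can be sketched directly in NumPy. This is an illustrative implementation assuming Euclidean distance for d(·,·) (the text leaves the distance function open); the function name `ch_index` is ours.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz index V_CH(k) per Eq. (10-26),
    using squared Euclidean distance as d(.,.)^2."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    k, N = len(ids), len(X)
    overall = X.mean(axis=0)                # center of the data set S
    trace_a = trace_b = 0.0
    for i in ids:
        pts = X[labels == i]
        center = pts.mean(axis=0)           # cluster center (mean)
        trace_a += ((pts - center) ** 2).sum()                  # Eq. (10-24)
        trace_b += len(pts) * ((center - overall) ** 2).sum()   # Eq. (10-25)
    return (trace_b / (k - 1)) / (trace_a / (N - k))            # Eq. (10-26)

# Two tight, well-separated blobs score far higher than a bad 2-way split.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(ch_index(X, [0, 0, 1, 1]))   # 200.0: compact, well-separated
print(ch_index(X, [0, 1, 0, 1]))   # 0.02: clusters straddle both blobs
```

For Euclidean distance this should agree with scikit-learn's `calinski_harabasz_score`, which computes the same quantity.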

2. The Dunn index

The Dunn index uses the minimum inter-cluster distance d_s(C_i, C_j) between clusters C_i and C_j to measure inter-cluster separation, and the largest cluster diameter among all clusters, max{Φ(C₁), Φ(C₂), …, Φ(C_k)}, to characterize intra-cluster compactness. The Dunn index is the minimum over all cluster pairs of the ratio of the former to the latter, namely
$$V_D(k)=\min_{i\neq j}\frac{d_s(C_i,C_j)}{\max\{\varPhi(C_1),\varPhi(C_2),\ldots,\varPhi(C_k)\}}\tag{10-27}$$

The larger the Dunn value, the farther apart the clusters are and the better the corresponding clustering. Like the CH index, the Dunn index can be used to evaluate the quality of clusterings obtained by different algorithms, and also to evaluate which of the clusterings with different numbers of clusters obtained by the same algorithm is better; that is, it can be used to find the optimal number of clusters for S.
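Equation (10-27) can be sketched similarly. The assumptions here are ours: d_s is taken as the single-link (minimum pairwise) distance between clusters and Φ as the maximum pairwise distance within a cluster, both Euclidean, since the text does not fix these choices.

```python
import itertools
import numpy as np

def dunn_index(X, labels):
    """Dunn index V_D(k) per Eq. (10-27): minimum single-link
    inter-cluster distance over the largest cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = [X[labels == i] for i in np.unique(labels)]

    def pair_dists(a, b):
        # all pairwise Euclidean distances between two point sets
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    diam = max(pair_dists(g, g).max() for g in groups)  # max Φ(C_i)
    sep = min(pair_dists(a, b).min()                    # min d_s(C_i, C_j)
              for a, b in itertools.combinations(groups, 2))
    return sep / diam

X = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(dunn_index(X, [0, 0, 1, 1]))   # 10.0: separation 10, diameter 1
```

With these choices, the well-separated 2-cluster split of the four points gives separation 10 and diameter 1, hence a Dunn value of 10.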

6. Outlier mining

Outliers are special data in a data set that deviate significantly from the majority of the data. The classification and clustering algorithms introduced earlier focus on discovering regular patterns that apply to most of the data, so many data mining algorithms try to reduce or eliminate the influence of outliers, discarding or ignoring them as noise. In many practical applications, however, one suspects that an outlier's deviation is not caused by random factors but may arise from an entirely different mechanism, which needs to be mined for dedicated analysis and use. For example, in application areas such as security management and risk control, patterns that identify outliers are more valuable than patterns of normal data.

(1) Overview of related issues

The word "outlier" is sometimes also rendered as "anomaly", and it has many aliases in different application settings, such as isolated point, abnormal point, novelty, deviation point, exception point, noise, and abnormal data. In the literature, outlier mining likewise appears under similar terms such as anomaly mining, anomaly detection, outlier data mining, exception mining, and rare-event mining.

1. How outliers arise

(1) Data anomalies caused by fraud, intrusion, disease outbreaks, unusual experimental results, and the like. For example, a person's average phone bill is about 200 yuan but suddenly rises to several thousand yuan in a given month; a credit card normally carries about 5,000 yuan of spending per month, but in a certain month the spending exceeds 30,000 yuan. Such outliers are usually the relatively interesting ones in data mining and are one of the key points of its application.
(2) Outliers caused by inherent changes in the data variables, reflecting natural characteristics of the data distribution, such as climate change, new customer purchasing patterns, and genetic mutations. These are also among the areas of interest.
(3) Errors in data measurement and collection, mainly due to human error, measurement equipment failure, or noise. For example, a student's score of −100 in a certain course may be caused by a default value set by a program; the salary of a company's top manager being far higher than that of ordinary employees may look like an outlier but is in fact reasonable data.

2. The outlier mining problem

Usually, the outlier mining problem can be decomposed into three subproblems.
(1) Defining outliers
Since outliers are closely tied to the practical problem at hand, clearly defining what kind of data counts as an outlier or abnormal data is the premise and primary task of outlier mining. In general, the experience and knowledge of domain experts must be drawn on to give an accurate description or definition of the outliers.
(2) Mining outliers
Once outliers are clearly defined, the main task of outlier mining is choosing an algorithm that effectively identifies or mines the outliers so defined. Outlier mining algorithms usually present suspicious outlier data to the user in terms of the patterns that the data can reflect, so as to draw the user's attention.
(3) Understanding outliers
Reasonable explanation and understanding of the mining results, and guidance for their practical application, are the goals of outlier mining. Since the mechanism that generates outliers is uncertain, whether the "outliers" detected by a mining algorithm actually correspond to real abnormal behavior cannot be settled by the algorithm itself; it can only be understood and interpreted by industry or domain experts.

3. The relativity of outliers

Outliers are special data in a data set that clearly deviate from the majority of the data, but "clearly" and "majority" are relative notions; that is, an outlier is only an outlier in a relative sense. Therefore, several issues need to be considered when defining and mining outliers.
(1) Global or local outliers
A data object may be an outlier relative to its local neighborhood but not relative to the entire data set. For example, a student who is 1.9 meters tall is an outlier in Class 1 of our school's mathematics department, but not among the people of the whole country, which includes professional players such as Yao Ming.
(2) The number of outliers
Although the number of outliers is unknown, the number of normal points should far exceed the number of outliers. That is, outliers should make up only a small proportion of a large data set, typically less than 5% or even less than 1%.
(3) The outlier factor
One should not report whether an object is an outlier with a bare "yes" or "no". Instead, the object's degree of deviation, i.e., its outlier factor or outlier score, should be used to characterize how far a datum deviates from the group. Objects whose outlier factor exceeds a certain threshold are then filtered out and handed to decision makers or domain experts for understanding and interpretation, to be applied in practical work.

(2) Distance-based methods

1. Basic concepts

Definition 10-11. Given a positive integer k, the k-nearest-neighbor distance of an object X is the distance d_k(X) satisfying the following conditions:
(1) apart from X itself, there are at least k objects Y satisfying d(X, Y) ≤ d_k(X);
(2) apart from X itself, there are at most k − 1 objects Y satisfying d(X, Y) < d_k(X);
di dalam d(X,Y)D(X,kamu) adalah sebuah objek Bahasa Indonesia: XXXDan Y Ykamubeberapa fungsi jarak di antara mereka.

The larger an object's $k$-nearest-neighbor distance, the more likely that object is far from the majority of the data, so its $k$-nearest-neighbor distance $d_k(X)$ can be used as its outlier factor.

Definition 10-12 Let $D(X,k)=\{Y \mid d(X,Y)\le d_k(X) \wedge Y\ne X\}$; then $D(X,k)$ is called the $k$-nearest-neighborhood of $X$.

From Definition 10-12, $D(X,k)$ is the set of objects $Y$ whose distance from the center $X$ does not exceed $d_k(X)$. Note in particular that $X$ does not belong to its own $k$-nearest-neighborhood, i.e. $X \notin D(X,k)$.
Also note that, because of ties in distance, the number of objects contained in the $k$-nearest-neighborhood $D(X,k)$ may far exceed $k$.

Definition 10-13 Given a positive integer $k$, the $k$-nearest-neighbor outlier factor of an object $X$ is defined as
$$\text{OF}_1(X,k)=\frac{\sum_{Y\in D(X,k)}d(X,Y)}{|D(X,k)|}\tag{10-28}$$
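Definitions 10-11 to 10-13 translate directly into code. The following is a minimal Python sketch (our own, not from the book): it assumes the dataset is a list of coordinate tuples, uses the Euclidean distance via `math.dist`, and the names `d_k`, `neighborhood` and `of1` are ours.

```python
import math

def d_k(X, points, k):
    """k-nearest-neighbor distance of X (Definition 10-11):
    the k-th smallest distance from X to the other points."""
    dists = sorted(math.dist(X, Y) for Y in points if Y != X)
    return dists[k - 1]

def neighborhood(X, points, k):
    """k-nearest-neighborhood D(X, k) (Definition 10-12): all Y != X
    with d(X, Y) <= d_k(X); ties can make it larger than k."""
    r = d_k(X, points, k)
    return [Y for Y in points if Y != X and math.dist(X, Y) <= r]

def of1(X, points, k):
    """Outlier factor OF1(X, k), formula (10-28): the mean distance
    from X to the members of its k-nearest-neighborhood."""
    D = neighborhood(X, points, k)
    return sum(math.dist(X, Y) for Y in D) / len(D)
```

With the five points whose coordinates are quoted in Example 10-12, `of1((6, 8), pts, 2)` reproduces the factor 3.54 obtained there by hand.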

2. Algorithm description

For a given dataset and nearest-neighbor count $k$, we can use the formula above to compute the $k$-nearest-neighbor outlier factor of every object and sort the factors in descending order. The objects with the larger outlier factors are the most likely outliers. In general, these objects still need to be analyzed and judged by decision makers or domain experts to determine which points are genuine outliers.

Algorithm 10-8 Distance-based outlier detection algorithm
Input: dataset $S$, nearest-neighbor count $k$
Output: list of suspected outliers and their outlier factors, in descending order
(1) REPEAT
(2) take an unprocessed object $X$ from $S$
(3) determine the $k$-nearest-neighborhood $D(X,k)$ of $X$
(4) compute the $k$-nearest-neighbor outlier factor $\text{OF}_1(X,k)$ of $X$
(5) UNTIL every point in $S$ has been processed
(6) sort by $\text{OF}_1(X,k)$ in descending order and output $(X,\text{OF}_1(X,k))$
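The REPEAT/UNTIL loop above can be sketched as a single Python function (our own sketch, under the same assumptions as before: a list of coordinate tuples and Euclidean distance):

```python
import math

def distance_outliers(S, k):
    """Algorithm 10-8 sketch: compute OF1(X, k) for every X in S and
    return (X, OF1) pairs sorted by the factor in descending order."""
    result = []
    for X in S:
        dists = sorted((math.dist(X, Y), Y) for Y in S if Y != X)
        dk = dists[k - 1][0]                      # d_k(X), Definition 10-11
        D = [d for d, _ in dists if d <= dk]      # distances to D(X, k)
        result.append((X, sum(D) / len(D)))       # formula (10-28)
    return sorted(result, key=lambda t: t[1], reverse=True)
```

The descending order at the end matches step (6); the decision of which top-ranked points are genuine outliers is still left to a human, as the text notes.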

3. Computational example

Example 10-12 A two-dimensional dataset $S$ with 11 points is given in Table 10-10. Let $k=2$ and use the Euclidean distance to compute the outlier factors of $X_7$, $X_{10}$, $X_{11}$ and all the other points.

[Table 10-10: dataset $S$]
Solution: To understand the principle of the algorithm intuitively, the data objects of $S$ are displayed on the plane in Figure 10-27 below.

[Figure 10-27: the data objects of $S$ plotted on the plane]
The outlier factors of the designated points, relative to the other points, are computed below.

(1) Compute the outlier factor of object $X_7$
As the figure shows, the point nearest to $X_7=(6,8)$ is $X_{10}=(5,7)$, with $d(X_7,X_{10})=1.41$; the other comparatively near points are $X_{11}=(5,2)$, $X_9=(3,2)$ and $X_8=(2,4)$.
Computation gives $d(X_7,X_{11})=6.08$, $d(X_7,X_9)=6.71$ and $d(X_7,X_8)=5.66$.
Since $k=2$, we have $d_2(X_7)=5.66$, so by Definition 10-11, $D(X_7,2)=\{X_{10},X_8\}$.
By formula (10-28), the outlier factor of $X_7$ is
$$\text{OF}_1(X_7,2)=\frac{\sum_{Y\in D(X_7,2)}d(X_7,Y)}{|D(X_7,2)|}=\frac{d(X_7,X_{10})+d(X_7,X_8)}{2}=\frac{1.41+5.66}{2}=3.54$$
(2) Compute the outlier factor of object $X_{10}$: $\text{OF}_1(X_{10},2)=2.83$

(3) Compute the outlier factor of object $X_{11}$: $\text{OF}_1(X_{11},2)=2.5$

(4) Compute the outlier factor of object $X_5$: $\text{OF}_1(X_5,2)=1$

The outlier factors of the remaining objects can be computed in the same way; see Table 10-11 below.
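The arithmetic above can be checked mechanically. The sketch below uses only the coordinates actually quoted in Example 10-12 (the remaining rows of Table 10-10 are not reproduced here):

```python
import math

# Coordinates quoted in Example 10-12.
X7, X8, X9, X10, X11 = (6, 8), (2, 4), (3, 2), (5, 7), (5, 2)

# Distances used in the worked example, rounded to two decimals.
assert round(math.dist(X7, X10), 2) == 1.41
assert round(math.dist(X7, X11), 2) == 6.08
assert round(math.dist(X7, X9), 2) == 6.71
assert round(math.dist(X7, X8), 2) == 5.66

# OF1(X7, 2): mean distance to the 2-nearest-neighborhood {X10, X8}.
of1_x7 = (math.dist(X7, X10) + math.dist(X7, X8)) / 2
assert round(of1_x7, 2) == 3.54
```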

[Table 10-11: outlier factors of the objects of $S$, in descending order]
4. Outlier-factor threshold

According to the $k$-nearest-neighbor theory, the larger the outlier factor, the more likely the point is an outlier, so a threshold must be determined to separate outliers from normal points. The simplest method is to specify the number of outliers in advance, but this is too crude: it sometimes misses genuine outliers, or flags too many normal points as possible outliers, making it difficult for domain experts or decision makers to understand and interpret the result.
(1) The outlier-factor segmentation threshold method first arranges the outlier factors in descending order, and at the same time renumbers the data objects with ascending serial numbers following that ranking.
(2) Taking the outlier factor $\text{OF}_1(X,k)$ as the ordinate and its serial number as the abscissa, the points (serial number, $\text{OF}_1$ value) are marked on the plane and connected to form a non-increasing polyline. The point where the sharp decline of the polyline turns into a gentle decline is located, and the corresponding outlier factor is taken as the threshold: objects whose factor is less than or equal to this threshold are normal, while the remaining objects are likely outliers.
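The book locates this knee visually on the polyline. As one possible mechanical approximation (our own heuristic, not the book's prescription), the threshold can be taken as the factor immediately after the largest single drop in the descending sequence:

```python
def segmentation_threshold(factors):
    """Heuristic sketch of the segmentation-threshold idea: with the
    outlier factors sorted in descending order, find the largest drop
    between consecutive factors and return the factor just after it.
    Points with factors above the returned threshold are suspects."""
    fs = sorted(factors, reverse=True)
    drops = [fs[i] - fs[i + 1] for i in range(len(fs) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return fs[i + 1]
```

On a factor list loosely modeled on Table 10-11 (only the top four values 3.54, 2.83, 2.5, 1.27 are from the text; the tail values are illustrative), this returns the 1.27 chosen in Example 10-13.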

Example 10-13 For the dataset $S$ of Example 10-12, whose outlier factors are listed in descending order with serial numbers in Table 10-11, find the outlier threshold using the segmentation threshold method.

Solution: First mark the points (serial number, $\text{OF}_1$ value) on the plane and connect them with a polyline, as shown in Figure 10-28 below.

[Figure 10-28: polyline of outlier factors versus serial number]
Inspecting Figure 10-28, we find that the polyline to the left of the fourth point (4, 1.27) drops very sharply, while to its right it descends gently. Therefore the outlier factor 1.27 is chosen as the threshold. Since the outlier factors of $X_7$, $X_{10}$ and $X_{11}$ are 3.54, 2.83 and 2.5 respectively, all greater than 1.27, these three points are most likely outliers, while the remaining points are normal.
Looking back at Figure 10-27, we can see that $X_7$, $X_{10}$ and $X_{11}$ are indeed far from the majority of the objects on the left, so treating them as outliers of the dataset $S$ is reasonable.

5. Algorithm evaluation

The greatest advantage of the distance-based outlier detection method is that its principle is simple and it is easy to use. Its shortcomings are mainly the following.
(1) There is no simple, effective method for choosing the parameter $k$, and no universally accepted analysis of how sensitive the detection results are to $k$.
(2) Its time complexity is $O(|S|^2)$, so it does not scale to large datasets.
(3) Because a global outlier-factor threshold is used, it is difficult to mine outliers in datasets containing regions of different density.

(3) Method based on relative density

The distance-based method detects outliers globally, but it cannot handle datasets whose regions have different densities, i.e. it cannot detect outliers that are only outlying relative to their local density. When a dataset contains several density distributions, or is a mixture of subsets with different densities, global detection methods such as the distance-based one usually perform poorly, because whether an object is an outlier depends not only on its distance to the surrounding data but also on the density of its neighborhood.

1. The concept of relative density

From the viewpoint of neighborhood density, outliers are objects lying in low-density regions. It is therefore necessary to introduce the concepts of local neighborhood density and relative density of an object.

Definition 10-14 (1) The $k$-nearest-neighbor local density (density) of an object $X$ is defined as
$$\text{dsty}(X,k)=\frac{|D(X,k)|}{\sum_{Y\in D(X,k)}d(X,Y)}\tag{10-29}$$
(2) The $k$-nearest-neighbor local relative density (relative density) of an object $X$ is defined as
$$\text{rdsty}(X,k)=\frac{\sum_{Y\in D(X,k)}\text{dsty}(Y,k)/|D(X,k)|}{\text{dsty}(X,k)}\tag{10-30}$$
where $D(X,k)$ is the $k$-nearest-neighborhood of the object $X$ (given in Definition 10-12), and $|D(X,k)|$ is the number of objects in that set.
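Formulas (10-29) and (10-30) can be sketched in Python as follows (our own sketch, reusing the neighborhood of Definition 10-12 and assuming a list of coordinate tuples with Euclidean distance):

```python
import math

def neighborhood(X, points, k):
    """k-nearest-neighborhood D(X, k) of Definition 10-12."""
    dists = sorted(math.dist(X, Y) for Y in points if Y != X)
    dk = dists[k - 1]                       # d_k(X), Definition 10-11
    return [Y for Y in points if Y != X and math.dist(X, Y) <= dk]

def dsty(X, points, k):
    """Local density, formula (10-29): |D(X, k)| divided by the summed
    distances from X to the members of its neighborhood."""
    D = neighborhood(X, points, k)
    return len(D) / sum(math.dist(X, Y) for Y in D)

def rdsty(X, points, k):
    """Relative density, formula (10-30): the average density of the
    neighbors of X divided by the density of X itself."""
    D = neighborhood(X, points, k)
    avg = sum(dsty(Y, points, k) for Y in D) / len(D)
    return avg / dsty(X, points, k)
```

On a toy set of four clustered points plus one distant point, the relative density of a cluster-core point comes out close to 1 while the distant point's is much larger, matching the remark in the algorithm description below.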

2. Algorithm description

Taking $\text{rdsty}(X,k)$ as the outlier factor $\text{OF}_2(X,k)$, the computation is divided into two steps:
(1) given the neighbor count $k$, compute the $k$-nearest-neighbor local density $\text{dsty}(X,k)$ of every object $X$;
(2) compute the average density of the nearest neighbors of $X$, and from it the $k$-nearest-neighbor local relative density $\text{rdsty}(X,k)$.
A dataset usually consists of several natural clusters. The relative density of an object close to a cluster core is close to 1, while objects at the edge of a cluster or outside the clusters have comparatively large relative density. Hence the larger the relative density value, the more likely the object is an outlier.

Algorithm 10-9 Relative-density-based outlier detection algorithm
Input: dataset $S$, nearest-neighbor count $k$
Output: list of suspected outliers and their outlier factors, in descending order
(1) REPEAT
(2) take an unprocessed object $X$ from $S$
(3) determine the $k$-nearest-neighborhood $D(X,k)$ of $X$
(4) use $D(X,k)$ to compute the density $\text{dsty}(X,k)$ of $X$
(5) UNTIL every point in $S$ has been processed
(6) REPEAT
(7) take an unprocessed object $X$ from $S$
(8) compute the relative density $\text{rdsty}(X,k)$ of $X$ and assign it to $\text{OF}_2(X,k)$
(9) UNTIL every object in $S$ has been processed
(10) sort by $\text{OF}_2(X,k)$ in descending order and output $(X,\text{OF}_2(X,k))$
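The two passes of Algorithm 10-9 can be sketched compactly (our own sketch under the same assumptions as before: distinct coordinate tuples and Euclidean distance):

```python
import math

def relative_density_outliers(S, k):
    """Algorithm 10-9 sketch: OF2(X, k) = relative density of X,
    returned for every X in descending order of the factor."""
    def nbhd(X):
        ds = sorted(math.dist(X, Y) for Y in S if Y != X)
        dk = ds[k - 1]
        return [Y for Y in S if Y != X and math.dist(X, Y) <= dk]

    # Pass 1: local densities, formula (10-29).
    dens = {X: len(nbhd(X)) / sum(math.dist(X, Y) for Y in nbhd(X))
            for X in S}
    # Pass 2: relative densities, formula (10-30), used as OF2.
    of2 = {X: (sum(dens[Y] for Y in nbhd(X)) / len(nbhd(X))) / dens[X]
           for X in S}
    return sorted(of2.items(), key=lambda t: t[1], reverse=True)
```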

Example 10-14 For the two-dimensional dataset $S$ given in Example 10-12 (see Table 10-10), let $k=2$ and use the Euclidean distance to compute the relative-density-based outlier factors of $X_7$, $X_{10}$, $X_{11}$ and the other objects.

[Table 10-10: dataset $S$]
Solution: Since $k=2$, we need the 2-nearest-neighbor local densities of all objects.

(1) Find the 2-nearest-neighborhood $D(X_i,2)$ of each data object in Table 10-10
Using the same computation method as in Example 10-12, we obtain
$D(X_1,2)=\{X_2,X_3,X_5\}$  $D(X_2,2)=\{X_1,X_6\}$  $D(X_3,2)=\{X_1,X_4\}$
$D(X_4,2)=\{X_3,X_5\}$  $D(X_5,2)=\{X_1,X_4,X_6,X_9\}$  $D(X_6,2)=\{X_2,X_5,X_8\}$
$D(X_7,2)=\{X_{10},X_8\}$  $D(X_8,2)=\{X_2,X_6\}$  $D(X_9,2)=\{X_5,X_4,X_6\}$
$D(X_{10},2)=\{X_7,X_8\}$  $D(X_{11},2)=\{X_9,X_5\}$

(2) Compute the local density $\text{dsty}(X_i,2)$ of each data object.

① Compute the density of $X_1$
Since $D(X_1,2)=\{X_2,X_3,X_5\}$, computation gives $d(X_1,X_2)=1$, $d(X_1,X_3)=1$, $d(X_1,X_5)=1$.
By formula (10-29), we obtain:
$$\text{dsty}(X_1,2)=\frac{|D(X_1,2)|}{\sum_{Y\in D(X_1,2)} d(X_1,Y)}=\frac{3}{1+1+1}=1$$

② Compute the density of $X_2$
Since $D(X_2,2)=\{X_1,X_6\}$, computation gives $d(X_2,X_1)=1$, $d(X_2,X_6)=1$.
By formula (10-29), we obtain:
$$\text{dsty}(X_2,2)=\frac{|D(X_2,2)|}{\sum_{Y\in D(X_2,2)} d(X_2,Y)}=\frac{2}{1+1}=1$$

The local densities of the remaining data objects can be computed in the same way; see Table 10-12 below.

[Table 10-12]
(3) Compute the relative density $\text{rdsty}(X_i,2)$ of each object $X_i$ and take it as the outlier factor $\text{OF}_2$.
① Compute the relative density of $X_1$
Using the density values of each object in Table 10-12 and the relative density formula (10-30):
$$\text{rdsty}(X_1,2)=\frac{\sum_{Y\in D(X_1,2)}\text{dsty}(Y,2)\,/\,|D(X_1,2)|}{\text{dsty}(X_1,2)}=\frac{(1+1+1)/3}{1}=1=\text{OF}_2(X_1,2)$$

② Similar calculations give the relative densities of $X_2, X_3, \dots, X_{11}$.
For example, the relative density of $X_5$:
$$\text{rdsty}(X_5,2)=\frac{\sum_{Y\in D(X_5,2)}\text{dsty}(Y,2)\,/\,|D(X_5,2)|}{\text{dsty}(X_5,2)}=\frac{(1+1+1+0.79)/4}{1}=0.95=\text{OF}_2(X_5,2)$$
The results are summarized in Table 10-13 below.
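The density and relative-density computations above are easy to sketch in code. The following is an illustrative implementation, not the book's listing; the function names `knn_set`, `dsty`, and `rdsty` are ours, and we follow the convention visible in the worked example that ties at the $k$-th nearest distance stay in $D(X_i,k)$ (which is why $|D(X_1,2)|=3$ even though $k=2$).

```python
import math

def knn_set(data, i, k):
    """Indices of the k nearest neighbors of point i; ties at the k-th
    nearest distance are kept, matching the D(X_i, k) sets in the text."""
    dists = sorted((math.dist(data[i], data[j]), j)
                   for j in range(len(data)) if j != i)
    kth = dists[k - 1][0]                      # k-th nearest distance
    return [j for d, j in dists if d <= kth]

def dsty(data, i, k):
    """k-nearest-neighbor local density, as in formula (10-29):
    |D(X_i,k)| divided by the sum of distances to the neighbors."""
    nbrs = knn_set(data, i, k)
    return len(nbrs) / sum(math.dist(data[i], data[j]) for j in nbrs)

def rdsty(data, i, k):
    """Relative density, as in formula (10-30); equals the outlier
    factor OF_2: mean neighbor density over the point's own density."""
    nbrs = knn_set(data, i, k)
    mean_nbr_density = sum(dsty(data, j, k) for j in nbrs) / len(nbrs)
    return mean_nbr_density / dsty(data, i, k)
```

On evenly spaced points `rdsty` stays close to 1, while an isolated point gets a value well above 1, which is exactly the behavior of $\text{OF}_2$ in the tables above.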

[Table 10-13]
Example 10-15 Given the dataset shown in Table 10-14, use the Euclidean distance with $k=2,3,5$ to compute each point's $k$-nearest-neighbor local density, its $k$-nearest-neighbor local relative density (the outlier factor $\text{OF}_2$), and its $k$-nearest-neighbor distance-based outlier factor $\text{OF}_1$.

[Table 10-14]
Solution: (1) For ease of understanding, the relative positions of the points of $S$ can be marked on a two-dimensional plane (Figure 10-30).

[Figure 10-30]
(2) Using the distance-based and relative-density-based Algorithms 10-8 and 10-9, compute for each object the $k$-nearest-neighbor local density $\text{dsty}$, the $k$-nearest-neighbor local relative density (outlier factor $\text{OF}_2$), and the $k$-nearest-neighbor distance-based outlier factor $\text{OF}_1$; the results are summarized in Table 10-15.

[Table 10-15]
(3) Brief analysis
① As Figure 10-30 shows, $X_{15}$ and $X_{16}$ are two obvious outliers in $S$, and both the distance-based and the relative-density-based methods mine them well.
② In this example the two algorithms are not as sensitive to $k$ as expected, probably because the separation of the outliers $X_{15}$ and $X_{16}$ from the other objects is so clear.
③ As Table 10-15 shows, whether $k$ is 2, 3, or 5, the $\text{dsty}$ values in the region of $X_1$ are much lower than those in the region of $X_7$, which is consistent with the region densities shown in Figure 10-30. The relative-density values $\text{OF}_2$ of the two regions, however, show almost no noticeable difference. This is determined by the nature of relative density: for uniformly distributed data points, the relative density of a core point is 1 regardless of the spacing between points.
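The robustness to $k$ observed in ① and ② can be reproduced on synthetic data. In the sketch below, $\text{OF}_1$ is taken as the average distance to the $k$ nearest neighbors, one common distance-based definition consistent with the chapter's use; the dataset is our own invention (two dense blobs plus two far-away points, echoing $X_{15}$/$X_{16}$), not Table 10-14.

```python
import math

def of1(data, i, k):
    """Distance-based outlier factor OF_1: the average distance from
    point i to its k nearest neighbors (a common definition)."""
    dists = sorted(math.dist(data[i], data[j])
                   for j in range(len(data)) if j != i)
    return sum(dists[:k]) / k

# Two tight blobs plus two isolated points (indices 8 and 9).
data = [(x, y) for x in (0, 1) for y in (0, 1)] + \
       [(x, y) for x in (10, 11) for y in (10, 11)] + \
       [(30, 30), (-20, 25)]

for k in (2, 3, 5):
    scores = [of1(data, i, k) for i in range(len(data))]
    top2 = sorted(sorted(range(len(data)), key=lambda i: scores[i])[-2:])
    print(k, top2)   # the two isolated points rank highest for every k
```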

7. Other clustering methods

1. Improvements of clustering algorithms

  (1) The k-modes algorithm was proposed because the k-means algorithm is restricted to numeric attributes; it achieves fast clustering of discrete (categorical) data. Since the k-modes algorithm uses simple 0-1 matching to compute the distance between two values of the same discrete attribute, it weakens the differences between ordinal attribute values, i.e., it cannot fully reflect the difference between two values of the same ordinal attribute, so there is still room for improvement.
  (2) The k-prototypes algorithm combines the advantages of the k-means and k-modes algorithms and can cluster datasets with both discrete and numeric attributes (so-called mixed attributes). For the discrete attributes it uses the k-modes method to compute the distance $d_1(X,Y)$ between objects $X$ and $Y$; for the numeric attributes it uses the k-means method to compute the distance $d_2(X,Y)$; finally it takes the weighted combination $\alpha d_1(X,Y)+(1-\alpha)d_2(X,Y)$ as the distance $d(X,Y)$ between the two objects, where $\alpha\in[0,1]$ is a weight coefficient, usually $\alpha=0.5$.
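The weighted mixed distance described above can be sketched in a few lines. This is an illustrative sketch, not a full k-prototypes implementation: the function name and the use of squared Euclidean distance for the numeric part are our assumptions (the text does not fix the exact form of $d_2$).

```python
def mixed_distance(x, y, num_idx, cat_idx, alpha=0.5):
    """k-prototypes-style distance: alpha*d1 + (1-alpha)*d2, where d1 is
    0-1 mismatch over categorical attributes (the k-modes part) and d2 is
    squared Euclidean over numeric attributes (the k-means part).
    num_idx/cat_idx list which tuple positions are numeric/categorical."""
    d1 = sum(x[i] != y[i] for i in cat_idx)        # categorical mismatches
    d2 = sum((x[i] - y[i]) ** 2 for i in num_idx)  # numeric part
    return alpha * d1 + (1 - alpha) * d2
```

For example, `mixed_distance(("red", 1.0, 2.0), ("blue", 1.0, 4.0), num_idx=[1, 2], cat_idx=[0])` combines one categorical mismatch with a numeric distance of 4, giving 2.5 at the default $\alpha=0.5$.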
(3) The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm is a comprehensive hierarchical clustering method. It uses Clustering Features (CF) and a Clustering Feature tree (CF tree, similar to a B-tree) to summarize a cluster $C_i$, where $\text{CF}_i=(n_i,\text{LS}_i,\text{SS}_i)$ is a triple: $n_i$ is the number of objects in the cluster, $\text{LS}_i$ is the linear sum of the components of the $n_i$ objects, and $\text{SS}_i$ is the sum of squares of the components of the $n_i$ objects.
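What makes the CF triple useful is that it is additive: merging two clusters just adds their triples, which is what lets a CF tree summarize clusters incrementally without revisiting the raw points. A minimal sketch (function names are ours):

```python
def cf(points):
    """Clustering Feature CF = (n, LS, SS) of a set of d-dimensional
    points: count, per-component linear sum, per-component sum of squares."""
    n = len(points)
    d = len(points[0])
    ls = tuple(sum(p[i] for p in points) for i in range(d))
    ss = tuple(sum(p[i] ** 2 for p in points) for i in range(d))
    return n, ls, ss

def cf_merge(a, b):
    """CFs are additive: the CF of the union of two disjoint clusters is
    the component-wise sum of their CF triples."""
    (na, lsa, ssa), (nb, lsb, ssb) = a, b
    return (na + nb,
            tuple(x + y for x, y in zip(lsa, lsb)),
            tuple(x + y for x, y in zip(ssa, ssb)))
```

From a CF alone one can recover, e.g., the cluster centroid as $\text{LS}_i/n_i$, so the tree nodes never need to store member points.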
(4) The CURE (Clustering Using REpresentatives) algorithm is another improvement of the k-means algorithm. Many clustering algorithms are only good at clustering spherical clusters, while some others are too sensitive to isolated points. To solve these two problems, CURE abandons the traditional approach of the k-means algorithm, which uses a cluster center, and of the k-center algorithm, which uses a single specific object to represent a cluster; instead it uses several representative objects within a cluster to represent it, so that it can adapt to clustering non-spherical clusters and reduce the impact of noise on the clustering.
(5) The ROCK (RObust Clustering using linKs) algorithm is a clustering algorithm proposed for datasets with binary or categorical attributes.
(6) The OPTICS (Ordering Points To Identify the Clustering Structure) algorithm was designed to reduce the DBSCAN algorithm's sensitivity to its density parameters $(\varepsilon,\text{MinPts})$. It does not explicitly produce result clusters; instead it produces an augmented cluster ordering for cluster analysis (for example, a chart with the reachability distance on the vertical axis and the output order of the sample points on the horizontal axis). This ordering represents the density-based clustering structure of all sample points, and from it we can obtain the DBSCAN clustering result for any density parameters $(\varepsilon,\text{MinPts})$.
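The augmented ordering can be sketched compactly. The following is an unoptimized illustration, not a production OPTICS: conventions vary on whether a point counts itself toward MinPts (here the neighbor set excludes the point itself), and the function name is ours. Plotting the returned reachability values against the ordering gives the reachability plot described above; valleys correspond to clusters, and `inf` marks the start of a new density-connected region.

```python
import math, heapq

def optics(data, eps, min_pts):
    """Return the augmented cluster ordering as a list of
    (point index, reachability distance); inf = not density-reachable."""
    n = len(data)
    d = lambda i, j: math.dist(data[i], data[j])
    def neighbors(i):
        return [j for j in range(n) if j != i and d(i, j) <= eps]
    def core_dist(i, nbrs):
        if len(nbrs) < min_pts - 1:          # not a core point
            return None
        return sorted(d(i, j) for j in nbrs)[min_pts - 2]
    reach = [math.inf] * n
    seen = [False] * n
    order = []
    for start in range(n):
        if seen[start]:
            continue
        heap = [(math.inf, start)]           # priority queue of seeds
        while heap:
            _, p = heapq.heappop(heap)
            if seen[p]:                      # skip stale heap entries
                continue
            seen[p] = True
            order.append((p, reach[p]))
            nbrs = neighbors(p)
            cd = core_dist(p, nbrs)
            if cd is None:
                continue
            for o in nbrs:                   # relax reachability of seeds
                if not seen[o]:
                    new_r = max(cd, d(p, o))
                    if new_r < reach[o]:
                        reach[o] = new_r
                        heapq.heappush(heap, (new_r, o))
    return order
```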

2. Other new clustering methods

New theories and techniques can be used to design new clustering methods.

(1) Grid-based clustering methods
Grid-based methods quantize the object space into a finite number of cells that form a grid structure; the positions of the split points in each dimension are stored in an array, and the split lines run through the entire space. All clustering operations are performed on this grid structure (i.e., on the quantized space). The main advantage of this approach is its very fast processing speed, which is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space; the price paid is the accuracy of the clustering result. Because grid clustering algorithms face the problem of choosing the quantization scale, we usually first search for clusters starting from small cells, then gradually increase the cell size, and repeat this process until satisfactory clusters are found.
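The grid idea can be illustrated with a toy 2-D sketch: quantize the points into square cells, keep only cells that are dense enough, and join orthogonally adjacent dense cells into clusters (a flood fill over the grid). All names, the 2-D restriction, and the adjacency rule are our illustrative choices; `cell_size` is exactly the quantization scale discussed above.

```python
from collections import defaultdict

def grid_clusters(points, cell_size, density_threshold):
    """Toy grid clustering: quantize 2-D points into cells, keep cells
    with at least density_threshold points, merge adjacent dense cells."""
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell_size), int(p[1] // cell_size))].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= density_threshold}
    clusters, unvisited = [], set(dense)
    while unvisited:                      # flood fill over dense cells
        stack = [unvisited.pop()]
        comp = []
        while stack:
            cx, cy = stack.pop()
            comp.extend(cells[(cx, cy)])
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in unvisited:
                    unvisited.remove(nb)
                    stack.append(nb)
        clusters.append(comp)
    return clusters
```

Note that the runtime after quantization depends on the number of cells, not on the number of points, which is the speed advantage described above; points in sparse cells are simply discarded.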

(2) Model-based clustering methods
Model-based methods assume a model for each cluster and find the data that best fits the given model. A model-based method tries to optimize the fit between the given data and certain data models, locating the clusters by constructing a density function that reflects the spatial distribution of the samples.

(3) Clustering methods based on fuzzy sets
In practice, most objects have no strict, crisp assignment to a single cluster; their membership values and forms are intermediate or uncertain, which makes soft partitioning appropriate. Because fuzzy cluster analysis can describe the degree of membership of a sample and thus objectively reflects the real world, it has become one of the hot spots in current cluster analysis research.
Fuzzy clustering algorithms are unsupervised learning methods based on the theory of fuzzy mathematics; they are uncertain clustering methods. Since fuzzy clustering was proposed it has received great attention from the academic community; fuzzy clustering is a large clustering "family", and research on it is very active.
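The best-known member of this family is fuzzy c-means (FCM), which makes the idea of soft partitioning concrete: every point gets a membership degree in every cluster, and the memberships of each point sum to 1. The sketch below is a minimal illustration, not a specific algorithm from this chapter; `fcm`, the fuzzifier default `m=2`, and the `init` parameter are our choices.

```python
import math

def fcm(data, c, m=2.0, iters=50, init=None):
    """Minimal fuzzy c-means: returns (centers, u) where u[i][j] in [0,1]
    is the membership of point i in cluster j and each row of u sums to 1.
    init: optional initial centers (defaults to the first c points)."""
    centers = [tuple(p) for p in (init if init else data[:c])]
    for _ in range(iters):
        # membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        u = []
        for x in data:
            dists = [max(math.dist(x, v), 1e-12) for v in centers]
            u.append([1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in dists)
                      for di in dists])
        # center update: weighted mean of all points with weights u_ij^m
        new_centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(len(data))]
            s = sum(w)
            new_centers.append(tuple(
                sum(w[i] * data[i][k] for i in range(len(data))) / s
                for k in range(len(data[0]))))
        centers = new_centers
    return centers, u
```

Unlike hard k-means, every point influences every center here, just with a weight that decays with distance; this is the "soft partition" the paragraph above describes.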

(4) Clustering methods based on rough sets
Rough clustering is an uncertain clustering method based on rough set theory. From the perspective of how rough sets are coupled with clustering algorithms, rough clustering methods can be divided into two categories: strongly coupled and weakly coupled rough clustering.
Of course, there are more new research directions in cluster analysis than these. For example, data stream mining and clustering algorithms, uncertain data and their clustering algorithms, and quantum computing and quantum genetic clustering algorithms are clustering technologies that have emerged in recent years and are cutting-edge research topics.

3. Other outlier mining methods

The outlier mining methods introduced earlier are just two representatives of outlier mining. Many more mature outlier mining methods exist in practical applications; they can be classified by the type of technique used or by the degree to which prior knowledge is exploited.

(1) By type of technique used
They mainly include statistical methods, distance-based methods, density-based methods, clustering-based methods, deviation-based methods, depth-based methods, wavelet-transform-based methods, graph-based methods, pattern-based methods, neural network methods, and so on.

(2) By use of prior knowledge
Depending on the availability of information about the normal class or the outliers, there are three common approaches:
① Unsupervised outlier detection, i.e., the dataset contains no prior knowledge such as class labels;
② Supervised outlier detection, i.e., the characteristics of outliers are extracted from a training set that contains both outliers and normal points;
③ Semi-supervised outlier detection, where the training data contains objects labeled as normal but no information about outlier objects.