The new efficient and accurate attribute-oriented clustering algorithms for categorical data

Qin, Hongwu (2012) The new efficient and accurate attribute-oriented clustering algorithms for categorical data. PhD thesis, Universiti Malaysia Pahang (Contributors, Thesis advisor: Jasni, Mohamad Zain).

Abstract

Categorical data clustering has attracted much attention recently because much of the data in today’s databases is categorical in nature. Many algorithms for clustering categorical data have been proposed. Among them, the attribute-oriented hierarchical divisive clustering algorithm Min-Min Roughness (MMR) has the highest efficiency but low clustering accuracy; conversely, the genetic clustering algorithm Genetic-Average Normalized Mutual Information (G-ANMI) has the highest clustering accuracy but low efficiency. This work first reveals the significance of attributes in categorical data clustering and then investigates the limitations of MMR and G-ANMI. It correspondingly proposes a new attribute-oriented hierarchical divisive clustering algorithm termed Mean Gain Ratio (MGR) and an improved genetic clustering algorithm termed Improved G-ANMI (IG-ANMI) for categorical data. MGR consists of two steps: selecting the clustering attribute and selecting an equivalence class on the clustering attribute. The information-theoretic concepts of mean gain ratio and entropy of clusters are used to implement these two steps, respectively. MGR can be run with or without specifying the number of clusters, whereas few existing clustering algorithms for categorical data can be run without specifying it. IG-ANMI improves G-ANMI with a new attribute-oriented initialization method in which part of the initial chromosomes is generated from the attribute partitions. Four real-life data sets obtained from the University of California Irvine (UCI) machine learning repository and ten synthetically generated data sets are used to evaluate MGR and IG-ANMI, and four other algorithms are used for comparison.
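The two MGR steps described above can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption: the gain ratio of an attribute is taken as the information gain it yields on another attribute divided by its own entropy, the mean is taken over all other attributes, and the "entropy of clusters" is approximated by the average per-attribute entropy inside each equivalence class. The function names are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, given):
    """H(target | given): entropy of target within each value of given."""
    n = len(target)
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())


def mean_gain_ratio(rows, i):
    """Assumed definition: average gain ratio of attribute i over all others."""
    cols = list(zip(*rows))
    split_info = entropy(cols[i])
    if split_info == 0:  # a constant attribute cannot split the data
        return 0.0
    ratios = [
        (entropy(cols[j]) - conditional_entropy(cols[j], cols[i])) / split_info
        for j in range(len(cols))
        if j != i
    ]
    return sum(ratios) / len(ratios)


def select_clustering_attribute(rows):
    """Step 1: pick the attribute index with the highest mean gain ratio."""
    return max(range(len(rows[0])), key=lambda i: mean_gain_ratio(rows, i))


def select_equivalence_class(rows, attr):
    """Step 2 (assumed): pick the value of attr whose class has lowest entropy."""
    classes = {}
    for row in rows:
        classes.setdefault(row[attr], []).append(row)

    def class_entropy(members):
        cols = list(zip(*members))
        return sum(entropy(c) for c in cols) / len(cols)

    return min(classes, key=lambda v: class_entropy(classes[v]))
```

Under these assumptions, a divisive loop would split off the selected equivalence class as one cluster and repeat both steps on the remaining objects, stopping either at a requested number of clusters or when no attribute yields a positive mean gain ratio.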
The experimental results show that MGR overcomes the limitations of MMR: the average clustering accuracy improves by 19% (from 0.696 to 0.83) while MGR maintains the highest efficiency. IG-ANMI greatly improves the efficiency of G-ANMI (by 31% on the Zoo data set, 74% on the Votes data set, 59% on the Breast Cancer data set, and 3428% on the Mushroom data set) as well as its clustering accuracy (the average clustering accuracy on the four UCI data sets improves by 10.6%, from 0.815 to 0.901), while maintaining the highest clustering accuracy. IG-ANMI has an obvious advantage over G-ANMI on large data sets in terms of both clustering efficiency and clustering accuracy. In addition, both MGR and IG-ANMI scale well: their running times tend to vary linearly with the number of objects as well as the number of clusters.
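The accuracy percentages quoted above appear to be relative improvements over the baseline rather than differences in percentage points. A short check, assuming that reading and using only the accuracy values stated in the abstract (the helper name is hypothetical):

```python
def relative_improvement(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Average clustering accuracy, MMR -> MGR
mgr_gain = relative_improvement(0.696, 0.83)
# Average clustering accuracy on the four UCI data sets, G-ANMI -> IG-ANMI
ig_gain = relative_improvement(0.815, 0.901)

print(f"MGR: {mgr_gain:.1f}%, IG-ANMI: {ig_gain:.1f}%")
```

Both values round to the figures reported in the abstract (19% and 10.6%), which supports the relative-improvement reading.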

Item Type:	Thesis (PhD)
Additional Information:	Thesis (Doctor of Philosophy in Computer Science) -- Universiti Malaysia Pahang - 2012, SV: PROFESSOR DR. JASNI MOHAMAD ZAIN, NO. CD: 6310
Uncontrolled Keywords:	Cluster analysis; Cluster analysis -- Data processing
Faculty/Division:	Faculty of Computer System And Software Engineering
Depositing User:	En. Mohd Ariffin Abdul Aziz
Date Deposited:	20 Feb 2023 07:41
Last Modified:	20 Feb 2023 07:41
URI:	http://umpir.ump.edu.my/id/eprint/37063