Sharifah Sakinah, Syed Abd Mutalib (2023) Improved robust estimator and clustering procedures for multivariate outliers detection. PhD thesis, Universiti Malaysia Pahang Al-Sultan Abdullah (Contributors, Thesis advisor: Siti Zanariah, Satari).
Abstract
Outlier detection in multivariate data has attracted considerable attention because the task becomes more difficult as the number of variables, p, increases. Unlike in the univariate case, visual inspection is insufficient to detect outliers in multivariate data. One approach is to use a distance-based method such as the Mahalanobis distance (MD). However, the sample mean and covariance matrix used in MD are prone to masking and swamping problems, so many studies replace them with robust estimators, and the development of such estimators is ongoing. Although robust estimators can overcome this problem of MD, they are still limited to detecting single-point outliers. Cluster-based methods have therefore been proposed and developed in previous studies to overcome this limitation. The main objective of this study is to propose a robust estimator and use it to develop an improved procedure for detecting outliers in multivariate data using robust clustering-based methods. Firstly, an improved robust estimator that is based on the equality of covariance matrices and is less sensitive to the presence of outliers is proposed and named Test on Covariance (TOC). TOC is developed by modifying the Concentration-Step (C-Step) in the Fast Minimum Covariance Determinant (FMCD) algorithm: a test for the equality of covariance matrices is carried out in this step, and TOC is obtained. Secondly, an improved single linkage robust clustering procedure is developed. The similarity measure used in this procedure is the robust distance based on TOC, named RDT, which robustifies the single linkage clustering. The performance of the proposed robust estimator and clustering procedure in detecting outliers in multivariate data is then investigated using simulation studies and historical datasets.
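As an illustration of the distance-based approach described above, the following is a minimal Python sketch of the squared Mahalanobis distance and of the standard, unmodified C-Step used in FMCD. It is not the TOC estimator: the covariance-equality test that the thesis adds to the C-Step is not specified in this abstract and is not reproduced here. The function names, the random starting subset, and the iteration cap are assumptions made for this example.

```python
import numpy as np


def mahalanobis_sq(X, center, cov):
    """Squared Mahalanobis distance of every row of X from `center`."""
    diff = X - center
    # Solve cov @ z = diff.T rather than forming the explicit inverse.
    return np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))


def c_step(X, subset, h):
    """One standard C-Step: estimate from `subset`, keep the h closest points."""
    center = X[subset].mean(axis=0)
    cov = np.cov(X[subset], rowvar=False)
    d2 = mahalanobis_sq(X, center, cov)
    new_subset = np.sort(np.argsort(d2)[:h])
    return new_subset, center, cov


def concentrate(X, h, rng, max_iter=100):
    """Repeat C-Steps from a random h-subset until the subset stops changing.

    Returns the final subset, the location and scatter computed in the last
    iteration, and the covariance determinant recorded at each iteration.
    """
    n = X.shape[0]
    subset = np.sort(rng.choice(n, size=h, replace=False))
    dets = []
    for _ in range(max_iter):
        new_subset, center, cov = c_step(X, subset, h)
        dets.append(np.linalg.det(cov))
        if np.array_equal(new_subset, subset):
            break
        subset = new_subset
    return subset, center, cov, dets


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: 45 inliers and 5 mean-shifted points in p = 3.
    X = np.vstack([rng.normal(size=(45, 3)),
                   rng.normal(loc=10.0, size=(5, 3))])
    h = (X.shape[0] + X.shape[1] + 1) // 2  # common default subset size
    subset, center, cov, dets = concentrate(X, h, rng)
    robust_d2 = mahalanobis_sq(X, center, cov)
    print(np.round(robust_d2, 1))
```

Each C-Step cannot increase the determinant of the subset covariance, which is why the iteration terminates. A full FMCD run would repeat this from many starting subsets and keep the solution with the smallest determinant.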
In the simulation study, a data generation procedure is formulated in the R language to create synthetic data under three Outlier Scenarios: mean-shift (Outlier Scenario 1), variance-inflation (Outlier Scenario 2), and combined mean-shift and variance-inflation (Outlier Scenario 3). Three measurements are used to assess the effectiveness of the proposed robust estimator and clustering procedure: the probability that all the outliers are successfully detected (pout), the probability that outliers are falsely detected as inliers (pmask), and the probability that inliers are detected as outliers (pswamp). Five historical datasets are also used: Stackloss, Brain and Weight, Bushfire, Hawkins-Bradu-Kass, and Milk. The performance of TOC in detecting outliers is compared with that of existing robust estimators, namely Fast Minimum Covariance Determinant (FMCD), Minimum Vector Variance (MVV), Covariance Matrix Equality (CME) and Index Set Equality (ISE). In the simulation study, TOC gives good results for pswamp in all Outlier Scenarios, indicating that TOC has the lowest probability of misclassifying inliers as outliers among the robust estimators compared. TOC also performs similarly to the other robust estimators in most conditions. When the three measurements are considered simultaneously, TOC is the better estimator for sample sizes n = 30, 50, 100, 200, numbers of variables p = 3, 5, 10, and all percentages of outliers, 1% ≤ ε ≤ 25%. In the historical datasets, TOC is able to detect outliers, shows no masking effect, and performs similarly to the other robust estimators. Meanwhile, the performance of the improved single linkage robust clustering procedure is compared with single linkage using Euclidean distance (ED), Mahalanobis distance (MD), and TOC.
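The three Outlier Scenarios and the three performance measurements can be sketched as follows. This is an illustrative Python version, not the thesis's R procedure. The standard-normal inliers, the shift size `shift`, the inflation factor `inflate`, and the exact way pout, pmask and pswamp are turned into proportions over replications are all assumptions, since the abstract gives only their verbal definitions.

```python
import numpy as np


def generate_scenario(n, p, eps, scenario, rng, shift=5.0, inflate=9.0):
    """Draw n points in p dimensions with a fraction eps of planted outliers.

    scenario 1: mean-shift, 2: variance-inflation, 3: both.
    `shift` and `inflate` are hypothetical settings for this sketch.
    """
    if scenario not in (1, 2, 3):
        raise ValueError("scenario must be 1, 2 or 3")
    n_out = int(round(eps * n))
    X = rng.standard_normal((n, p))
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[:n_out] = True
    if scenario in (2, 3):
        X[is_outlier] *= np.sqrt(inflate)  # variance becomes `inflate`
    if scenario in (1, 3):
        X[is_outlier] += shift             # mean becomes `shift`
    return X, is_outlier


def detection_rates(truths, flags):
    """Summarise detection results over replications.

    truths, flags: sequences of boolean arrays, one pair per replication.
    pout   : share of replications in which every planted outlier is flagged
    pmask  : share of planted outliers that are not flagged
    pswamp : share of inliers that are flagged
    (These operational definitions are an assumption of this sketch.)
    """
    all_found = masked = swamped = total_out = total_in = 0
    for truth, flag in zip(truths, flags):
        truth = np.asarray(truth, dtype=bool)
        flag = np.asarray(flag, dtype=bool)
        missed = truth & ~flag
        all_found += int(not missed.any())
        masked += int(missed.sum())
        swamped += int((~truth & flag).sum())
        total_out += int(truth.sum())
        total_in += int((~truth).sum())
    return {
        "pout": all_found / len(truths),
        "pmask": masked / total_out if total_out else 0.0,
        "pswamp": swamped / total_in if total_in else 0.0,
    }
```

In a simulation loop, each replication would call `generate_scenario`, apply a detector to obtain `flag`, and pass the collected truth and flag arrays to `detection_rates`.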
Based on the simulation study, RDT is the better similarity measure in only a few conditions for pout, pmask and pswamp, and performs similarly to the other similarity measures in most conditions for all Outlier Scenarios. When pout, pmask and pswamp are considered simultaneously for all Outlier Scenarios, RDT is the better similarity measure when n = 50, 100, p = 3, 5 and ε = 5%, 10%, 15%. Moreover, RDT is the better similarity measure when the historical dataset contains 19% outliers, p = 3 and n < 100. The findings from the simulation study and the historical datasets show that neither TOC nor RDT performs well for large sample sizes. TOC is also found to outperform RDT in detecting outliers in multivariate data. This study therefore concludes that TOC is a promising robust estimator and can serve as an alternative to other robust estimators for detecting outliers in multivariate data. RDT can be used as an alternative similarity measure in clustering procedures, including other clustering methods. TOC can be further applied in other multivariate methods such as Principal Component Analysis, Factor Analysis and Discriminant Analysis. Furthermore, the improved single linkage robust clustering procedure in this study can be incorporated with the Minimum Spanning Tree (MST).
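To make the clustering idea concrete, here is a minimal Python sketch of single linkage clustering driven by a Mahalanobis-type pairwise distance. It is a generic illustration rather than the thesis's procedure: the scatter matrix `cov` is a placeholder for whichever estimate is plugged in (the thesis uses TOC to obtain RDT), and the cut height and the rule of flagging everything outside the largest cluster are assumptions of this example. The sketch relies on the fact that cutting a single linkage dendrogram at a given height yields the connected components of the graph that joins points no farther apart than that height.

```python
import numpy as np


def pairwise_mahalanobis(X, cov):
    """Matrix of Mahalanobis-type distances between all pairs of rows of X."""
    inv = np.linalg.inv(cov)
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,kl,ijl->ij", diff, inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))  # guard against tiny negative values


def single_linkage_cut(D, height):
    """Cluster labels from cutting a single linkage tree at `height`.

    Implemented with union-find over all pairs with D[i, j] <= height.
    """
    n = D.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] <= height:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels


def flag_outside_largest_cluster(labels):
    """Flag every point that is not in the largest cluster (assumed rule)."""
    counts = np.bincount(labels)
    return labels != counts.argmax()


if __name__ == "__main__":
    # Hypothetical toy data: four nearby points and one distant point.
    X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.5, 0.2], [10.0, 10.0]])
    D = pairwise_mahalanobis(X, np.eye(2))  # identity scatter = Euclidean
    labels = single_linkage_cut(D, height=1.0)
    print(flag_outside_largest_cluster(labels))
```

With an identity scatter matrix the distance reduces to the Euclidean distance (ED); passing the classical sample covariance gives an MD-based measure, and passing a robust scatter estimate gives a robust distance.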
Item Type: | Thesis (PhD) |
---|---|
Additional Information: | Thesis (Doctor of Philosophy) -- Universiti Malaysia Pahang – 2023, NO. CD: 13420, SV: Dr. Siti Zanariah Satari |
Uncontrolled Keywords: | Mahalanobis distance (MD) |
Subjects: | Q Science > Q Science (General); Q Science > QA Mathematics |
Faculty/Division: | Institute of Postgraduate Studies; Center for Mathematical Science |
Depositing User: | Mr. Nik Ahmad Nasyrun Nik Abd Malik |
Date Deposited: | 06 Jun 2024 02:20 |
Last Modified: | 06 Jun 2024 08:59 |
URI: | http://umpir.ump.edu.my/id/eprint/41478 |