Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets

Mohd Suhairi, Md Suhaimin and Mohd Hanafi, Ahmad Hijazi and Moung, Ervin Gubin and Mohd Azwan, Mohamad Hamza (2023) Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets. In: 5th IEEE International Conference on Artificial Intelligence in Engineering and Technology, IICAIET 2023, 12-14 September 2023, Kota Kinabalu. pp. 257-261. (193996). ISBN 979-835030415-2

Abstract

Addressing the problem of language identification in code-mixed datasets poses notable challenges due to data scarcity and high confusability in bilingual contexts. These challenges are further amplified by the associated imbalance and noise characteristic of social media data, complicating efforts to optimize performance. This paper introduces an augmentation approach designed to enhance language identification in bilingual code-mixed social media data. By incorporating reverse translation, semantic similarity, and sampling techniques alongside customized reprocessing strategies, our approach offers a comprehensive solution to these complex issues. To evaluate the effectiveness of the proposed approach, experiments were conducted on language identification at both the sentence and word levels. The results demonstrated the potential of the approach in optimizing language identification performance, offering a compelling combination of generation techniques for addressing the challenges of language identification in code-mixed data.

Item Type: Conference or Workshop Item (Keynote)
Additional Information: Indexed by Scopus
Uncontrolled Keywords: Code-mixing; Data augmentation; Language identification; Social media
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software; T Technology > T Technology (General); T Technology > TA Engineering (General). Civil engineering (General)
Faculty/Division: Faculty of Computing
Depositing User: Mr Muhamad Firdaus Janih@Jaini
Date Deposited: 16 Apr 2024 04:18
Last Modified: 16 Apr 2024 04:18
URI: http://umpir.ump.edu.my/id/eprint/40378