Synthetic Data Generation for Enhancing Text Classification Performance Using Conditional Variational Autoencoders
Abstract
This study investigates the effect of generating synthetic data with a Conditional Variational Autoencoder (CVAE) on classification performance in scenarios where the amount of available data is limited or the data sources are constrained. Experiments were conducted on datasets with varying numbers of classes, and synthetic data were produced with CVAE models through two different methods. The first method generated sentences from noise, starting from samples drawn from a Gaussian distribution. The second method provided the first half of a real sentence to the model, which then completed the remaining half to produce a synthetic sentence. The synthetic datasets generated by both methods were mixed into the original training sets at various ratios, and the resulting changes in classification performance were observed. Both synthetic data generation approaches significantly improved classification performance; however, as the amount of data used to train the classifiers increased, the marginal benefit of incorporating synthetic data decreased. These findings suggest that producing and utilizing synthetic data can be an effective strategy for text classification tasks that suffer from data scarcity.
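The two generation modes described above can be sketched in a minimal toy form. This is not the authors' implementation: `decode_step` is a hypothetical stand-in for a trained CVAE decoder, and the vocabulary and latent dimension are illustrative assumptions; the sketch only shows the control flow of (1) generating a full sentence from Gaussian noise conditioned on a class label and (2) completing the second half of a real sentence.

```python
import random

# Toy vocabulary standing in for the real dataset's vocabulary.
VOCAB = ["the", "market", "team", "won", "rose", "today", "sharply", "<eos>"]

def sample_latent(dim=16):
    # Method 1 starts from Gaussian noise: z ~ N(0, 1).
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def decode_step(z, label, prefix):
    # Hypothetical decoder stub: deterministically picks the next token
    # from the latent code, the class label, and the tokens so far.
    h = sum(z) + label + len(prefix)
    return VOCAB[int(abs(h) * 1000) % len(VOCAB)]

def generate_from_noise(label, max_len=12):
    # Method 1: decode a whole sentence from noise, conditioned on the label.
    z, sent = sample_latent(), []
    while len(sent) < max_len:
        tok = decode_step(z, label, sent)
        if tok == "<eos>":
            break
        sent.append(tok)
    return sent

def complete_sentence(real_sentence, label, max_len=12):
    # Method 2: keep the first half of a real sentence and let the
    # decoder produce the remaining half.
    half = real_sentence[: len(real_sentence) // 2]
    z, sent = sample_latent(), list(half)
    while len(sent) < max_len:
        tok = decode_step(z, label, sent)
        if tok == "<eos>":
            break
        sent.append(tok)
    return sent

random.seed(0)
synthetic = generate_from_noise(label=1)
completed = complete_sentence(["the", "team", "won", "today"], label=1)
print(synthetic)
print(completed)
```

In a real CVAE, `decode_step` would be an autoregressive network conditioned on both the latent code and the class label, so the generated sentences inherit the label needed for classifier training.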
Cebeci, Ö. F., Amasyali, M. F. (2024). Synthetic Data Generation for Enhancing Text Classification Performance Using Conditional Variational Autoencoders. *Orclever Proceedings of Research and Development*, 5(1), 498-514. https://doi.org/10.56038/oprd.v5i1.581