Unveiling the Potential of CTGAN: Harnessing Generative AI & 50K Synthetic IRIS Dataset

Ray Islam, PhD
7 min read · Apr 5, 2023


CTGAN (Conditional Tabular Generative Adversarial Network) is a generative AI algorithm that creates synthetic tabular data resembling real-world data. It uses generator and discriminator neural networks to produce synthetic data with statistical properties similar to the real data, and it can preserve the underlying structure of the real data, including correlations between columns. CTGAN is useful for generating synthetic data for machine learning applications, data privacy, data analysis, and data augmentation. In this pre-print I have generated a large volume of synthetic IRIS data with CTGAN. Feel free to use this data in your projects and report back any areas within the dataset that need improvement.

Parameters:

Parameters of CTGAN depend on the specific implementation and settings chosen by the user. Some of the common parameters include:

· Epochs: Number of times the generator and discriminator networks are trained on the dataset.

· Learning rate: The rate at which the model adjusts the weights during training.

· Batch size: The number of samples used in each training iteration.

· Size of the generator and discriminator networks.

· Choice of optimization algorithm.

Additionally, CTGAN can also take in various hyperparameters, such as the dimensionality of the latent space, the number of layers in the generator and discriminator networks, and the activation functions used in each layer. The choice of parameters and hyperparameters can affect the performance and quality of the generated synthetic data.
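As a sketch, the open-source `ctgan` package exposes many of these knobs directly in its constructor. The argument names below follow that package; the specific values are illustrative choices, not the article's settings:

```python
# Illustrative hyperparameter choices; argument names follow the
# open-source `ctgan` package, other implementations may differ.
hyperparams = {
    "embedding_dim": 128,            # dimensionality of the latent space
    "generator_dim": (256, 256),     # two hidden layers in the generator
    "discriminator_dim": (256, 256), # two hidden layers in the discriminator
    "generator_lr": 2e-4,            # learning rates
    "discriminator_lr": 2e-4,
    "batch_size": 500,               # samples per training iteration
    "epochs": 300,                   # training passes over the data
}

try:
    from ctgan import CTGAN  # pip install ctgan
    model = CTGAN(**hyperparams)
except ImportError:
    model = None  # package not installed; the dict above still documents the knobs
```

The dictionary maps one-to-one onto the parameter list above: latent-space size, network sizes, learning rates, batch size, and epochs.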

Pros:

· Generates synthetic tabular data that has statistical properties similar to the real data, including correlations between different columns.

· Preserves underlying structure of real data.

· The synthetic data generated by CTGAN can be used for a variety of applications, such as data augmentation, data privacy, and data analysis.

· Can handle continuous, discrete, and categorical data.

Cons:

· CTGAN requires a large amount of real tabular data to train the model and generate synthetic data that has similar statistical properties to the real data.

· CTGAN is computationally intensive and may require a significant amount of computing resources.

· The quality of the synthetic data generated by CTGAN may vary depending on the quality of the real data used to train the model.

Validation of CTGANs:

Like other GANs, CTGANs have limitations, such as difficulty in evaluating the quality of the generated synthetic data, particularly for tabular data. While some metrics can be used to evaluate the similarity between the real and synthetic data, it can still be challenging to determine whether the synthetic data accurately represents the underlying patterns and relationships in the real data. Additionally, CTGANs are vulnerable to overfitting and can produce synthetic data that is too similar to the training data, which may limit their ability to generalize to new data. A few validation techniques include:

· Statistical Tests: These tests check the statistical properties of the generated data and compare them to the real data. For example, correlation analysis, the Kolmogorov-Smirnov test, the Anderson-Darling test, and the chi-squared test can be used to compare the distributions of the generated and real data.

· Visualization: Plot histograms, scatterplots, or heatmaps to visualize the similarities and differences between the two datasets.

· Application Testing: Testing the generated synthetic data by using it in real-world applications, such as machine learning models or data analysis. If the synthetic data performs similarly to the real data, then it can be considered valid.
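As a minimal sketch of the statistical-test approach, `scipy.stats.ks_2samp` can compare a real feature column against its synthetic counterpart. The data below is randomly generated stand-in data, not the IRIS columns:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = rng.normal(loc=5.8, scale=0.8, size=150)        # stand-in for a real feature column
synthetic = rng.normal(loc=5.8, scale=0.8, size=5000)  # stand-in for the synthetic column

# Two-sample Kolmogorov-Smirnov test: a large p-value means the test
# cannot distinguish the two empirical distributions.
stat, pvalue = ks_2samp(real, synthetic)
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")
```

In practice this test would be run per feature, comparing each real column against the corresponding synthetic one.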

Case Study

For demonstration purposes, I used the IRIS dataset to generate synthetic data using CTGAN.

About the IRIS Dataset (UCI Machine Learning Repository: Iris Data Set)

The IRIS dataset is one of the most widely used and well-known datasets in machine learning and data science. It was introduced in 1936 by Ronald Fisher to demonstrate the use of linear discriminant analysis for classification tasks. Since then, it has been used in a wide range of applications, including machine learning algorithms, data visualization, and statistical analysis, and it serves as a common benchmark for evaluating the performance of new algorithms and techniques; many papers have been written on analyses of this dataset. It consists of 150 samples of iris flowers, each containing four features (sepal length, sepal width, petal length, and petal width) and a target variable indicating the species of the flower (setosa, versicolor, or virginica). The dataset (Image 1) is widely used in teaching and research.

Image 1: IRIS dataset header

Python Code

Code location: https://github.com/rayisl78/generativeAI/blob/main/ctgans_iris_tabulardata_synthetic.ipynb
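A condensed sketch of the notebook's pipeline: load the 150-row IRIS dataset, fit CTGAN, and sample synthetic rows. The epoch counts and 50K sample size come from this article; the `ctgan` API usage is an assumption about the implementation, not a copy of the notebook:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the 150-row IRIS dataset into a DataFrame with a readable species column.
iris = load_iris(as_frame=True)
data = iris.frame
data["species"] = iris.target_names[iris.target]
data = data.drop(columns="target")

def generate_synthetic(data, epochs=10_000, n_rows=50_000):
    """Fit CTGAN on the real table and sample synthetic rows.

    Assumes the `ctgan` package is installed (pip install ctgan); the article
    trains for 10,000 and 100,000 epochs and samples 50,000 rows.
    """
    from ctgan import CTGAN
    model = CTGAN(epochs=epochs)
    model.fit(data, discrete_columns=["species"])  # species is categorical
    return model.sample(n_rows)
```

A call such as `synthetic = generate_synthetic(data)` would then reproduce the experiment's shape: 150 real rows in, 50,000 synthetic rows out.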

Generated synthetic Data Location:

Tools Used

Google Colab

Results

We used 150 rows of real data to generate 50,000 rows of synthetic data. This is a bold move, given that generative algorithms typically require large amounts of training data; 150 observations is not large, yet it still produced satisfactory results. There is potential for improvement through fine-tuning the model and other appropriate actions. One of the challenges we faced was the small volume of training data, which can limit performance. We used default parameters and the ‘relu’ activation function for CTGAN. As shown in the following analysis, increasing the number of epochs improved the results. I included model analysis results from both 10,000 and 100,000 epochs. The synthetic data became closer to the real data as the number of epochs increased. Generator loss and discriminator loss also look good (Image 2):

Image 2: Generator and discriminator loss

Figures 1 and 2 show that the Absolute Log Mean and Standard Deviation values fall along the diagonal line, indicating that the quality of the generated data is good.

Figure 1: Absolute Log Mean and Standard Deviations of Numeric Data (10,000 Epochs)
Figure 2: Absolute Log Mean and Standard Deviations of Numeric Data (100,000 Epochs)

The cumulative sums of sepal length, sepal width, petal length, and petal width in Figure 3 and Figure 4 are not exactly overlapping, but they are close. By fine-tuning the model and increasing the volume of training data, we might be able to achieve better results. The overlap in the species distributions indicates that the synthetic data is close to the real data. Some of the statistical analyses are presented below in Figure 3 through Figure 8:

Figure 3: Cumulative Sums per Feature (10,000 Epochs)
Figure 4: Cumulative Sums per Feature (100,000 Epochs)
Figure 5: Distribution of Features (10,000 Epochs)
Figure 6: Distribution of Features (100,000 Epochs)
Figure 7: Principal Component Analysis (10,000 Epochs)
Figure 8: Principal Component Analysis (100,000 Epochs)
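The cumulative-sum comparison behind Figures 3 and 4 can be sketched as follows; the two columns here are randomly generated stand-ins for a real and a synthetic feature, not the actual IRIS values:

```python
import numpy as np

rng = np.random.default_rng(1)
real = np.sort(rng.normal(3.5, 0.4, size=150))       # stand-in for a real feature
synthetic = np.sort(rng.normal(3.5, 0.4, size=150))  # stand-in for the synthetic feature

# Cumulative sums over the sorted feature values; closely overlapping
# curves suggest the synthetic feature tracks the real one.
real_cum = np.cumsum(real)
syn_cum = np.cumsum(synthetic)

# A simple closeness score: the largest gap between the two curves,
# relative to the final cumulative sum.
gap = np.max(np.abs(real_cum - syn_cum) / real_cum.max())
print(f"max relative gap = {gap:.3f}")
```

Plotting `real_cum` and `syn_cum` against the sample index gives the kind of paired curves shown in Figures 3 and 4.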

Based on the correlation diagrams (Figures 9, 10, and 11), there is a noticeable correlation between the variables. However, there is still room for improvement by following the steps outlined above. It is important to note that even after thorough fine-tuning, there may be variations in properties between real and synthetic data. These differences can actually be beneficial, as they may reveal hidden properties within the dataset that can be leveraged to create novel solutions. It has been observed that increasing the number of epochs leads to improvements.

Figure 9: Correlation Among Variables
Figure 10: Correlation Among Variables (Synthetic Data, 10,000 Epochs)
Figure 11: Correlation Among Variables (Synthetic Data, 100,000 Epochs)
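A sketch of how such a correlation comparison can be computed with pandas; the two correlated columns below are randomly generated for illustration, standing in for the IRIS features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=150)
real = pd.DataFrame({
    "sepal length": base + rng.normal(scale=0.3, size=150),
    "petal length": 2 * base + rng.normal(scale=0.3, size=150),
})
# Stand-in "synthetic" table: the real table plus small noise.
synthetic = real + rng.normal(scale=0.1, size=real.shape)

# Compare Pearson correlation matrices; small absolute differences mean
# the synthetic data preserves the inter-column structure.
diff = (real.corr() - synthetic.corr()).abs()
print(diff.round(3))
```

The same element-wise difference of `DataFrame.corr()` matrices, drawn as a heatmap, yields the kind of comparison shown in Figures 9 through 11.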

The summary statistics of both the synthetic data and the real data (as shown in Image 3) also appear satisfactory and acceptable.

Image 3: Summary Statistics of Real Data and Synthetic Data (100,000 Epochs)

Note:

1. The generated synthetic dataset included a few negative values in Petal Width, which is physically impossible, as a width cannot be negative. I removed these values; this was the only modification made to the synthetic data.

2. In theory, the training process of GANs should converge to a point where the generated synthetic data is indistinguishable from the real data. However, in practice, convergence is not guaranteed, and the training process can be very challenging. Several factors can impact the convergence of GANs, including the choice of hyperparameters, the complexity of the data, and the architecture of the models. Additionally, the instability of the training process can result in mode collapse, where the generator produces only a limited set of similar samples, instead of exploring the full diversity of the data distribution.

3. The author performed all the analyses and generated the tables, as well as capturing all of the images.
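The cleanup described in note 1 amounts to a one-line filter. A sketch with pandas, using a toy table and an assumed column name:

```python
import pandas as pd

# Toy synthetic sample with one impossible negative width, for illustration.
synthetic = pd.DataFrame({
    "petal width (cm)": [0.2, 1.3, -0.1, 2.0],
    "species": ["setosa", "versicolor", "setosa", "virginica"],
})

# Drop rows with a physically impossible negative petal width.
cleaned = synthetic[synthetic["petal width (cm)"] >= 0].reset_index(drop=True)
print(len(cleaned))  # 3 rows remain
```

An alternative design choice would be clipping negatives to zero rather than dropping the rows, but dropping avoids inventing values the generator never produced.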

About the Author:

Dr. Ray Islam (Mohammad R Islam) is a Data Scientist (AI and ML) and Advisory Specialist Leader at Deloitte, USA. He holds a PhD in Engineering from the University of Maryland, College Park, MD, USA, and has worked with major companies like Lockheed Martin and Raytheon, serving clients such as NASA and the US Air Force. Ray also has a MASc in Engineering from Canada, an MSc in International Marketing, and an MBA from the UK. He is also the Editor-in-Chief of the upcoming peer-reviewed International Research Journal of Ethics for AI (INTJEAI), and his research interests include generative AI, augmented reality, XAI, and ethics in AI.

Link: https://blog.umd.edu/rayislam/
