CTGAN for Credit Analysis Synthetic Data
NOTE: Also published in KDnuggets.
This article demonstrates generating synthetic data from a larger credit analysis data set (approx. 10,000 rows).
CTGAN and other generative AI models can create synthetic tabular data for ML training, data augmentation, testing, privacy-preserving sharing, and more.
Credit analysis data contains client data in both continuous and discrete/categorical formats. For this demonstration, I pre-processed the data by removing rows with null values and dropping a few columns that were not needed. Running all rows and all columns would require more computing power than I have available. The categorical and continuous variables used are listed below (discrete values such as the count of children, CNT_CHILDREN, are treated as categorical variables):
Categorical Variables:
TARGET
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
CNT_CHILDREN
Continuous Variables:
AMT_INCOME_TOTAL
AMT_CREDIT
AMT_ANNUITY
AMT_GOODS_PRICE
Generative models generally need a large amount of clean training data to produce good results. However, because of limited computing power, I selected only about 10,000 rows (9,993 to be exact) from the more than 300,000 rows of real data. Although this is a relatively small sample, it should be sufficient for this demonstration.
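The pre-processing described above can be sketched with pandas roughly as follows. This is a minimal sketch, not my exact script: the helper name `prepare_subset`, the random sampling, the order of operations (drop nulls first, then subsample), and the file name in the usage comment are all assumptions made for illustration.

```python
import pandas as pd

# Column lists taken from the article.
CATEGORICAL = ["TARGET", "NAME_CONTRACT_TYPE", "CODE_GENDER",
               "FLAG_OWN_CAR", "FLAG_OWN_REALTY", "CNT_CHILDREN"]
CONTINUOUS = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE"]


def prepare_subset(df: pd.DataFrame, n_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Keep the ten columns, drop rows with nulls, and keep at most n_rows rows."""
    subset = df[CATEGORICAL + CONTINUOUS].dropna()
    if len(subset) > n_rows:
        # Assumption: a random subsample; the article does not say how rows were chosen.
        subset = subset.sample(n=n_rows, random_state=seed)
    return subset.reset_index(drop=True)


# Usage (hypothetical file name for the downloaded Kaggle CSV):
# real = prepare_subset(pd.read_csv("application_data.csv"))
```

Dropping nulls before subsampling keeps the final row count predictable; doing it the other way round would explain ending up slightly below the 10,000 rows requested.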
Location of the Real Data:
https://www.kaggle.com/datasets/kapoorshivam/credit-analysis
Location of the Generated Synthetic Data:
https://www.kaggle.com/datasets/drrayislam/synthetic-credit-analysis-data-by-ctgan
https://www.researchgate.net/publication/369826197_Synthetic_Tabular_Data_Set_Generated_by_CTGAN
DOI: 10.13140/RG.2.2.23275.82728
Results
I generated about 10,000 (9,997 to be exact) synthetic data points and compared them to the real data. The results look good, although there is still room for improvement. I used the default parameters, with ‘relu’ as the activation function, and trained for 3,000 epochs. Increasing the number of epochs should produce more realistic synthetic data. The generator and discriminator losses also look good. Lower losses loosely suggest closer similarity between the synthetic and real data, although GAN losses are only a rough proxy for sample quality:
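For readers who want to reproduce a run like this, here is a minimal sketch that assumes the open-source `ctgan` Python package (`pip install ctgan`). The package choice and the helper name `fit_and_sample` are assumptions for illustration; the linked gist remains the reference for the code actually used.

```python
import pandas as pd

# Columns treated as categorical in the article.
DISCRETE_COLUMNS = ["TARGET", "NAME_CONTRACT_TYPE", "CODE_GENDER",
                    "FLAG_OWN_CAR", "FLAG_OWN_REALTY", "CNT_CHILDREN"]


def fit_and_sample(real: pd.DataFrame, discrete_columns, epochs: int = 3000,
                   n_samples: int = 10_000) -> pd.DataFrame:
    """Train a CTGAN model on `real` and return `n_samples` synthetic rows."""
    # Fail early if a declared discrete column is not in the data.
    missing = [c for c in discrete_columns if c not in real.columns]
    if missing:
        raise ValueError(f"discrete columns not in data: {missing}")

    # Assumption: the `ctgan` package. Imported here so that the column
    # check above works even when the package is not installed.
    from ctgan import CTGAN

    model = CTGAN(epochs=epochs, verbose=True)  # other parameters left at defaults
    model.fit(real, discrete_columns)
    return model.sample(n_samples)


# Usage: synthetic = fit_and_sample(real, DISCRETE_COLUMNS)
```

With 3,000 epochs, training on roughly 10,000 rows can take a long time on a CPU, so a smaller `epochs` value is a reasonable first smoke test.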
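In the Absolute Log Mean and Standard Deviation diagram, the dots lie close to the diagonal line, which means the means and standard deviations of the synthetic columns closely match those of the real columns and indicates that the quality of the generated data is good.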
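The quantities behind such a diagram can be computed directly. The sketch below is one plausible way to do it, assuming base-10 logarithms of the absolute column means and of the standard deviations; the function name `log_mean_std` is hypothetical.

```python
import numpy as np
import pandas as pd


def log_mean_std(real: pd.DataFrame, synth: pd.DataFrame, columns) -> pd.DataFrame:
    """Return log10(|mean|) and log10(std) of each numeric column, real vs. synthetic."""
    rows = {}
    for col in columns:
        rows[col] = {
            "real_log_mean": np.log10(np.abs(real[col].mean())),
            "synth_log_mean": np.log10(np.abs(synth[col].mean())),
            "real_log_std": np.log10(real[col].std()),
            "synth_log_std": np.log10(synth[col].std()),
        }
    # One row per column; points on the diagonal have real == synth values.
    return pd.DataFrame.from_dict(rows, orient="index")
```

Plotting `real_log_mean` against `synth_log_mean` (and likewise for the standard deviations) gives one dot per column; the closer the dots are to the diagonal, the better the match. A column with a mean or standard deviation of exactly zero yields `-inf` and should be excluded.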
The cumulative sums in the following figures for continuous columns are not exactly overlapping, but they are close, which indicates good generation of synthetic data and absence of overfitting. The overlap in categorical/discrete data suggests that the synthetic data generated is near-real. Further statistical analyses are presented in the following figures:
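To put a number on how closely the curves overlap, one option is the largest gap between the two empirical cumulative distributions for continuous columns (the Kolmogorov-Smirnov statistic) and the total variation distance for categorical columns. These metrics are my suggestion rather than what the figures themselves show, and the function names are hypothetical.

```python
import numpy as np
import pandas as pd


def ks_distance(real, synth) -> float:
    """Largest vertical gap between the two empirical CDFs (0 = identical)."""
    real = np.sort(np.asarray(real, dtype=float))
    synth = np.sort(np.asarray(synth, dtype=float))
    grid = np.concatenate([real, synth])  # evaluate both CDFs at every observed value
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_synth = np.searchsorted(synth, grid, side="right") / len(synth)
    return float(np.max(np.abs(cdf_real - cdf_synth)))


def total_variation(real, synth) -> float:
    """Half the summed absolute difference between category frequencies (0 = identical)."""
    p = pd.Series(real).value_counts(normalize=True)
    q = pd.Series(synth).value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)  # categories missing on one side count as 0
    return float(0.5 * (p - q).abs().sum())
```

Both values range from 0 to 1. For example, `ks_distance(real["AMT_CREDIT"], synthetic["AMT_CREDIT"])` close to 0 would correspond to nearly overlapping cumulative curves.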
The following correlation diagram shows noticeable correlations between the variables. It is important to note that even after thorough fine-tuning, there may be variations in properties between real and synthetic data. These differences can actually be beneficial, as they may reveal hidden properties within the dataset that can be leveraged to create novel solutions. It has been observed that increasing the number of epochs leads to improvements in the quality of synthetic data.
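A simple numeric companion to a correlation diagram is the absolute difference between the real and synthetic correlation matrices. The sketch below assumes Pearson correlation on the continuous columns; the function names are hypothetical.

```python
import numpy as np
import pandas as pd


def correlation_gap(real: pd.DataFrame, synth: pd.DataFrame, columns) -> pd.DataFrame:
    """Element-wise absolute difference between the two Pearson correlation matrices."""
    return (real[columns].corr() - synth[columns].corr()).abs()


def mean_offdiag(gap: pd.DataFrame) -> float:
    """Average gap over off-diagonal entries (needs at least two columns)."""
    mask = ~np.eye(len(gap), dtype=bool)  # ignore the diagonal, which is always 0
    return float(gap.to_numpy()[mask].mean())
```

A `mean_offdiag` value near 0 means the synthetic data preserves the pairwise relationships between columns; large individual entries in the gap matrix point to the column pairs where the generator deviates most.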
The summary statistics of the synthetic sample and the real data are also satisfactorily close.
Conclusion
The training process of CTGAN is expected to converge to a point where the generated synthetic data becomes indistinguishable from the real data. However, in reality, convergence cannot be guaranteed. Several factors can affect the convergence of CTGAN, including the choice of hyperparameters, the complexity of the data, and the architecture of the models. Furthermore, the instability of the training process can lead to mode collapse, where the generator produces only a limited set of similar samples instead of exploring the full diversity of the data distribution.
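A quick sanity check for mode collapse is to see whether the synthetic data still covers the categories present in the real data and whether it contains many repeated rows. The sketch below is an illustrative heuristic, not a formal test, and the function names are hypothetical.

```python
import pandas as pd


def category_coverage(real: pd.DataFrame, synth: pd.DataFrame, columns) -> dict:
    """Per column, the fraction of real categories that also appear in the synthetic data."""
    coverage = {}
    for col in columns:
        real_cats = set(real[col].unique())
        synth_cats = set(synth[col].unique())
        coverage[col] = len(real_cats & synth_cats) / len(real_cats)
    return coverage


def duplicate_rate(synth: pd.DataFrame) -> float:
    """Fraction of synthetic rows that exactly repeat an earlier row."""
    return float(synth.duplicated().mean())
```

Coverage well below 1.0 for a column, or a high duplicate rate, suggests the generator is producing a limited set of similar samples and that training should be revisited.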
https://gist.github.com/rayisl78/2b26bb0d3f673fac097d67a06965e67f
About the Author:
Dr. Ray Islam (Mohammad R Islam) is a Data Scientist (AI and ML) and Advisory Specialist Leader at Deloitte, USA. He holds a PhD in Engineering from the University of Maryland, College Park, MD, USA and has worked with major companies like Lockheed Martin and Raytheon, serving clients such as NASA and the US Air Force. Ray also has an MASc in Engineering from Canada, an MSc in International Marketing, and an MBA from the UK. He is also the Editor-in-Chief of the upcoming peer-reviewed International Research Journal of Ethics for AI (INTJEAI), and his research interests include generative AI, augmented reality, XAI, and ethics in AI.