CTGAN for Credit Analysis Synthetic Data

Ray Islam, PhD
April 10, 2023


Image by freepik.com

NOTE: Also published in KDnuggets.

This article is about generating synthetic data from a larger credit analysis data set (approximately 10,000 rows).

CTGAN and other generative AI models can create synthetic tabular data for ML training, data augmentation, testing, privacy-preserving sharing, and more.

Credit analysis data contains client data in continuous and discrete/categorical formats. For demonstration purposes, I pre-processed the data by removing rows with null values and dropping a few columns that were not needed, since training on all rows and columns would require more computation power than I have available. Here is the list of continuous and categorical variables (discrete values such as Count of Children (CNT_CHILDREN) are treated as categorical variables):

Categorical Variables:

TARGET

NAME_CONTRACT_TYPE

CODE_GENDER

FLAG_OWN_CAR

FLAG_OWN_REALTY

CNT_CHILDREN

Continuous Variables:

AMT_INCOME_TOTAL

AMT_CREDIT

AMT_ANNUITY

AMT_GOODS_PRICE

Generative models require a large amount of clean training data for good results. However, due to limited computation power, I selected only about 10,000 rows (9,993 to be precise) from the more than 300,000 rows of real data. Although this number is relatively small, it should be sufficient for the purpose of this demonstration.
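
For reference, here is a minimal preprocessing sketch along these lines (the file name application_data.csv and the random seed are assumptions; the author's full code is in the gist linked at the end of the article):

```python
# Minimal preprocessing sketch (assumptions: the Kaggle file is named
# "application_data.csv" and a fixed random seed is used; adjust to your download).
import pandas as pd

# Columns used in this demonstration
categorical_cols = [
    "TARGET", "NAME_CONTRACT_TYPE", "CODE_GENDER",
    "FLAG_OWN_CAR", "FLAG_OWN_REALTY", "CNT_CHILDREN",
]
continuous_cols = [
    "AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE",
]

# Load the real data, keep only the selected columns, and drop rows with nulls
df = pd.read_csv("application_data.csv", usecols=categorical_cols + continuous_cols)
df = df.dropna()

# Sample roughly 10,000 rows to keep training tractable
real_data = df.sample(n=10_000, random_state=42).reset_index(drop=True)
print(real_data.shape)
```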

Location of the real data: https://www.kaggle.com/datasets/kapoorshivam/credit-analysis

Location of the generated synthetic data: https://www.kaggle.com/datasets/drrayislam/synthetic-credit-analysis-data-by-ctgan

ResearchGate: https://www.researchgate.net/publication/369826197_Synthetic_Tabular_Data_Set_Generated_by_CTGAN (DOI: 10.13140/RG.2.2.23275.82728)

Credit Analysis Data | Image by Author

Results

I generated roughly 10k synthetic data points (9,997 to be exact) and compared them to the real data. The results look good, although there is still room for improvement. In my analysis I used the default parameters, with ReLU as the activation function and 3,000 epochs. Increasing the number of epochs should result in synthetic data that is closer to the real data. The generator and discriminator losses also look good, with lower losses indicating closer similarity between the synthetic and real data:

Generator and discriminator loss | Image by Author
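
For completeness, the snippet below sketches a training and sampling setup consistent with the description above, using the open-source ctgan package (the class is named CTGAN in recent releases, CTGANSynthesizer in older ones). It is an illustrative assumption; the author's exact code is in the gist linked at the end of the article.

```python
# Sketch: training CTGAN with the default architecture (ReLU generator activations)
# for 3000 epochs and sampling ~10k synthetic rows.
from ctgan import CTGAN

# Categorical/discrete columns must be declared explicitly
discrete_columns = [
    "TARGET", "NAME_CONTRACT_TYPE", "CODE_GENDER",
    "FLAG_OWN_CAR", "FLAG_OWN_REALTY", "CNT_CHILDREN",
]

model = CTGAN(epochs=3000, verbose=True)   # verbose=True prints generator/discriminator losses
model.fit(real_data, discrete_columns)     # real_data from the preprocessing step above

synthetic_data = model.sample(10_000)      # generate the synthetic rows
```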

The dots along the diagonal line in the Absolute Log Mean and Standard Deviation diagram indicate that the quality of the generated data is good.

Absolute Log Mean and Standard Deviations of Numeric Data | Image by Author

The cumulative sums of the continuous columns in the following figures do not overlap exactly, but they are close, which indicates good synthetic data generation and an absence of overfitting. The overlap in the categorical/discrete data suggests that the generated synthetic data is near-real. Further statistical analyses are presented in the following figures:

Cumulative Sums per feature | Image by Author
Distribution of Features| Image by Author
Distribution of Features | Image by Author
Principal Component Analysis | Image by Author
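
Diagnostic plots of this kind (absolute log means and standard deviations, cumulative sums per feature, per-feature distributions, PCA) can be reproduced with an evaluation utility such as the table_evaluator package. The sketch below is an assumption about tooling, not necessarily the author's exact workflow:

```python
# Sketch: visual comparison of real vs. synthetic data with table_evaluator
# (assumes real_data and synthetic_data from the earlier steps).
from table_evaluator import TableEvaluator

cat_cols = [
    "TARGET", "NAME_CONTRACT_TYPE", "CODE_GENDER",
    "FLAG_OWN_CAR", "FLAG_OWN_REALTY", "CNT_CHILDREN",
]

evaluator = TableEvaluator(real_data, synthetic_data, cat_cols=cat_cols)
evaluator.visual_evaluation()  # mean/std, cumulative sums, distributions, correlations, PCA
```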

The following correlation diagrams show noticeable correlations between the variables. It is important to note that even after thorough fine-tuning, there may be variations in properties between real and synthetic data. These differences can actually be beneficial, as they may reveal hidden properties of the dataset that can be leveraged to create novel solutions. Increasing the number of epochs has been observed to improve the quality of the synthetic data.

Correlation among variables (Real Data) | Image by Author
Correlation among variables (Synthetic Data) | Image by Author
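
A simple way to produce such a side-by-side comparison is to compute the correlation matrices of the numeric columns with pandas and plot them with seaborn; the snippet below is an illustrative sketch:

```python
# Sketch: correlation matrices of the numeric columns, real vs. synthetic
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE"]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(real_data[numeric_cols].corr(), annot=True, ax=axes[0])
axes[0].set_title("Real data")
sns.heatmap(synthetic_data[numeric_cols].corr(), annot=True, ax=axes[1])
axes[1].set_title("Synthetic data")
plt.tight_layout()
plt.show()
```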

The summary statistics of both the synthetic data and the real data also appear to be satisfactory.

Summary Statistics of Real Data and Synthetic Data | Image by Author
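
One quick way to tabulate such a comparison is to place the pandas describe() output of both data sets side by side:

```python
# Sketch: side-by-side summary statistics of real vs. synthetic data
import pandas as pd

summary = pd.concat(
    {"real": real_data.describe().T, "synthetic": synthetic_data.describe().T},
    axis=1,
)
print(summary.round(2))
```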

Conclusion

The training process of CTGAN is expected to converge to a point where the generated synthetic data becomes indistinguishable from the real data. However, in reality, convergence cannot be guaranteed. Several factors can affect the convergence of CTGAN, including the choice of hyperparameters, the complexity of the data, and the architecture of the models. Furthermore, the instability of the training process can lead to mode collapse, where the generator produces only a limited set of similar samples instead of exploring the full diversity of the data distribution.
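
If training proves unstable or shows signs of mode collapse, the main levers exposed by the ctgan package are the epoch count, batch size, learning rates, and network widths. The values below are illustrative assumptions, not tuned recommendations; the author's full code is in the gist linked below.

```python
# Sketch: CTGAN hyperparameters that commonly affect convergence
# (values are illustrative, not tuned recommendations).
from ctgan import CTGAN

model = CTGAN(
    epochs=5000,                   # more epochs often improve fidelity
    batch_size=500,                # must be a multiple of pac (default pac=10)
    generator_dim=(256, 256),      # generator hidden-layer sizes
    discriminator_dim=(256, 256),  # discriminator hidden-layer sizes
    generator_lr=2e-4,
    discriminator_lr=2e-4,
    verbose=True,
)
model.fit(real_data, discrete_columns)
```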

https://gist.github.com/rayisl78/2b26bb0d3f673fac097d67a06965e67f

About the Author:

Dr. Ray Islam (Mohammad R Islam) is a Data Scientist (AI and ML) and Advisory Specialist Leader at Deloitte, USA. He holds a PhD in Engineering from the University of Maryland, College Park, MD, USA, and has worked with major companies like Lockheed Martin and Raytheon, serving clients such as NASA and the US Air Force. Ray also has an MASc in Engineering from Canada, an MSc in International Marketing, and an MBA from the UK. He is the Editor-in-Chief of the upcoming peer-reviewed International Research Journal of Ethics for AI (INTJEAI), and his research interests include generative AI, augmented reality, XAI, and ethics in AI.

Link: https://blog.umd.edu/rayislam/
