There are several reasons why synthetic data can be useful for machine learning:

  1. Synthetic data can be used to train machine learning models when real data is unavailable or difficult to obtain.
  2. Synthetic data can be used to augment real data, which can improve the performance of machine learning models.
  3. Synthetic data can be used to test machine learning models when real data is unavailable or does not represent the conditions in which the model will be used.
  4. Synthetic data can be used to test machine learning models when real data is sensitive or confidential and cannot be shared.

In the Kaggle Tabular Playground Series, the synthetic data was generated with CTGAN, a deep learning generative network. CTGAN is described in the paper "Modeling Tabular Data using Conditional GAN" (Xu et al., 2019).

The Synthetic Data Vault (SDV), an MIT initiative, created the technology and a number of tools around it. These tools generate synthetic data that mimics the statistical properties of real data.
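To make "synthetic data that mimics real data" concrete, here is a minimal sketch of the fit-then-sample workflow using only the Python standard library. It is not CTGAN and not the SDV API: it is a deliberately naive stand-in that models each column independently (a normal distribution for numeric columns, observed frequencies for discrete ones). The function names `fit_marginals` and `sample_rows`, as well as the toy rows, are made up for illustration.

```python
import random
import statistics
from collections import Counter


def fit_marginals(rows, discrete_columns):
    """Summarise each column on its own.

    Discrete columns keep their category counts; numeric columns keep
    their mean and sample standard deviation.
    """
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if column in discrete_columns:
            counts = Counter(values)
            model[column] = ("discrete", list(counts.keys()), list(counts.values()))
        else:
            model[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
    return model


def sample_rows(model, n, seed=None):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for column, spec in model.items():
            if spec[0] == "discrete":
                _, categories, weights = spec
                row[column] = rng.choices(categories, weights=weights, k=1)[0]
            else:
                _, mean, std = spec
                row[column] = rng.gauss(mean, std)
        synthetic.append(row)
    return synthetic


# Toy "real" table, invented purely for this example.
real_rows = [
    {"age": 23, "income": 31000.0, "segment": "A"},
    {"age": 35, "income": 48000.0, "segment": "B"},
    {"age": 41, "income": 52000.0, "segment": "B"},
    {"age": 29, "income": 39000.0, "segment": "A"},
    {"age": 52, "income": 61000.0, "segment": "B"},
    {"age": 47, "income": 58000.0, "segment": "A"},
    {"age": 38, "income": 45000.0, "segment": "B"},
    {"age": 31, "income": 36000.0, "segment": "A"},
]

model = fit_marginals(real_rows, discrete_columns={"segment"})
synthetic_rows = sample_rows(model, n=1000, seed=0)

# Inspect a few synthetic rows.
for row in synthetic_rows[:3]:
    print(row)
```

Because every column is sampled independently, this sketch reproduces per-column statistics but loses the relationships between columns (for example, between age and income). A generative model such as CTGAN aims to learn the joint distribution of the table, which is why it is preferred for realistic tabular data.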