Creating synthetic data from a CSV

Dedomena helps generate synthetic observations from real data in a matter of minutes, hours or days. For data scientists testing a machine learning model and business analysts building reports on fictional data alike, Dedomena's software offers flexibility, ease of use and robustness for tasks where synthetic data is needed to fuel business outcomes, comply with regulations or protect data privacy. Managers, students, engineers, data scientists and others can use the NUCLEUS synthetic data generation technology with little training.

The key features of Dedomena's software for generating synthetic data, and the steps needed to get started, are outlined below:

1. Go to the NUCLEUS tab

If no subscription is active, the welcome view appears, where the free trial or a subscription to one of the NUCLEUS plans can be activated.

Once the free trial or subscription is active, the welcome view is replaced by the view for managing, creating or uploading synthesizers.

The next step is to upload a CSV to the platform through the ADD DATASET button in the MY DATA space in the AXON tab.

2. Upload a CSV through AXON

In AXON, in the MY DATA space, clicking the ADD DATASET button shows the options to start the CSV upload process. The MY DATA space is where all of the user's datasets are privately available regardless of their source, whether uploaded, connected or acquired.

For this tutorial, select the first option: Upload a CSV file to synthesize.

Only four steps are needed to upload a dataset and make it available on the platform. This example uses a publicly available dataset: Kaggle - Stroke Prediction Dataset.

Step 1: Browse and select the CSV

Next, the name and description of the dataset to be uploaded are defined.

After that, a preview is available, and it is possible to change the name of each variable by checking the box This table does not contain a header.

Step 2: Define column properties

The software automatically detects the data type of each variable, but it is recommended to validate each type and change it where needed, based on the data characteristics and the domain knowledge that only the user has.
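As a rough illustration of what type detection involves and why manual validation matters, the sketch below guesses a coarse type per column using only Python's standard library. It is not Dedomena's detection logic: the function name `infer_column_types`, the three type labels and the list of missing-value markers are all assumptions made for this example.

```python
import csv
import io

# Order matters: a column is widened to the most general type seen so far.
_TYPE_ORDER = ["integer", "float", "text"]


def _classify(value):
    """Return the narrowest type label that can represent one value."""
    for label, caster in (("integer", int), ("float", float)):
        try:
            caster(value)
            return label
        except ValueError:
            continue
    return "text"


def infer_column_types(csv_text, missing=("", "NA", "N/A")):
    """Guess 'integer', 'float' or 'text' for every column of a CSV string.

    Illustrative only: the labels and missing-value markers are assumptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {}
    for row in reader:
        for name, value in row.items():
            if name is None or value is None or value.strip() in missing:
                continue  # ragged rows and missing values do not affect the guess
            label = _classify(value.strip())
            previous = seen.get(name, "integer")
            seen[name] = max(previous, label, key=_TYPE_ORDER.index)
    # Columns that only contain missing values fall back to 'text'.
    return {name: seen.get(name, "text") for name in reader.fieldnames or []}
```

A binary flag stored as 0/1 is labeled `integer` by this sketch even though it is really categorical, which is the kind of case where the user's review of the detected types is needed.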

At this point the user can mark which variables are the target, the primary key or sensitive, together with their type, if applicable.

More than one dataset can be uploaded, in which case the relationships between the tables/datasets can be defined in this step through foreign keys.

Step 3: Table relationships

If more than one dataset was uploaded, in this step the user can verify and validate the relationships between the tables/datasets.
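One check that can be run locally before relying on a relationship is whether every foreign-key value in the child table points at an existing primary key in the parent table. The sketch below shows that idea; the table and column names (`patients`, `visits`, `patient_id`) are hypothetical, and this is not how the platform itself validates relationships.

```python
def orphaned_foreign_keys(parent_rows, primary_key, child_rows, foreign_key):
    """Return child foreign-key values that match no parent primary key."""
    parent_ids = {row[primary_key] for row in parent_rows}
    child_ids = {row[foreign_key] for row in child_rows}
    return sorted(child_ids - parent_ids)


# Hypothetical two-table example, shaped like rows from csv.DictReader.
patients = [{"id": "1"}, {"id": "2"}]
visits = [
    {"visit_id": "a", "patient_id": "1"},
    {"visit_id": "b", "patient_id": "3"},  # no patient with id 3
]
orphans = orphaned_foreign_keys(patients, "id", visits, "patient_id")
```

An empty result means every child row has a matching parent row; any value returned is worth fixing in the source data before uploading.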

Step 4: Dataset metadata

Finally, some metadata about the dataset is requested to enable further filtering and advanced searching.

3. Create the synthesizer to generate synthetic data

If the data was uploaded correctly, the dataset appears for selection after clicking CREATE SYNTHESIZER. General information such as the number of rows and columns and the metadata dimensions is shown.

From this view, it is possible to execute a synthesization Run using the computing power of the Dedomena cloud, which runs on top-tier GPUs and is subject to a time limit (a maximum of 24 hours of execution). To execute more computationally expensive or time-consuming Runs, an on-premise component is provided: NUCLEUS EDGE.

Step 1: Select dataset

To continue, check the corresponding dataset in the include column and click Next.

Note: The dataset needs at least 1000 rows to be eligible for synthesization. Depending on the plan, there is also a limit on the maximum number of rows.
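The row requirement can be checked locally before uploading. The following is a minimal sketch that uses only the Python standard library; the 1000-row minimum comes from the note above, while the `max_rows` parameter is a placeholder because the actual upper limit depends on the plan. The function names are made up for this example.

```python
import csv

MIN_ROWS = 1000  # minimum stated in the note above


def count_data_rows(csv_file):
    """Count non-empty data rows in an open CSV file, excluding the header."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header line
    return sum(1 for row in reader if row)


def row_count_ok(csv_file, max_rows=None):
    """Check the row count against MIN_ROWS and an optional plan limit.

    max_rows is a placeholder: the real limit depends on the plan.
    Returns a (is_valid, row_count) tuple.
    """
    n_rows = count_data_rows(csv_file)
    within_max = max_rows is None or n_rows <= max_rows
    return (n_rows >= MIN_ROWS and within_max), n_rows
```

With a real file, pass a handle opened with `open(path, newline="")`, as the `csv` module documentation recommends.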

Step 2: Select columns

Depending on the plan, a maximum number of columns is allowed for synthesization; in this example the limit was 10. Columns can be included in or excluded from the specific Run configuration using the arrows.

Step 3: Data Profiling

Before training the synthesization algorithm, Dedomena provides a data profiling section where users can inspect the dataset to be synthesized and get information such as the number of missing or unique values per column, distribution plots, and statistics such as the minimum, average or maximum.
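To make the profiling figures concrete, the sketch below computes the same kinds of per-column values (missing count, unique count, and minimum, mean and maximum for numeric columns) with the Python standard library. It only approximates what such a profile contains; the function name `profile_columns` and the missing-value markers are assumptions, and the platform's own profiling may define these quantities differently.

```python
import csv
import io
import statistics


def profile_columns(csv_text, missing=("", "N/A")):
    """Summarize each column: missing, unique, and numeric min/mean/max.

    Illustrative only: the missing-value markers are assumptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    profile = {}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        present = [v for v in values if v is not None and v.strip() not in missing]
        summary = {
            "missing": len(values) - len(present),
            "unique": len(set(present)),
        }
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = []  # at least one value is not numeric: skip statistics
        if numbers:
            summary.update(
                min=min(numbers),
                mean=statistics.fmean(numbers),
                max=max(numbers),
            )
        profile[name] = summary
    return profile
```

Comparing such a local summary with the profiling section is a quick way to confirm that the upload was parsed as expected, for example that missing values were recognized as missing.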

Step 4: Run parameters

The last step is to define the Run parameters to create the synthesizer including the following information:

Synthesizer name.
Synthesizer description.
Algorithm to be trained on the selected data.
Use case that matches the purpose or application of the synthesizer to be created.
Number of epochs. Depending on the data complexity, some datasets require more epochs to properly learn the information contained in the variables.
Batch size.
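The parameters listed above can be pictured as a small configuration record. The sketch below is a hypothetical representation, not a Dedomena API: the class name, the field names and the example values are all assumptions. The `total_steps` helper assumes standard mini-batch training, where one epoch is one full pass over the rows, to show how epochs and batch size together determine the amount of training work.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunParameters:
    """Hypothetical container for the Run parameters listed above."""

    name: str
    description: str
    algorithm: str
    use_case: str
    epochs: int
    batch_size: int

    def __post_init__(self):
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.epochs < 1:
            raise ValueError("epochs must be at least 1")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")

    def total_steps(self, n_rows):
        """Number of batches processed, assuming one full pass per epoch."""
        return self.epochs * math.ceil(n_rows / self.batch_size)


# Placeholder values; real algorithm and use case names come from the UI.
params = RunParameters(
    name="stroke-synthesizer",
    description="Tutorial run",
    algorithm="example-algorithm",
    use_case="example-use-case",
    epochs=100,
    batch_size=500,
)
steps = params.total_steps(5110)  # 100 epochs * ceil(5110 / 500) batches
```

Under that assumption, raising the number of epochs increases the work linearly, while a larger batch size reduces the number of steps per epoch.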

4. NUCLEUS Synthesizers

If the Run was launched successfully, a view appears with a button to go to the Synthesizers tab in NUCLEUS.

Once in Synthesizers, the status, logs and progress of the Run can be checked while the synthesizer is being created.

The information can be updated with the Refresh button. The Run can be canceled by clicking the cross icon.

The logs make it possible to track the synthesization process step by step.

From this view it is possible to access each synthesizer, whether it was:

Uploaded (trained locally on-premises) or
Created directly in the platform (cloud)

Accessing a completed (or failed) synthesizer gives its information: logs, synthetic data evaluation metrics, the synthetic data generation tab, etc.

Step 1: Generating Synthetic Data

To access the synthesizer information, click the Access Synthesizer icon (the document with an arrow) once the Run is completed. There are three tabs:

Run: general information like the training time, synthesization algorithm trained, number of epochs, etc.
Evaluations: Dedomena scores at a glance (Quality, Utility, Privacy) and a button to download the report, which extensively analyzes and compares the real and synthetic data to determine how good and secure the synthetic data is.
Generate: generate and download the synthetic data.

Next, move to the Generate tab and select the number of synthetic rows to be generated. Once the process finishes, select the format of the synthetic data file and download the synthetic dataset using the Download button to the right.
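After downloading, a quick local sanity check is to confirm that the synthetic file contains the columns that were selected for the Run. The sketch below assumes the download was made in CSV format with a header row; the function names and the example column names are hypothetical.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first line of a CSV string."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def compare_columns(selected_columns, synthetic_csv_text):
    """List selected columns missing from the synthetic file, and extras."""
    synthetic = read_header(synthetic_csv_text)
    return {
        "missing": [c for c in selected_columns if c not in synthetic],
        "unexpected": [c for c in synthetic if c not in selected_columns],
    }
```

Both lists being empty means the synthetic file has exactly the selected columns; this checks structure only and says nothing about data quality, which is what the Evaluations tab covers.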

Congratulations! An end-to-end data synthesization and generation process has been completed in a few steps, producing best-in-class synthetic data.
