
Nucleus - Synthetic Data Generation

Introduction

Building on state-of-the-art Deep Learning research, Dedomena has developed proprietary generative algorithms based on custom deep neural network architectures, complemented with ML and probabilistic models to address issues such as outlier privacy and rare variable types (for example, bank transaction descriptions). These algorithms generate synthetic data of the highest quality from real structured data.

The artificial data generated with Dedomena keeps the statistics, distributions, underlying patterns, multivariate relations and relevant value of the original, making it possible to replace the real data and, for example, conduct analysis, test applications or train ML models on it. Because the resulting entities and users are synthetic, and thanks to our extra layer of privacy constraints and validations, the resulting synthetic data is fully anonymized: no information, behavior, entity or user from the original data can be re-identified.

Dedomena provides two approaches to generating synthetic data, depending on where the training computation is done: fully cloud or hybrid. In both cases the synthetic data generation itself is performed in the cloud. Our software is deployed and runs only on European Google Cloud servers, in regions with a low carbon footprint.

Nucleus is the core component of Dedomena's platform. It is used to train synthesizers that are later used to generate unlimited copies of synthetic data. These synthesizers can be created in Dedomena's platform (by uploading data through the Axon section) or in the data source environment, through the Nucleus Edge component.

Through Nucleus, companies at every stage are able to train synthesizers from various algorithms with just a few parameters or lines of code, generating realistic synthetic data for many use cases.

Algorithms

Dedomena offers 4 synthesization algorithms:

Generic: Learns from structured tabular data from any industry where recurrent patterns or other complex date-based patterns are not the main characteristic that the synthetic data must preserve. This algorithm supports fine-tuning of pre-trained synthesizers.

Transactional: Learns from event/transaction-based data (irregular time series, i.e. without a constant timestamp interval between events or transactions). This algorithm is intended for transactional datasets, so certain columns are essential for it to work correctly. It brings extra parameters for bank transactional data, a special type of transactional data, achieving superior performance when generating bank transactions with text descriptions, among other capabilities.

Timeseries: As the name suggests, it is used to learn from data that follows a time series structure with equally spaced timestamps. Variables like the user/product id and the date of the series are mandatory to guarantee excellent global and entity-level results. It is flexible enough to learn monthly, weekly, daily or hourly time series. Other variables related to the event/transaction can be learned too.

Relational: The relational algorithm is designed to synthesize multi-table datasets. The tables must adhere to the following conditions:

  1. All tables should be connected in some way. Disconnected tables can be synthesized separately.
  2. There should be no missing references (also known as orphan rows). If table A references table B, then every reference must be found. Otherwise, the row will be removed and not generated. It's okay if a parent row doesn't have children.
  3. There cannot be cyclic references. A table cannot reference itself, and if table A references table B, then table B cannot reference table A.
  4. Every foreign key must be a primary key of the table it references.
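
For example, a minimal sketch (using pandas and hypothetical tables, not part of Nucleus Edge) of what a missing reference from condition 2 looks like:

import pandas as pd

# Hypothetical two-table example: every order must reference an existing customer.
customers = pd.DataFrame({'customer_id': [1, 2, 3]})
orders = pd.DataFrame({'order_id': [10, 11, 12], 'customer_id': [1, 2, 4]})

# Orphan rows: orders whose customer_id does not exist in customers (here, customer_id 4).
# Such rows would be removed and not generated by the relational algorithm.
orphans = orders[~orders['customer_id'].isin(customers['customer_id'])]
print(orphans)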

Nucleus (Cloud)

To generate synthetic data using the cloud version of Nucleus refer to: Quickstart.

Nucleus Edge

Nucleus Edge is a Python library that can be downloaded from the Edge section in the NUCLEUS tab, allowing flexible deployment of multiple Nucleus nodes across multiple and different environments. The user only needs to specify the OS and the Python version (3.9 or 3.10). The token needed to make the library usable can also be copied from this view.

Installation

Nucleus Edge allows users to easily install Dedomena's software and execute synthesizing runs with nothing more than a Python environment; it works even on a laptop. Nucleus Edge is installed like any other Python library, such as Pandas or NumPy.

Steps:

  • Install pip if not already installed
    # Debian/Ubuntu
    
    sudo apt update
    sudo apt -y install python3-pip
    
  • Create a Python virtual environment (see the Python documentation: venv — Creation of virtual environments).
  • Use pip to install the Nucleus Edge Python library:
    pip install nucleus-edge-2.0.1-python-39-x86_64-linux-gnu.tar.gz
    
  • Create a folder where the synthesizers will be persisted (for example, results), granting adequate access to the users that will use Nucleus Edge.
    mkdir results
    chmod 755 results
    

Configuration

Once Nucleus Edge is installed in the Python environment, the user can create synthesization runs with just a few lines of code. Several configuration parameters and algorithms are provided to tune the synthesization process to the data and use case.

These parameters include algorithm, batch size, number of epochs, impute missing values, amongst others. Optionally, you can also replace column names, decide to do the computation on CPU or GPU, change data types as well as other dataset and run configurations, allowing you to generate clean and useful data to cover specific data needs.

Input: Nucleus Edge will accept Parquet and CSV format files as input data.

Output: An encrypted file, persisted in the defined output_dir, that needs to be uploaded to the Dedomena platform to generate synthetic data. This file does NOT include any original information or sensitive data; it only contains the synthesizer used to generate artificial data.
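
As a quick preview, here is a minimal sketch of a run; the parameter values are illustrative and the complete per-algorithm examples appear in the Code Examples by Algorithm section below.

from nucleus import synthesizer

# Illustrative values only; see the per-algorithm examples for complete runs.
synthesizer(data_dir='data/my_dataset.csv',          # hypothetical input file
            data_format='CSV',
            token='1234',                             # token copied from the Edge section
            algorithm='generic',
            batch_size=256,
            epochs=150,
            synthesizer_name='my_synthesizer',
            synthesizer_description='first test run',
            output_dir='results/')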

Parameters

The following parameters are common to all the algorithms.

  • data_dir: string. Path to the dataset file or database. Examples: "/myfolder/data_dir/file.csv", "postgresql://username:password@server:port/database". Available data sources:
    • PostgreSQL
    • Oracle
    • SQLite
    • MySQL
    • MariaDB
    • Amazon Redshift
    • Microsoft SQL Server
    • Azure SQL Database
    • Google Cloud Big Query
    • IBM DB2
  • data_format: string. Input data format. Four options: CSV, PARQUET, MTX or DATABASE.
  • token: string. The token provided by Dedomena that makes Nucleus Edge operative.
  • algorithm: string. 'generic', 'transactional', 'timeseries' or 'relational'.
  • query: string. Only available when data_format is DATABASE. Name of the table you want to retrieve or SQL query to obtain the data, for example, "SELECT * FROM table".
  • batch_size: int (power of 2: 128, 256, 512, etc.). Size of the batches.
  • epochs: int (values between 100-300 are recommended for generic and transactional). Number of epochs.
  • synthesizer_name: string. Synthesizer's name defined by the user.
  • synthesizer_description: string. Synthesizer's description defined by the user.
  • amplify: string. 'default', 'quality'. Only for the generic and transactional algorithms. When amplify='quality', synthetic data quality is boosted at the cost of slightly decreasing the privacy of the resulting data with respect to the data used to create the synthesizer, producing a synthesizer that generates better synthetic data in terms of quality. When specified, the number of epochs has to be greater than or equal to 150. The default configuration always seeks to maximize privacy.
  • impute: bool. Impute missing values when True. Otherwise, it will learn and maintain the distribution of missing values for each variable, generating synthetic data with missing values consistent with the original data.
  • categorical_columns: array. Discrete columns' names in the dataset.
  • date_columns: array or dict. Array with the names of the date columns in the dataset. Example: ['date_column_1', 'date_column_2', ...]. It is also possible to specify the format string for each date column by passing a dictionary with column names as keys and format strings as values, instead of an array with just the column names. Example: {'date_column1': 'str_format1', 'date_column2': 'str_format2', ...}. The format strings follow standard Python strftime notation (e.g. '%Y-%m-%d').
  • integer_columns: array. Integer columns' names in the dataset.
  • boolean_columns: array. Boolean columns' names in the dataset.
  • float_columns: array. Float columns' names in the dataset.
  • text_columns: dictionary. A dictionary including the name of the text columns in the dataset as keys and the instruction to generate the text for each one as values, for example {'column_name': 'Replace hospital names in the text with <HOSPITAL>'}. If no instructions are given, by default, personally identifiable information (PII) is replaced with synthetic data. In this case, {'column_name': ' '} should be used.
  • output_dir: string. Path where the encrypted file will be persisted.
  • max_categories: int. Maximum number of distinct values that a categorical variable can have. If the variable has more categories than the maximum, the less common ones will be assigned to a new category, “others”.
  • min_freq_categories: int. Minimum number of occurrences that any category of a categorical variable should have. Categories below this threshold will be grouped into a single category called “Others - DM”.
  • num_cat: int. Only available when amplify='default'. Number of the most frequent values to consider as categorical in numerical variables. It is used to generate with better precision values that frequently appear in some numerical variables such as common prices, temperatures, amounts, the number zero, among others. If it is equal to 5 for example, it will treat the 5 most frequent values in the numerical variable as categorical.
  • cuda: bool. When True it will use GPU for computation (it has to be available in the system), otherwise CPU.
  • target: string. Name of the target variable to evaluate for the utility and predictive power score, based on the classical machine learning definition of target. If it's None, the algorithm selects a target randomly from the available variables. If the target is numeric, bins will be created from it and passed as categorical. Dedomena fits baseline ML algorithms to analyze the utility and predictive power of the data, so in many real scenarios the desired target cannot be predicted from the other available variables.
  • columns_mapping: dict. Dictionary specifying the mapping for the column names to assign which columns to use for: user_id, cat_id, concept, txn_date, amount, balance. If a column is not specified, it will be searched in the dataset with the default name. The dictionary must have the following keys equivalent to the initial definition.
  • transform_descriptions: string. None, 'level1', 'level2', 'level3'. Only available for the generic and transactional algorithms when a concept/description variable is present.

    • None: The descriptions will not be transformed while synthesizing and therefore the same descriptions of the real data will be generated.
    • level1: It will synthesize dates, card numbers, account numbers (IBANs) and amounts present in the text of the descriptions.
    • level2: Everything in level1, plus it will synthesize names of people, addresses, cities, present in the text of the descriptions.
    • level3: Everything in level2, plus it will synthesize merchant names present in the text of the descriptions. It will generate merchants from similar industries (e.g. McDonald's / Burger King / Five Guys, Iberia / Air Europe / Ryanair).
  • balance_updated: bool or None. Only for transactional algorithm. If there is in the data a balance variable named "balance" and balance_updated=True, then the balance corresponding to each transaction will be considered to be updated, that is, balance[i]=balance[i-1]+amount[i]. If balance_updated=False, the balance will be considered not updated, balance[i]=balance[i-1]+amount[i-1]. It will be considered that there are no missing transactions, that is, all the balances generated will meet one of the previous conditions. If balance_updated=None, the balance will be considered as a float column.

  • constraints: list. List of strings defining constraints between columns. For example:

    • col1<->col2<->col3: Columns col1, col2, and col3 have fixed combinations, such as description, subcategory, and category. Any number of columns can be defined.
    • col1<=col2: All values in col1 must be less than or equal to col2.
    • col1>10: All values in col1 must be greater than 10.
    • col1>0: All values in col1 must be positive.
    • col1<col2<col3: All values in col2 must be less than those in col3 and greater than those in col1.
    • col1//1000: All values in col1 must be divisible by 1000.
  • sensitive: dict. Specify the sensitive variables and their sensitive type. Formed by a dictionary where the keys are the name of the sensitive variable and the value is the type.

The supported types of sensitive variables are:

  • address: A physical or mailing address.
  • city: A city name.
  • country: A country name.
  • country_code: An 'alpha-2' country code.
  • postcode: A postal code.
  • street_address: A street address composed of the number and the name.
  • street_name: A street name.
  • license_plate: A car license plate.
  • vin: A vehicle identification number (VIN).
  • aba: An American Bankers Association routing number (ABA).
  • bank_country: An ISO 3166-1 alpha-2 country code of the bank provider.
  • bban: A Basic Bank Account Number (BBAN).
  • iban: An International Bank Account Number (IBAN).
  • swift: A randomly generated Society for Worldwide Interbank Financial Telecommunication (SWIFT) code of variable length.
  • swift11: An 11-digit Society for Worldwide Interbank Financial Telecommunication (SWIFT) code.
  • swift8: An 8-digit Society for Worldwide Interbank Financial Telecommunication (SWIFT) code.
  • ean: A European Article Number (EAN) of variable length.
  • ean13: A 13-digit European Article Number (EAN).
  • ean8: An 8-digit European Article Number (EAN).
  • localized_ean: A localized EAN barcode.
  • localized_ean13: A 13-digit localized EAN barcode.
  • localized_ean8: An 8-digit localized EAN barcode.
  • company: A company name.
  • credit_card_expire: A credit card expiry date.
  • credit_card_full: A complete credit card, composed of the provider, associated name, card number, expiration date, and CVC code.
  • credit_card_number: A credit card number.
  • credit_card_provider: A credit card provider.
  • credit_card_security_code: A credit card security code.
  • coordinate: A coordinate.
  • latitude: A latitude.
  • longitude: A longitude.
  • company_email: A company email address.
  • domain_name: An Internet domain name.
  • email: An email address.
  • hostname: A domain name assigned to a host computer.
  • ipv4: A random IPv4 address or network with a valid CIDR.
  • ipv4_private: A private IPv4.
  • ipv4_public: A public IPv4 excluding private blocks.
  • ipv6: An IPv6 address or network with a valid CIDR.
  • mac_address: A MAC address.
  • url: A Uniform Resource Locator (URL).
  • passport_number: A passport number.
  • first_name: A first name.
  • last_name: A last name.
  • name: A person's first and last name.
  • country_calling_code: A country calling code.
  • msisdn: An MSISDN code.
  • phone_number: A landline or cellphone number.
  • ssn: A US social security number.

Sensitive variables of any other type will be replaced by 0, 1, 2, ...
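
For illustration, a hedged sketch of how the constraints and sensitive parameters might be filled in (the column names are hypothetical) and then passed to synthesizer(..., constraints=constraints, sensitive=sensitive):

# Hypothetical column names, for illustration only.
constraints = [
    'concept<->subcategory<->category',  # these three columns only appear in fixed combinations
    'amount>0',                          # all amounts must be positive
    'min_amount<=amount',                # every min_amount must be less than or equal to amount
]

sensitive = {
    'customer_name': 'name',     # replaced with synthetic first and last names
    'customer_iban': 'iban',     # replaced with synthetic IBANs
    'customer_email': 'email',   # replaced with synthetic email addresses
}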

Variables and specific parameters

Transactional

This algorithm is intended for transactional datasets, so certain columns are essential for it to work correctly. These columns are:

  • Mandatory:
    • txn_date (transaction date) in date_columns.
    • amount (transaction amount) in float_columns.
    • concept (transaction description/concept or description id) in categorical_columns.
  • Optional:
    • user_id (User ID) in categorical_columns.
    • cat_id (Transaction Spending or Income Category ID) in categorical_columns.
    • balance (Balance before or after the transaction) in float_columns.

Note that if the dataset does not contain one of these columns, other columns must be mapped to it. To do this, you can rename those columns in the dataset or use the columns_mapping parameter. For example, to use the Kaggle dataset from the examples below, the recommended mapping is:

columns_mapping = {'cat_id':'BRANCH_ID',
                   'user_id':'SUPPLIER_ID', 
                   'amount':'LTV', 
                   'txn_date':'DISBURSAL_DATE'}

Timeseries

  • time_step: string. Time step of the time series: the fixed difference in time between consecutive dates of the series. Available values are the pandas offset aliases (see the pandas "Time series / date functionality" documentation). The value 'MS' means there is a month between two consecutive dates. IMPORTANT: for each time_step there must be only one observation (for example, if daily ('D'), only one observation/row per day). Default value is 'MS'.
  • series_length: int. Defines the length of the time series based on the time_step; it represents the number of time_step periods contained in the dataset. For example, if time_step='MS', then series_length should be a multiple of 12 (12, 24, 36, 48); make sure the data corresponds to these parameters. Not all series in the data need to have observations for complete years (for example, a series could have observations for only 11 of the 12 months of a specific year). Default value is 12.
  • static_columns: list. Columns that don't change over time and have the same value for all observations of the same user_id.

Relational

This algorithm has additional required parameters necessary for its proper functioning. To enhance the quality and efficiency of the algorithm, the number of epochs is set to 100 by default and can't be modified.

  • datasets_country: string. Specifies the country of origin of the data. It is used to generate sensitive variables.
  • datasets_config: dictionary. Define the configuration of each table. The dictionary must have the following keys equivalent to the initial definition:

    1. data_dir: string. Path to the dataset file.
    2. data_format: string. Input data format. Four options: CSV, PARQUET, MTX or DATABASE.
    3. categorical_columns: array. Discrete columns' names in the dataset.
    4. date_columns: array. Date columns' names in the dataset.
    5. integer_columns: array. Integer columns' names in the dataset.
    6. boolean_columns: array. Boolean columns' names in the dataset.
    7. float_columns: array. Float columns' names in the dataset.
    8. target: string. Name of the target variable to evaluate for the utility and predictive power score, based on the classical machine learning definition of target. If it's None, the algorithm selects a target randomly from the available variables. If the target is numeric, bins will be created from it and passed as categorical. Dedomena fits baseline ML algorithms to analyze the utility and predictive power of the data, so in many real scenarios the desired target cannot be predicted from the other available variables.
    9. primary_key: str. Primary key of the table.
    10. foreign_key: dict. Specify the foreign keys. Formed by a dictionary where the keys are the name of the foreign key and the value is the table from which it comes.

Train the synthesizer

After all parameters are set up for the selected algorithm, it is time to train it and create the synthesizer. During training, Dedomena's algorithms learn the data patterns, statistical distributions, correlations and time dependencies. In the process, the privacy, quality and utility checks are performed and the QA report is generated. After the run is completed, the resulting synthesizer is packed into an encrypted zip file that is then uploaded to Dedomena's servers to generate synthetic copies of the data.

Upload the synthesizer and generate synthetic data

When a run is completed and the synthesizer created, users need to upload the encrypted file to Dedomena's platform through the SYNTHESIZERS section. Through the ADD SYNTHESIZER button, users can upload the encrypted file to register the synthesizer with its associated information, metrics and report, and generate artificial data. Work with confidence: data quality is assured.

Code Examples by Algorithm

Generic Example

from nucleus import synthesizer

synthesizer(data_dir='dir/folder/train.parquet',
            data_format='PARQUET', 
            token='1234', 
            algorithm='generic',
            batch_size=256, 
            epochs=150, 
            synthesizer_name='my_synthesizer', 
            synthesizer_description='kaggle dataset based synthesizer', 
            impute=True, 
            categorical_columns=['BRANCH_ID', 'MANUFACTURER_ID', 'EMPLOYMENT_TYPE', 'STATE_ID', 
                                 'PERFORM_CNS_SCORE_DESCRIPTION', 'PRI_ACTIVE_ACCTS', 'PRI_OVERDUE_ACCTS', 
                                 'NEW_ACCTS_IN_LAST_SIX_MONTHS'], 
            integer_columns=['DISBURSED_AMOUNT', 'SUPPLIER_ID', 'CURRENT_PINCODE_ID', 'ASSET_COST', 
                             'PERFORM_CNS_SCORE', 'PRIMARY_INSTAL_AMT', 'NO_OF_INQUIRIES',
                             'PRI_CURRENT_BALANCE', 'PRI_SANCTIONED_AMOUNT', 'PRI_DISBURSED_AMOUNT'], 
            boolean_columns=['AADHAR_FLAG', 'PAN_FLAG', 'VOTERID_FLAG', 'DRIVING_FLAG', 'PASSPORT_FLAG'], 
            float_columns=['LTV'], 
            output_dir='results/',
            max_categories=None,
            min_freq_categories=None,
            cuda=True,
            amplify="quality")

Transactional Example

Below you can find an example of the transactional algorithm using the same dataset from Kaggle: Vehicle Loan Default Prediction.

Remember that you need to make the necessary transformations in the dataset to make it compatible with the algorithm, or use the columns_mapping parameter.

For example, this is the code to use the dataset from Kaggle:

from nucleus import synthesizer

synthesizer(data_dir='dir/folder/train.parquet',
            data_format='PARQUET', 
            token='1234', 
            algorithm='transactional',
            batch_size=256, 
            epochs=100, 
            synthesizer_name='my_synthesizer', 
            synthesizer_description='kaggle dataset based synthesizer', 
            impute=True, 
            categorical_columns=['BRANCH_ID', 'MANUFACTURER_ID', 'EMPLOYMENT_TYPE', 'STATE_ID', 
                                 'PERFORM_CNS_SCORE_DESCRIPTION', 'PRI_ACTIVE_ACCTS', 'PRI_OVERDUE_ACCTS', 
                                 'NEW_ACCTS_IN_LAST_SIX_MONTHS'], 
            date_columns=['DISBURSAL_DATE'], 
            integer_columns=['SUPPLIER_ID', 'DISBURSED_AMOUNT', 'CURRENT_PINCODE_ID', 'ASSET_COST', 
                             'PERFORM_CNS_SCORE', 'PRIMARY_INSTAL_AMT', 'NO_OF_INQUIRIES',
                             'PRI_CURRENT_BALANCE', 'PRI_SANCTIONED_AMOUNT', 'PRI_DISBURSED_AMOUNT'], 
            boolean_columns=['AADHAR_FLAG', 'PAN_FLAG', 'VOTERID_FLAG', 'DRIVING_FLAG', 'PASSPORT_FLAG'], 
            float_columns=['LTV'], 
            output_dir='results/',
            max_categories=None,
            min_freq_categories=None,
            columns_mapping={'cat_id': 'BRANCH_ID',
                             'user_id': 'SUPPLIER_ID',
                             'amount': 'LTV',
                             'txn_date': 'DISBURSAL_DATE'},
            cuda=True,
            amplify="default")

Time Series Example

Example of execution:

from nucleus import synthesizer

synthesizer(data_dir='dir/folder/train.parquet', 
            data_format='PARQUET',
            token='1234', 
            algorithm='timeseries',
            batch_size=1024, 
            epochs=300, 
            synthesizer_name='event_txns_synthesizer', 
            synthesizer_description='event txns dataset based synthesizer', 
            categorical_columns=['A', 'B'], 
            date_columns=['txn_date'], 
            integer_columns=[], 
            boolean_columns=[], 
            float_columns=['C'],
            static_columns=['B'], 
            output_dir='results/',
            max_categories=None,
            cuda=True,
            time_step='MS', 
            series_length=12)

Relational Example

Example of a datasets_config:

datasets_config = {
                    'table1_name': {
                                    'data_dir': 'path/to/table1.csv',
                                    'data_format':'CSV',
                                    'algorithm': 'transactional',
                                    'columns_mapping':{
                                                        'user_id': 'col1',
                                                        'txn_date':'col3',
                                                        'concept':'col2',
                                                        'amount':'col6'
                                                        },
                                    'categorical_columns': ['col1', 'col2'],
                                    'date_columns': ['col3'],
                                    'integer_columns': ['col4'],
                                    'boolean_columns': ['col5'],
                                    'float_columns': ['col6'],
                                    'foreign_key': {'col7':'table2'},
                                    'primary_key':'col1',
                                    'sensitive': {'col2':'first_name'},
                                    'transform_descriptions':'level2',
                                    'target':'col2',
                                    'constraints':['col1<->col2<->col3']
                                    },
                    'table2_name': {
                                    'data_dir': 'path/to/table2.parquet',
                                    'data_format':'PARQUET',
                                    'algorithm': 'generic',
                                    'categorical_columns': ['col1', 'col2'],
                                    'date_columns': ['col3'],
                                    'integer_columns': ['col4'],
                                    'boolean_columns': ['col5'],
                                    'float_columns': ['col6'],
                                    'foreign_key': {},
                                    'primary_key':'col1',
                                    'sensitive': {'col2':'name'},
                                    'target': None,
                                    'impute':False
                                    },
                  }

Example of execution:

from nucleus import synthesizer

synthesizer(token='1234', 
            algorithm='relational',
            batch_size=256, 
            datasets_config=datasets_config,
            synthesizer_name='my_synthesizer', 
            synthesizer_description='kaggle dataset based synthesizer', 
            output_dir='results/',
            datasets_country = 'Spain',
            cuda=True)

Retrain a synthesizer

Nucleus Edge provides the ability to fine-tune pre-trained synthesizers. The input parameters to retrain a model are as follows:

  • data_dir: string. Path to the dataset file.
  • data_format: enum. Input data format. Two options: CSV or PARQUET.
  • token: string. The token provided by Dedomena that makes Nucleus Edge operative.
  • epochs: int (values between 200-350 are recommended). Number of epochs to retrain.
  • model_dir: string. Path to the .zip file containing the pre-trained model.
  • impute: bool. Always True.
  • output_dir: string. Path where the encrypted file will be persisted.
  • max_categories: int. Maximum number of distinct values that a categorical variable can have. If the variable has more categories than the maximum, the less common ones will be assigned to a new category, “others”.
  • min_freq_categories: int. Minimum number of occurrences that any category of a categorical variable should have. Categories below this threshold will be grouped into a single category called “Others - DM”.
  • cuda: bool. When True it will use GPU for computation (it has to be available in the system), otherwise CPU. (Even if the model has been trained on CPU, it can be retrained on GPU.)

An example of retraining a model is as follows:

from nucleus import retrain_synthesizer

retrain_synthesizer(data_dir='dir',
                    data_format='PARQUET', 
                    token='1234', 
                    epochs=10, 
                    model_dir='dir/synthesizer.zip',
                    impute=True, 
                    output_dir='results/',
                    max_categories=None, 
                    min_freq_categories=None,
                    cuda=False)

The retraining dataset can be different from the dataset used in the first training; however, it must keep the same columns (with the same names) as the original training dataset.

The output will be a protected .zip with the retrained model.

Evaluate two datasets from Nucleus Edge
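
Nucleus Edge also provides a helper that compares a real dataset with a synthetic one and returns the privacy, quality and utility scores: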

from nucleus import synthesizer_evaluations

privacy_score, quality_score, utility_score = synthesizer_evaluations(real_data='dir/', 
                                                                      synthetic_data='dir/',
                                                                      categorical_columns=[], 
                                                                      token = token,
                                                                      date_columns=[], 
                                                                      integer_columns=[], 
                                                                      boolean_columns=[], 
                                                                      float_columns=[],
                                                                      output_dir='dir/'
                                                                     )

API

Dedomena provides a REST API for consuming synthetic data from user synthesizers, enabling a variety of integrations and powering real-time use cases. It allows generating 5,000 synthetic rows per call. The API only works with synthesizers trained with the latest available version of Nucleus Edge.
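
As an illustration only, the sketch below shows the general shape of such a call using Python's requests library; the actual endpoint URL, authentication scheme and payload fields are not documented here and must be taken from the Dedomena platform.

import requests

# Placeholder values: replace the URL, token and payload with the ones provided by Dedomena.
API_URL = 'https://<dedomena-api-host>/<synthetic-data-endpoint>'

response = requests.post(API_URL,
                         headers={'Authorization': 'Bearer <your_token>'},  # hypothetical auth header
                         json={'synthesizer': 'my_synthesizer',             # hypothetical payload
                               'rows': 5000},                               # 5,000 rows per call
                         timeout=60)
response.raise_for_status()
synthetic_rows = response.json()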

Advanced features for higher quality synthetic data

In this section, we delve into some of the advanced features that can be used to generate higher-quality synthetic data. Nucleus allows you to create more accurate and effective synthetic data thanks to the following advanced features:

  • Outliers management: Outliers and noise are unavoidable in real-life datasets and therefore pose a privacy problem. The mechanism for managing outliers is another important contribution of Dedomena's synthetic data generation software: it finds a balance in the trade-off between privacy (completely removing outliers) and data quality/utility (not losing relevant information or statistical properties), keeping outliers present in the synthetic data while avoiding privacy-related issues. The synthesizers can produce outliers of controlled shape and distribution, keeping the value of the real data.

  • Linear / non-linear relationships: With all our synthesization algorithms you will be able to deliver a consistent synthetic data stream, maintaining both linear and non-linear relationships in the data, which in turn lays the groundwork for more robust downstream analytics and AI model training. Generated data will also respect preconditions and dependencies among master/reference data and keep import sequences correct.

  • Computational efficiency: Training the deep learning models that produce synthesizers is in general very compute-intensive; however, the proprietary architecture of Nucleus allows them to be trained efficiently in a reasonable time, optimizing resources by limiting user interaction to the selection of a few parameters to create the synthesizers, GPU acceleration, synthesizer fine-tuning, etc., depending on the desired quality of the dataset and its application.

  • Referential integrity: Our solution keeps the integrity across variables and datasets as in the original data, so all synthetic references are valid. The synthetic primary and foreign keys also maintain the structure and implicit behaviour of the real entities or users represented in the different feature spaces across all the datasets. Nucleus is able to synthesize two or more linked tables, preserving the relationships, inter-table patterns and statistical information among the different sets of data. Dedomena AI provides an extra feature that allows datasets from different countries to be mixed with existing reference and master data, even when the synthesization process is performed in separate locations and environments.

  • Bivariate and trivariate analysis: Nucleus is capable of replicating the structure and probabilistic distributions, as well as the bivariate and multivariate empirical relationships, of the real datasets' variables. Statistical distributions are preserved with a consistency of at least 99%, while bivariate and trivariate relationships are preserved with a consistency of at least 98% and 96% respectively.

  • Conditional Data Generation: [Coming Soon] Nucleus algorithms support model conditioning, which enables the synthesizer to generate more records that match a certain class or label, versus simply recreating the distribution that it was trained on. This feature can be used to balance class distributions in datasets for more accurate or ethically fair machine learning.

