---
license: mit
datasets:
- demo-org/diabetes
- scikit-learn/adult-census-income
- leostelon/california-housing
- vitaliykinakh/heloc
- vitaliykinakh/sick
- vitaliykinakh/travel
metrics:
- accuracy
---
This repository contains the official models from the paper "[Tabular Data Generation using Binary Diffusion](https://arxiv.org/abs/2409.13882)",
accepted to the [3rd Table Representation Learning Workshop @ NeurIPS 2024](https://table-representation-learning.github.io/).
# Abstract

Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive.
Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed
data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we
introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary
representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary
data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary
cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter
tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets,
demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes
datasets while being significantly smaller in size.
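The ideas named in the abstract (a fixed-size binary encoding of each row, XOR-based noising, and a binary cross-entropy loss) can be illustrated with a small standard-library sketch. This is not the authors' implementation: the helper names, the 32-bit float encoding, the index-based category encoding, and the flip probability are all assumptions made for illustration, and the known noise mask stands in for what a trained denoiser would have to predict.

```python
import math
import random
import struct


def encode_float(x: float) -> list[int]:
    """Return the 32-bit IEEE 754 pattern of x, most significant bit first."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    return [(as_int >> i) & 1 for i in range(31, -1, -1)]


def decode_float(bits: list[int]) -> float:
    """Inverse of encode_float (exact for values representable in float32)."""
    as_int = 0
    for b in bits:
        as_int = (as_int << 1) | b
    (x,) = struct.unpack(">f", struct.pack(">I", as_int))
    return x


def encode_category(index: int, n_categories: int) -> list[int]:
    """Encode a category index with just enough bits to cover all categories."""
    width = max(1, (n_categories - 1).bit_length())
    return [(index >> i) & 1 for i in range(width - 1, -1, -1)]


def decode_category(bits: list[int]) -> int:
    """Inverse of encode_category."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def random_mask(n_bits: int, flip_prob: float, rng: random.Random) -> list[int]:
    """Sample a binary noise mask; a 1 marks a bit that will be flipped."""
    return [1 if rng.random() < flip_prob else 0 for _ in range(n_bits)]


def xor_bits(a: list[int], b: list[int]) -> list[int]:
    """Element-wise XOR; applying the same mask twice restores the input."""
    return [x ^ y for x, y in zip(a, b)]


def binary_cross_entropy(probs: list[float], targets: list[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between predicted bit probabilities and target bits."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


# Hypothetical row: one continuous value and one categorical value (index 2 of 3).
rng = random.Random(0)
row_bits = encode_float(0.25) + encode_category(2, n_categories=3)
noise = random_mask(len(row_bits), flip_prob=0.3, rng=rng)  # flip_prob is arbitrary
noisy = xor_bits(row_bits, noise)       # noise addition
recovered = xor_bits(noisy, noise)      # noise removal with the same mask
```

Because XOR is its own inverse, `recovered` equals `row_bits` whenever the mask is known exactly; in a learned model the mask (or the clean bits) would have to be estimated, which is where a loss such as `binary_cross_entropy` would come in.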
# Results

The table below reports the **Binary Diffusion** results for each dataset. LR, DT, and RF denote the downstream models used for evaluation (linear or logistic regression, decision tree, and random forest), and **Params** is the size of the generative model. Scores are shown as **mean ± standard deviation**. California Housing is a regression dataset, so its row is on a different scale from the classification accuracies (in %) in the other rows.

| **Dataset**             | **LR (Binary Diffusion)** | **DT (Binary Diffusion)** | **RF (Binary Diffusion)** | **Params** |
|-------------------------|---------------------------|---------------------------|---------------------------|------------|
| **Travel**              | **83.79 ± 0.08**          | **88.90 ± 0.57**          | **89.95 ± 0.44**          | **1.1M**   |
| **Sick**                | 96.14 ± 0.63              | **97.07 ± 0.24**          | 96.59 ± 0.55              | **1.4M**   |
| **HELOC**               | 71.76 ± 0.30              | 70.25 ± 0.43              | 70.47 ± 0.32              | **2.6M**   |
| **Adult Income**        | **85.45 ± 0.11**          | **85.27 ± 0.11**          | **85.74 ± 0.11**          | **1.4M**   |
| **Diabetes**            | **57.75 ± 0.04**          | **57.13 ± 0.15**          | 57.52 ± 0.12              | **1.8M**   |
| **California Housing**  | *0.55 ± 0.00*             | 0.45 ± 0.00               | 0.39 ± 0.00               | **1.5M**   |

---
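A "mean ± standard deviation" entry like those in the Results table is typically obtained by aggregating the scores of several repeated runs. The sketch below shows that aggregation under stated assumptions: the `summarize` helper is hypothetical, the scores are made-up numbers rather than results from the paper, and the use of the sample standard deviation is a guess, since the convention used for the table is not stated here.

```python
import statistics


def summarize(scores: list[float]) -> str:
    """Format repeated-run scores as 'mean ± std' with two decimals."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1); an assumption
    return f"{mean:.2f} ± {std:.2f}"


# Made-up accuracies (in %) from three hypothetical runs.
print(summarize([90.0, 91.0, 92.0]))  # prints: 91.00 ± 1.00
```

If the population standard deviation is wanted instead, `statistics.pstdev` can be swapped in for `statistics.stdev`.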
# Citation

```
@article{kinakh2024tabular,
  title={Tabular Data Generation using Binary Diffusion},
  author={Kinakh, Vitaliy and Voloshynovskiy, Slava},
  journal={arXiv preprint arXiv:2409.13882},
  year={2024}
}
```