---
title: "PUcopulaSynth: end-to-end workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{PUcopulaSynth: end-to-end workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

`PUcopulaSynth` provides a compact pipeline to:

1.  preprocess a data frame (factor handling and dummy encoding),

2.  fit a **Partition-of-Unity (PU) copula** dependence model,

3.  estimate marginal distributions (logspline or empirical),

4.  generate synthetic data, and

5.  restore original factor structure.

The package has no DataSHIELD or Python dependencies.

## Installation

```{r eval=FALSE}
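# CRAN dependencies, if not already installed
install.packages(c("logspline", "RANN", "dplyr", "tidyselect", "caret"))
# PUcopula from GitHub (requires devtools)
devtools::install_github("amaendle/PUcopula")
# install the package in the current working directory (e.g. a local checkout)
devtools::install()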
```

## A tiny example

We’ll create a small toy dataset with mixed types: a continuous variable, a count, an unordered factor, and an ordered factor.

```{r}
set.seed(1)
toy <- data.frame(
  x = rnorm(200),
  y = rpois(200, 2),
  z = factor(sample(c("a", "b", "c"), 200, TRUE)),
  w = ordered(sample(1:4, 200, TRUE))
)
str(toy)
```

## 1) Preprocess

-   Multi-level **unordered** factors become dummy variables (tagged with `.cat.`).

-   Multi-level **ordered** factors may be encoded as columns tagged with `.lev.`.

-   Remaining factor columns are kept and renamed with `.oriname`; their **levels** are recorded.

```{r setup}
library(PUcopulaSynth)

pre <- preprocessData(toy)
# first few column names after encoding (head() avoids NA padding if there are fewer than 6)
head(names(pre$data), 6)
str(pre$original_levels)
```
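
To make the dummy-encoding step more concrete, here is a minimal base-R sketch of one-hot encoding an unordered factor. It is not the package's implementation: the helper name `one_hot_sketch` is hypothetical, and the column names it produces only mimic the `.cat.` tagging described above, so the actual output of `preprocessData()` may differ.

```{r}
# Hypothetical helper (not part of PUcopulaSynth): one-hot encode a single factor.
one_hot_sketch <- function(f, name) {
  stopifnot(is.factor(f))
  # "- 1" drops the intercept, so every level gets its own 0/1 column.
  m <- model.matrix(~ f - 1)
  # Assumed naming scheme, chosen to resemble the `.cat.` tag.
  colnames(m) <- paste0(name, ".cat.", levels(f))
  as.data.frame(m)
}

demo_factor <- factor(c("a", "b", "c", "a"))
one_hot_sketch(demo_factor, "z")
```

Each row has exactly one indicator set to 1, which is what lets the original factor be reconstructed from the dummy columns later.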

## 2) Fit the PU copula

You can control:

-   `driver_strength_factor` (row-based scale for driver strength; scalar or per-variable),

-   `bin_size` (rank-binning smoothness; scalar, vector or named list),

-   `jitter` (FALSE, single numeric, or named list), and

-   `family` (e.g. `"binom"` or `"nbinom"`).

```{r}
cop <- fitPUcopula(
  data = pre$data,
  driver_strength_factor = 0.5,
  bin_size = 3,
  jitter = FALSE,
  family = "binom"
)
cop
```
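
Copula models are generally fitted to rank-transformed data rather than to the raw values. The sketch below shows that generic transformation in base R. The helper `pseudo_obs_sketch` is hypothetical and does not reproduce what `fitPUcopula()` does internally; it only illustrates the kind of rank scale that options such as `bin_size` and `jitter` refer to.

```{r}
# Hypothetical helper: map each column to the open interval (0, 1) via scaled ranks.
pseudo_obs_sketch <- function(df) {
  n <- nrow(df)
  # Dividing by n + 1 keeps the values strictly inside (0, 1).
  as.data.frame(lapply(df, function(col) rank(col, ties.method = "average") / (n + 1)))
}

demo_num <- data.frame(a = c(2.5, 0.1, 7.3, 4.0), b = c(10, 30, 20, 40))
pseudo_obs_sketch(demo_num)
```

Discrete columns such as counts produce tied ranks under this transformation, which is one common motivation for adding a small amount of jitter before fitting.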

## 3) Estimate marginals

-   Numeric/ordered variables use **logspline** by default.

-   Binary/trivial variables fall back to **empirical probability tables**.

-   Optional k-NN smoothing via `k`.

```{r}
marg <- estimateMarginals(pre$data, method = "spline", k = 3)
names(marg)
```
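
For intuition about the empirical fallback, the following base-R sketch builds a probability table for a 0/1 variable and samples from it. The object names are illustrative only, and the structure that `estimateMarginals()` actually stores for such variables may look different.

```{r}
# Illustrative data: a binary variable with three 0s and seven 1s.
demo_bin <- c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1)

# Empirical probability of each observed value.
prob_table <- prop.table(table(demo_bin))
prob_table

# Draw new values with those probabilities.
demo_draws_bin <- sample(
  as.numeric(names(prob_table)),
  size = 5,
  replace = TRUE,
  prob = as.numeric(prob_table)
)
demo_draws_bin
```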

## 4) Generate synthetic data

-   Combines copula draws with the marginals’ inverse transformations.

-   Optionally restores factor structure using `original_levels`, `original_varnames`, and `original_classes`.

```{r}
syn <- generateSynthetic(
  n = 1000,
  copula = cop,
  marginals = marg,
  original_levels = pre$original_levels,
  original_varnames = names(toy),
  original_classes = sapply(toy, class)
)

str(syn)
head(syn)
```
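
The inverse-transformation idea can be sketched in a few lines of base R: values on the uniform scale are pushed through a quantile function to land back on the data scale. The helper `inverse_transform_sketch` is hypothetical and uses the empirical quantile function as a stand-in; with logspline marginals the package would use a smooth quantile function instead.

```{r}
# Hypothetical helper: map uniform values to the data scale via empirical quantiles.
inverse_transform_sketch <- function(u, observed) {
  # type = 1 inverts the empirical CDF, so only observed values are returned.
  as.numeric(quantile(observed, probs = u, type = 1, names = FALSE))
}

demo_obs <- c(3, 1, 4, 1, 5, 9, 2, 6)
demo_u <- c(0.05, 0.5, 0.95)
inverse_transform_sketch(demo_u, demo_obs)
```

Because the mapping is monotone, the dependence carried by the copula draws is preserved while each column takes on its own marginal distribution.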

## 5) Quick checks

Compare simple summaries of the original and synthetic data.

```{r}
summary(toy)
summary(syn)
```

You can also visualize marginal distributions:

```{r}
op <- par(mfrow = c(1,2))
hist(toy$x, main = "Original x", xlab = "x")
hist(syn$x, main = "Synthetic x", xlab = "x")
par(op)
```
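
Marginal summaries and histograms do not show whether the dependence between variables is preserved. One simple check, sketched here in base R, compares Spearman rank correlations of the numeric columns. The helper `spearman_gap_sketch` is hypothetical and not part of the package; the demonstration uses stand-in data so the chunk is self-contained.

```{r}
# Hypothetical helper: largest absolute difference between the Spearman
# correlation matrices of the numeric columns shared by both data frames.
spearman_gap_sketch <- function(orig, synth) {
  shared <- intersect(names(orig), names(synth))
  is_num <- vapply(
    shared,
    function(v) is.numeric(orig[[v]]) && is.numeric(synth[[v]]),
    logical(1)
  )
  num <- shared[is_num]
  stopifnot(length(num) >= 2)
  gap <- cor(orig[num], method = "spearman") - cor(synth[num], method = "spearman")
  max(abs(gap))
}

# Stand-in data with deliberately opposite dependence.
demo_orig  <- data.frame(p = 1:10, q = (1:10)^2)
demo_synth <- data.frame(p = 1:10, q = 10:1)
spearman_gap_sketch(demo_orig, demo_synth)
```

Assuming `syn` carries the same column names as `toy`, calling `spearman_gap_sketch(toy, syn)` would give a single number where values near 0 indicate similar rank dependence.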

## Tips & notes

-   Factors with only 2 levels are treated as **binary** and modeled via empirical probabilities.

-   For small samples, consider slightly larger `bin_size` or modest `jitter` to stabilise fits.

-   For integers in the original data, `generateSynthetic()` rounds and casts back to `integer`.
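
As a small illustration of the rounding note, this base-R sketch shows the generic round-then-cast pattern. It is an assumption about the general idea, not a copy of the code inside `generateSynthetic()`.

```{r}
# Illustrative continuous draws for a variable that was integer in the original data.
demo_draws <- c(0.2, 1.7, 2.6, 3.49)

# Round first, then cast: as.integer() alone would truncate toward zero.
restored <- as.integer(round(demo_draws))
restored
class(restored)
```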

## Session info

```{r}
sessionInfo()
```