---
title: "PUcopulaSynth: end-to-end workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{PUcopulaSynth: end-to-end workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

`PUcopulaSynth` provides a compact pipeline to:

1. preprocess a data frame (factor handling and dummy encoding),
2. fit a **Partition-of-Unity (PU) copula** dependence model,
3. estimate marginal distributions (logspline or empirical),
4. generate synthetic data, and
5. restore the original factor structure.

The package has no DataSHIELD or Python dependencies.

## Installation

```{r eval=FALSE}
# CRAN dependencies, if not already installed
install.packages(c("logspline", "RANN", "dplyr", "tidyselect", "caret"))
# then install PUcopula from GitHub
devtools::install_github("amaendle/PUcopula")
# and install the package in the current working directory
devtools::install()
```

## A tiny example

We’ll create a small toy dataset with mixed types.

```{r}
set.seed(1)
toy <- data.frame(
  x = rnorm(200),
  y = rpois(200, 2),
  z = factor(sample(c("a", "b", "c"), 200, TRUE)),
  w = ordered(sample(1:4, 200, TRUE))
)
str(toy)
```

## 1) Preprocess

- Multi-level **unordered** factors become dummy variables (tagged with `.cat.`).
- Multi-level **ordered** factors may become `.lev.` variables.
- Remaining factor columns are kept and renamed with `.oriname`; their **levels** are recorded.

```{r setup}
library(PUcopulaSynth)
pre <- preprocessData(toy)
names(pre$data)[1:6]
str(pre$original_levels)
```

## 2) Fit the PU copula

You can control:

- `driver_strength_factor` (row-based scale for driver strength; scalar or per-variable),
- `bin_size` (rank-binning smoothness; scalar, vector or named list),
- `jitter` (`FALSE`, a single numeric, or a named list), and
- `family` (e.g. `"binom"` or `"nbinom"`).

```{r}
cop <- fitPUcopula(
  data = pre$data,
  driver_strength_factor = 0.5,
  bin_size = 3,
  jitter = FALSE,
  family = "binom"
)
cop
```

## 3) Estimate marginals

- Numeric/ordered variables use **logspline** by default.
- Binary/trivial variables fall back to **empirical probability tables**.
- k-NN smoothing is optional and controlled by `k`.

```{r}
marg <- estimateMarginals(pre$data, method = "spline", k = 3)
names(marg)
```

## 4) Generate synthetic data

- `generateSynthetic()` combines copula draws with the marginals’ inverse transformations.
- It optionally restores the factor structure using `original_levels`, `original_varnames`, and `original_classes`.

```{r}
syn <- generateSynthetic(
  n = 1000,
  copula = cop,
  marginals = marg,
  original_levels = pre$original_levels,
  original_varnames = names(toy),
  original_classes = sapply(toy, class)
)
str(syn)
head(syn)
```

## 5) Quick checks

Compare simple summaries of the original and synthetic data.

```{r}
summary(toy)
summary(syn)
```

You can also visualize marginal distributions:

```{r}
op <- par(mfrow = c(1, 2))
hist(toy$x, main = "Original x", xlab = "x")
hist(syn$x, main = "Synthetic x", xlab = "x")
par(op)
```

## Tips & notes

- Factors with only two levels are treated as **binary** and modeled via empirical probabilities.
- For small samples, consider a slightly larger `bin_size` or a modest `jitter` to stabilise fits.
- Columns that are integers in the original data are rounded and cast back to `integer` by `generateSynthetic()`.

## Session info

```{r}
sessionInfo()
```
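
## Appendix: a dependence check

The quick checks in step 5 compare marginal summaries only, and a copula model is mainly about dependence. A minimal sketch of a complementary check compares Spearman rank-correlation matrices of the numeric columns. The helper `compare_spearman()` is hypothetical and not part of `PUcopulaSynth`; it uses base R only, and the stand-in data frames exist solely so the chunk runs on its own.

```{r}
# Hypothetical helper (NOT part of PUcopulaSynth): compares Spearman
# rank-correlation matrices of the numeric columns shared by two data frames.
compare_spearman <- function(orig, synth) {
  # keep only columns present and numeric in both data frames
  shared <- intersect(names(orig), names(synth))
  is_num <- vapply(
    shared,
    function(v) is.numeric(orig[[v]]) && is.numeric(synth[[v]]),
    logical(1)
  )
  num_cols <- shared[is_num]
  stopifnot(length(num_cols) >= 2)

  r_orig  <- cor(orig[num_cols],  method = "spearman")
  r_synth <- cor(synth[num_cols], method = "spearman")

  list(
    original     = r_orig,
    synthetic    = r_synth,
    max_abs_diff = max(abs(r_orig - r_synth))
  )
}

# Stand-in data (an assumption for illustration only); in the workflow
# above, `toy` and `syn` would be passed instead.
set.seed(42)
make_demo <- function(n) {
  x <- rnorm(n)
  data.frame(
    x = x,
    y = 0.7 * x + rnorm(n, sd = 0.5),
    g = factor(sample(c("a", "b"), n, TRUE))  # non-numeric, ignored by the helper
  )
}
demo_orig  <- make_demo(200)
demo_synth <- make_demo(1000)

chk <- compare_spearman(demo_orig, demo_synth)
round(chk$original, 2)
round(chk$synthetic, 2)
chk$max_abs_diff
```

With the objects from this vignette, the call would be `compare_spearman(toy, syn)`. A small `max_abs_diff` suggests the rank dependence between numeric variables was preserved. Factor columns are skipped here and would need a separate check, for example cross-tabulations.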