| Title: | DataSHIELD Server Functions for Partition-of-Unity Copula Based Synthesis |
|---|---|
| Description: | Implements the server-side components required to generate privacy-protecting synthetic data in the DataSHIELD infrastructure. The package orchestrates the preprocessing, copula fitting, synthetic data generation and privacy scoring workflows by combining R and Python tooling. |
| Authors: | Andreas Mändle [aut, cre] |
| Maintainer: | Andreas Mändle <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-21 10:49:35 UTC |
| Source: | https://github.com/amaendle/dsPUcopula |

Given the preprocessed data, this function fits marginal models for each
column. Continuous variables are modelled using logspline densities and
discrete variables through empirical probability tables.

estimateMarginalsDS(data_str, method = "spline", k = 3)

| data_str | Character name of the processed data object on the server. |
|---|---|
| method | Character vector controlling the estimation method for numeric and ordered categorical variables. Currently only "spline" is supported for numeric variables. |
| k | Numeric or list specifying the smoothing neighbourhood used before fitting the marginals. |

A named list of marginal models compatible with
generateSyntheticDS().
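
The empirical probability table used for discrete variables can be illustrated in base R. This is a minimal sketch and not the package's implementation: the column `x` and the helper `draw_level()` are hypothetical, and the logspline fit for continuous variables is left out.

```r
# Hypothetical discrete column; not data from the package.
x <- factor(c("a", "b", "b", "c", "c", "c"), levels = c("a", "b", "c"))

# Empirical probability table: relative frequency of each level.
prob_table <- prop.table(table(x))

# Inverse-CDF lookup: map uniform values in (0, 1) to levels.
cum_probs <- cumsum(as.numeric(prob_table))
draw_level <- function(u) {
  idx <- findInterval(u, cum_probs, left.open = TRUE) + 1L
  idx <- pmin(idx, length(cum_probs))  # guard against rounding at u = 1
  factor(names(prob_table)[idx], levels = names(prob_table))
}

drawn <- draw_level(c(0.10, 0.40, 0.90))
print(drawn)
```

Uniform inputs of this kind are what a copula sample provides, which is why a probability table is enough to describe a discrete marginal.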

fitPUcopulaDS() is the main server-side entry point used by DataSHIELD to
build the Partition-of-Unity (PU) copula-based synthesiser. It evaluates the
symbol provided by the client, optionally applies jittering and binning to
satisfy disclosure control constraints, and fits a
PUcopula::PUCopula model describing the dependence structure.

fitPUcopulaDS(data_str, driver_strength_factor = 0.5, bin_size = 3, jitter = FALSE, family = "binom")

| data_str | Character name of the processed server-side training data object. |
|---|---|
| driver_strength_factor | Numeric scalar or vector controlling the driver strength passed to |
| bin_size | Numeric or list specifying the smoothing bin size applied to the ranks of each variable. When a single value is provided it is recycled across variables. |
| jitter | Logical, numeric or named list controlling the amount of numeric jittering applied to the input columns. |
| family | Character string selecting the driver distribution passed to |

This function represents the first step of the synthetic data generation
workflow. The fitted copula can later be combined with separately estimated
marginal distributions (see estimateMarginalsDS()) to simulate synthetic
data.

The function operates on server-side data within a DataSHIELD environment.
Prior to fitting, optional smoothing (via bin_size) and perturbation (via
jitter) can be applied to reduce disclosure risks.

Only the copula model is estimated here. Marginal distributions must be
fitted separately using estimateMarginalsDS(). Synthetic data can then be
generated using generateSyntheticDS(), optionally including privacy and
utility scores.

A fitted PUcopula::PUCopula object representing the dependence
structure of the data.

estimateMarginalsDS(), simulateCopulaDS(),
generateSyntheticDS(), PUcopula::PUCopula()
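
The effect of jittering and rank binning before a copula fit can be sketched in base R. This is an illustration under assumptions, not the package's code: the jitter amount, pooling consecutive ranks into bins and replacing them by the bin mean are all choices made here for the example.

```r
set.seed(1)

# Hypothetical numeric column; not data from the package.
x <- c(5.1, 3.2, 8.7, 4.4, 6.0, 7.3)

# Optional jitter: small uniform noise (the amount is an assumption).
x_jit <- x + runif(length(x), min = -0.05, max = 0.05)

# Rank the values, then pool consecutive ranks into bins of size bin_size
# and replace each rank by the mean rank of its bin.
bin_size <- 3
r <- rank(x_jit, ties.method = "first")
bin_id <- ceiling(r / bin_size)
r_binned <- ave(as.numeric(r), bin_id, FUN = mean)

# Rescale to the open unit interval (pseudo-observations).
u <- r_binned / (length(x) + 1)
print(u)
```

After binning, records in the same bin share one pseudo-observation, so an individual rank can no longer be read off the input to the copula fit.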

Combines the fitted copula and marginal models to draw synthetic data. The helper also orchestrates the privacy evaluation workflow by calling the Python-based scoring helpers when requested.

generateSyntheticDS(n = "n_rSynthetic", copula_str = "PU_copula_model", marginals_str = "marginal_models", training_data = "D_ori", control_data = "D_control", singling_out_check = TRUE, inference_check = TRUE, inference_check_ignore_na = FALSE, syndat_scores = TRUE, return_scores = FALSE)

| n | Number of records to generate or a character name pointing to an object containing that number. |
|---|---|
| copula_str | Character name of the fitted copula object. |
| marginals_str | Character name of the list of fitted marginal models. |
| training_data | Character name of the training data set used to fit the synthesiser. |
| control_data | Character name of the hold-out control data set. |
| singling_out_check | Logical; when |
| inference_check | Logical; when |
| inference_check_ignore_na | Logical; passed to |
| syndat_scores | Logical; evaluate utility scores using |
| return_scores | Logical; when |

A synthetic data.frame. When return_scores = TRUE, a list with the
synthetic data and the associated privacy/utility scores.
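
Combining copula draws with marginal models amounts to an inverse transform of each column. The sketch below shows the idea in base R under stated assumptions: a Gaussian copula stands in for the PU copula, empirical quantiles stand in for the fitted marginals, and the data frame `ori` is made up.

```r
set.seed(42)
n <- 200

# Hypothetical training columns; not data from the package.
ori <- data.frame(
  age = rnorm(500, mean = 50, sd = 10),
  income = rlnorm(500, meanlog = 10, sdlog = 0.5)
)

# Stand-in for copula draws: a Gaussian copula with correlation 0.6.
rho <- 0.6
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
z <- matrix(rnorm(2 * n), ncol = 2) %*% chol(sigma)
u <- pnorm(z)  # each column is uniform on (0, 1)

# Inverse transform: push the uniforms through each marginal's quantiles.
syn <- data.frame(
  age = quantile(ori$age, probs = u[, 1], names = FALSE),
  income = quantile(ori$income, probs = u[, 2], names = FALSE)
)
print(head(syn))
```

The dependence between columns comes only from `u`, while the value range of each column comes only from its marginal, which is why the two parts can be fitted separately.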

Undo the dummy encoding produced by preprocessDataDS() by projecting the
synthetic dummy variables back to the closest original factor levels.

postprocessDataDS(data_str, cat_dummy_levels_str)

| data_str | Character name of the data object that contains the dummy encoded variables. |
|---|---|
| cat_dummy_levels_str | Character name of the metadata object storing the dummy variable structure returned by |

A data.frame whose factor variables are restored to their original
levels and ordering.
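
Projecting synthetic dummy values back to the closest factor level can be sketched as a row-wise arg-max. This assumes one indicator column per level (no dropped reference level); in that case the largest dummy value identifies the nearest one-hot vector in Euclidean distance. The level names and the matrix `syn_dummies` are hypothetical.

```r
# Hypothetical stored levels, in their original order.
levels_color <- c("blue", "green", "red")

# Hypothetical synthetic dummy columns: values are no longer exact 0/1.
syn_dummies <- matrix(
  c(0.9, 0.2, 0.1,
    0.1, 0.3, 0.8,
    0.4, 0.7, 0.0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, levels_color)
)

# For each row, pick the level whose dummy column is largest.
idx <- max.col(syn_dummies, ties.method = "first")
color <- factor(levels_color[idx], levels = levels_color)
print(color)
```

Passing the stored levels to `factor()` keeps the original level set and ordering even when a level does not occur in the synthetic sample.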

Functions that capture the original structure of the training data, prepare it for copula fitting and later restore the factor levels of the synthetic outputs.

| data_str | Character name of the data object stored on the DataSHIELD server. |
|---|---|
| cat_dummy_levels_str | Character name of the metadata object returned by |

save_original_varnamesDS() returns a character vector of column names.

save_original_classesDS() returns the associated classes.

preprocessDataDS() returns a list containing the processed data set and
metadata describing factor levels.

postprocessDataDS() returns a data.frame with factor variables
reconstructed from dummy encodings.

This helper prepares the data for copula fitting. It converts categorical variables into dummy variables, stores the original levels and renames factor columns so that post-processing can restore the input structure.

preprocessDataDS(data_str)

| data_str | Character name of the data object on the server. |
|---|---|

A list with the processed data (data) and the original_levels
metadata required by postprocessDataDS().
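
A dummy encoding that keeps the original levels as metadata might look as follows in base R. This is a sketch under assumptions: the `<variable>_<level>` column naming, the helper `encode_one()` and the input `df` are hypothetical, and only the element names `data` and `original_levels` follow the return value described above.

```r
# Hypothetical input with one numeric and one factor column.
df <- data.frame(
  age = c(34, 51, 27),
  color = factor(c("red", "blue", "red"), levels = c("blue", "green", "red"))
)

is_fac <- vapply(df, is.factor, logical(1))

# Metadata: the levels of every factor column, in their original order.
original_levels <- lapply(df[is_fac], levels)

# One 0/1 indicator column per level, named <variable>_<level>.
encode_one <- function(name) {
  f <- df[[name]]
  m <- vapply(levels(f), function(l) as.numeric(f == l), numeric(length(f)))
  colnames(m) <- paste(name, levels(f), sep = "_")
  m
}
dummies <- do.call(cbind, lapply(names(df)[is_fac], encode_one))

processed <- list(
  data = cbind(df[!is_fac], dummies),
  original_levels = original_levels
)
print(processed$data)
```

Levels that do not occur in the data (here "green") still get a column, so the stored metadata and the encoded columns stay aligned.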

The inference attack estimates the risk of inferring secret attributes from synthetic records using auxiliary quasi-identifiers. This function acts as a thin wrapper around the Anonymeter Python API.

py_anonymeter_Inference(ori, syn, aux_cols, secret, inference_check_ignore_na = FALSE, control = NULL, return_evaluator = FALSE)

| ori | The training data set used to fit the synthesis model. |
|---|---|
| syn | The synthetic data set to be assessed. |
| aux_cols | Character vector naming the auxiliary columns used by the attacker. |
| secret | Character scalar giving the sensitive column for which the disclosure risk should be evaluated. |
| inference_check_ignore_na | Logical; when |
| control | Optional control data set used to benchmark the attack. |
| return_evaluator | Logical; when |

Either the list of risk metrics returned by Anonymeter or the
evaluator object when return_evaluator = TRUE.

This function calls the Anonymeter Python package to quantify the singling-out risk of synthetic data. It is primarily used on the DataSHIELD server to evaluate privacy risks before releasing data to the client.

py_anonymeter_SinglingOut(ori, syn, control = NULL, return_evaluator = FALSE)

| ori | The training data set used to fit the synthesis model. |
|---|---|
| syn | The synthetic data set to be assessed. |
| control | An optional control data set. When supplied, the control population is used as a baseline for the attack simulation. |
| return_evaluator | Logical; when |

Either a list with risk metrics (the default) or the evaluator object
returned by Anonymeter when return_evaluator = TRUE.

The function bridges to the syndat Python package to evaluate different
quality metrics between an original and a synthetic data set.

py_syndat_scores(ori, syn, control = NULL)

| ori | A |
|---|---|
| syn | A |
| control | A |

A named list with distribution, discrimination and correlation scores for both the original and the control data.

Similar to save_original_varnamesDS(), this helper stores the original
column classes so that the synthetic output can be coerced back to the same
types once post-processing is completed.

save_original_classesDS(data_str)

| data_str | Character name of the data object on the server. |
|---|---|

A character vector describing the classes of the input columns.
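
Recording column names and classes and later coercing a synthetic data set back to them can be sketched in base R. The data frames `ori` and `syn` and the coercion rules (rounding for integers, a 0.5 threshold for logicals) are assumptions made for this example and are not taken from the package.

```r
# Hypothetical original data; not data from the package.
ori <- data.frame(
  id = 1:3,
  score = c(1.5, 2.5, 3.5),
  flag = c(TRUE, FALSE, TRUE)
)

# Store the structure before any preprocessing changes it.
orig_names <- names(ori)
orig_classes <- vapply(ori, function(col) class(col)[1], character(1))

# Hypothetical synthetic output: all numeric, generic column names.
syn <- data.frame(
  V1 = c(2.2, 1.7, 3.1),
  V2 = c(1.9, 3.0, 2.4),
  V3 = c(0.8, 0.1, 0.6)
)

# Restore the names, then coerce each column back to its stored class.
names(syn) <- orig_names
for (nm in orig_names) {
  syn[[nm]] <- switch(orig_classes[[nm]],
    integer = as.integer(round(syn[[nm]])),
    logical = syn[[nm]] >= 0.5,
    numeric = as.numeric(syn[[nm]]),
    syn[[nm]]  # unknown classes are left unchanged
  )
}
str(syn)
```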

Helper used in the DataSHIELD workflow to store the original variable names before preprocessing steps modify them.

save_original_varnamesDS(data_str)

| data_str | Character name of the data object on the server. |
|---|---|

A character vector with the column names of the original data.

This helper creates or updates the Python environment declared in the
package's Config/reticulate field so that the Python packages required for
the disclosure control checks are available.

setup_python()

Invisibly returns TRUE when the environment has been configured.

reticulate::configure_environment()

Generates random variates from the server-side copula model previously
stored in PU_copula_model.

simulateCopulaDS(n)

| n | Number of samples to generate. |
|---|---|

A matrix of simulated copula draws.