Compute Weighted Percentiles Across a List of Data Frames (with optional grouping)
run_weighted_percentiles.Rd
Applies the compute_weighted_percentiles() function to each data frame in a named list, using a specified variable and optional weight.
Supports optional grouping via a categorical by variable, and returns either percentile values or share values, depending on the share argument.
Arguments
- data_list
A named list of data frames (e.g., across countries or years).
- var_name
A string specifying the variable name (e.g., "dhi", "pilabour") to compute percentiles or shares for.
- wgt_name
An optional string specifying the weight variable to be used. If NULL, equal weights are assumed.
- probs
A numeric vector of probabilities with values between 0 and 1, defining percentiles (if share = FALSE) or the brackets between which shares are computed (if share = TRUE).
- type
A character string indicating which percentile definition to use. Either "type_4" (default; continuous sample quantile based on linear interpolation of the empirical CDF) or "type_2" (discontinuous sample quantile; inverse of the empirical distribution function with averaging at discontinuities, as used in Stata commands such as collapse and _pctile).
- share
Logical. If TRUE, returns income shares between percentile brackets instead of the percentile values. Default is FALSE. Note: shares are always computed with type = "type_4" (interpolation), regardless of the type argument; share = TRUE cannot be combined with type = "type_2".
- na.rm
Logical. If TRUE, missing values in var_name or wgt_name are removed.
- by
Optional string giving the name of a categorical variable to split the data within each data frame before computing statistics.
Value
A named list of data frames. If by is used, the result is a nested list in which the outer list is indexed by data frame and the inner list by subgroup, except when share = FALSE and length(probs) == 1.
If by is NULL, the result is a simple list of numeric vectors (one per data frame).
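The flat and nested return shapes described above can be illustrated with a minimal sketch. The helper apply_by_sketch() and the toy data are hypothetical and are not part of the package; base stats::quantile() (unweighted) is used only as a stand-in statistic, so this shows the list structure rather than the package's actual computation.

```r
# Hypothetical helper: applies a statistic to each data frame in a named list,
# optionally splitting each data frame by a categorical variable first.
apply_by_sketch <- function(data_list, var_name, by = NULL,
                            probs = c(0.25, 0.5, 0.75)) {
  # Stand-in statistic (unweighted); the real function uses weighted percentiles.
  stat <- function(df) {
    stats::quantile(df[[var_name]], probs = probs, type = 4, na.rm = TRUE)
  }
  lapply(data_list, function(df) {
    if (is.null(by)) {
      return(stat(df))                     # one numeric vector per data frame
    }
    lapply(split(df, df[[by]]), stat)      # inner list: one vector per subgroup
  })
}

# Toy input (made-up values and names, for illustration only)
toy <- list(
  aa16 = data.frame(dhi = c(10, 20, 30, 40),
                    region_c = c("north", "north", "south", "south")),
  bb16 = data.frame(dhi = c(5, 15, 25, 35),
                    region_c = c("north", "south", "north", "south"))
)

flat   <- apply_by_sketch(toy, "dhi")                   # list of numeric vectors
nested <- apply_by_sketch(toy, "dhi", by = "region_c")  # list of lists
str(nested)
```

In this sketch, nested$aa16$north holds the values for one subgroup of one data frame, mirroring the outer-by-data-frame, inner-by-subgroup layout described above.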
Details
Percentiles are computed using a weighted version of quantile definition 4 from Hyndman and Fan (1996) by default, or quantile definition 2 if specified.
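As a rough illustration of the two definitions, the sketch below shows one common way to weight them: definition 4 interpolates linearly along the cumulative weight shares, and definition 2 inverts the weighted empirical distribution function, averaging at exact jumps. The function weighted_quantile_sketch() is hypothetical; the package's internal formulas may differ in detail. It assumes strictly positive weights and at least two observations.

```r
# Hypothetical sketch of weighted quantiles under definitions 4 and 2.
weighted_quantile_sketch <- function(x, w = NULL, probs = 0.5,
                                     type = c("type_4", "type_2")) {
  type <- match.arg(type)
  if (is.null(w)) w <- rep(1, length(x))        # equal weights if none supplied
  keep <- !is.na(x) & !is.na(w) & w > 0         # assumption: drop NA and zero weights
  ord <- order(x[keep])
  xs <- x[keep][ord]
  cw <- cumsum(w[keep][ord]) / sum(w[keep])     # cumulative weight shares

  if (type == "type_4") {
    # Linear interpolation of the empirical CDF; below the first point,
    # rule = 2 returns the smallest observation.
    return(stats::approx(x = cw, y = xs, xout = probs, rule = 2)$y)
  }

  # type_2: smallest observation whose cumulative share reaches p,
  # averaged with the next observation when p falls exactly on a jump.
  n <- length(xs)
  tol <- 1e-12
  vapply(probs, function(p) {
    k <- which(cw >= p - tol)[1]
    if (abs(cw[k] - p) < tol && k < n) mean(xs[k:(k + 1)]) else xs[k]
  }, numeric(1))
}

x <- c(3, 1, 5, 2, 4)
weighted_quantile_sketch(x, probs = c(0.25, 0.5), type = "type_4")
weighted_quantile_sketch(x, probs = c(0.4, 0.5), type = "type_2")
```

With equal weights, both branches reduce to base R's quantile() with type = 4 and type = 2 respectively, which gives a convenient sanity check.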
When share = TRUE, the function estimates Lorenz ordinates by taking quantiles of the running sum of the ordered outcome variable (divided by its total), using quantile definition 4 only.
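The share computation can be sketched as follows. The function weighted_shares_sketch() is hypothetical: it builds the cumulative weighted outcome share, interpolates it linearly at probs to obtain Lorenz ordinates, and differences adjacent ordinates to get bracket shares. Anchoring the curve at (0, 0) is an assumption of this sketch and may not match the package's handling of the lowest bracket; strictly positive weights are also assumed.

```r
# Hypothetical sketch of income shares between percentile brackets.
weighted_shares_sketch <- function(x, w = NULL, probs = seq(0, 1, 0.1)) {
  if (is.null(w)) w <- rep(1, length(x))       # equal weights if none supplied
  keep <- !is.na(x) & !is.na(w) & w > 0        # assumption: drop NA and zero weights
  ord <- order(x[keep])
  xs <- x[keep][ord]
  ws <- w[keep][ord]

  cw <- cumsum(ws) / sum(ws)                   # cumulative population share
  cy <- cumsum(ws * xs) / sum(ws * xs)         # running sum of outcome / total

  # Lorenz ordinates by linear interpolation (assumed anchor at the origin)
  lorenz <- stats::approx(x = c(0, cw), y = c(0, cy),
                          xout = probs, rule = 2)$y

  diff(lorenz)                                 # share held by each bracket
}

# Bottom half vs. top half of a toy distribution
weighted_shares_sketch(c(1, 2, 3, 4), probs = c(0, 0.5, 1))
```

When probs runs from 0 to 1, the shares in this sketch sum to 1 by construction.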
Examples
if (FALSE) { # \dontrun{
  # dplyr provides the pipe (%>%) and filter() used below
  library(dplyr)
  # Import data
  my_data_list <- lissyuse(data = "es", from = 2016)
  # Retrieve the percentile estimates
  percentiles_result <- my_data_list %>%
    run_weighted_percentiles(
      var_name = "dhi",
      wgt_name = "hpopwgt",
      probs = seq(0.1, 0.9, 0.1),
      type = "type_2",
      na.rm = TRUE
    )
  print(percentiles_result)
  # Compute the distribution shares
  shares_result <- run_weighted_percentiles(
    data_list = my_data_list,
    var_name = "dhi",
    wgt_name = "hpopwgt",
    probs = seq(0, 1, 0.1),
    share = TRUE,
    na.rm = TRUE
  )
  print(shares_result)
  # Using the by option: median by region among the employed
  by_result_median <- run_weighted_percentiles(
    data_list = purrr::map(my_data_list[1:2], ~ .x %>% filter(emp == 1)),
    var_name = "pi11",
    wgt_name = "ppopwgt",
    probs = 0.5,
    type = "type_4",
    na.rm = TRUE,
    by = "region_c"
  )
  print(by_result_median)
} # }