Summarizes non-categorical variables in a dataframe by grouping them based on specified categorical variables and returns the aggregated result along with the tidyverse code used to generate it.
Arguments
- data
A dataframe or survey design object to be aggregated.
- group_vars
A character vector specifying the variables in data to be used as grouping factors.
- summaries
An unnamed character vector or named list of summary functions to calculate for each group. If unnamed, the elements should be names of summary functions (e.g., "mean", "sd", "iqr"), which are applied to the variables given in vars. If named, the names should correspond to the summary functions to be applied, and each element gives the variables to summarize with that function.
- vars
(Optional) A character vector specifying the names of variables in the dataset for which summary statistics need to be calculated. This argument is ignored if summaries is a named list.
- names
(Optional) A character vector or named list providing name templates for the newly created variables. See details for more information.
- quantiles
(Optional) A numeric vector specifying the desired quantiles (e.g., c(0.25, 0.5, 0.75)). See details for more information.
- dt
A character string representing the name of the date-time variable in the dataset.
- dt_comp
A character string specifying the component of the date-time to use for grouping.
Value
An aggregated dataframe containing the summary statistics for each group, along with the tidyverse code used for the aggregation.
Details
The aggregate_data() function accepts any R function that returns a single-value summary (e.g., mean, var, sd, sum, IQR). By default, new variables are named {var}_{fun}, where {var} is the variable name and {fun} is the summary function used. The user can provide custom names using the names argument, either as a vector of the same length as vars, or as a named list where the names correspond to summary functions (e.g., "mean" or "sd").
The special summary "missing" can be included, which counts the number of missing values in the variable. The default name for this summary is {var}_missing.
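As a rough illustration of what the "missing" summary and the default {var}_{fun} naming amount to, the base R sketch below counts missing values per group and names the resulting column from the template. The helper expand_name() and the toy data frame are hypothetical and exist only for this example; this is not the function's actual implementation.

```r
# Hypothetical helper (not part of the documented API): fill a name
# template by replacing the {var} and {fun} placeholders.
expand_name <- function(template, var, fun) {
    out <- gsub("{var}", var, template, fixed = TRUE)
    gsub("{fun}", fun, out, fixed = TRUE)
}

# Toy data (assumed for illustration): two groups, some missing values in x
df <- data.frame(
    group = c("a", "a", "b", "b", "b"),
    x = c(1, NA, 3, NA, NA)
)

# Count the missing values of x within each group
n_missing <- tapply(df$x, df$group, function(v) sum(is.na(v)))

# Store the counts under the default name {var}_missing
result <- data.frame(group = names(n_missing))
result[[expand_name("{var}_{fun}", "x", "missing")]] <- as.vector(n_missing)

print(result)
```

Swapping the template for something like "n_na_{var}" in the call to expand_name() shows how a custom name supplied through names would change the column label without changing the computed counts.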
If quantiles are requested, the function calculates the specified quantiles (e.g., 25th, 50th, 75th percentiles), creating new variables for each quantile. To customize the names of these variables, use {p} as a placeholder in the names argument, where {p} represents the quantile value. For example, using names = "Q{p}_{var}" will create variables like "Q0.25_Sepal.Length" for the 25th percentile.
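The quantile naming described above can be sketched in base R as follows. This is only an approximation under stated assumptions: it uses stats::quantile() for one group of iris and a hand-written substitution for the {p} and {var} placeholders. How the function itself formats {p} for values other than the documented 0.25 case is an assumption here.

```r
# Quantiles to compute (the probabilities are assumed for illustration)
probs <- c(0.25, 0.5, 0.75)

# Sepal.Length values for a single group of the built-in iris data
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
q_values <- quantile(setosa, probs = probs, na.rm = TRUE, names = FALSE)

# Expand the template once per quantile, replacing {p} and {var}
template <- "Q{p}_{var}"
q_names <- vapply(probs, function(p) {
    out <- gsub("{p}", format(p), template, fixed = TRUE)
    gsub("{var}", "Sepal.Length", out, fixed = TRUE)
}, character(1))

# One named value per requested quantile
result <- setNames(q_values, q_names)
print(result)
```

Each requested quantile yields its own variable, so three probabilities produce three new columns per summarized variable.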
Examples
aggregated <-
    aggregate_data(iris,
        group_vars = c("Species"),
        summaries = c("mean", "sd", "iqr")
    )
code(aggregated)
#> iris |>
#>     dplyr::group_by(Species) |>
#>     dplyr::summarise(
#>         Sepal.Length_mean = mean(Sepal.Length,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Width_mean = mean(Sepal.Width,
#>             na.rm = TRUE
#>         ),
#>         Petal.Length_mean = mean(Petal.Length,
#>             na.rm = TRUE
#>         ),
#>         Petal.Width_mean = mean(Petal.Width,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Length_sd = sd(Sepal.Length,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Width_sd = sd(Sepal.Width,
#>             na.rm = TRUE
#>         ),
#>         Petal.Length_sd = sd(Petal.Length,
#>             na.rm = TRUE
#>         ),
#>         Petal.Width_sd = sd(Petal.Width,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Length_iqr = IQR(Sepal.Length,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Width_iqr = IQR(Sepal.Width,
#>             na.rm = TRUE
#>         ),
#>         Petal.Length_iqr = IQR(Petal.Length,
#>             na.rm = TRUE
#>         ),
#>         Petal.Width_iqr = IQR(Petal.Width,
#>             na.rm = TRUE
#>         ),
#>         .groups = "drop"
#>     )
#>
head(aggregated)
#> # A tibble: 3 × 13
#>   Species  Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
#>   <fct>                <dbl>            <dbl>             <dbl>            <dbl>
#> 1 setosa                5.01             3.43              1.46            0.246
#> 2 versico…              5.94             2.77              4.26            1.33
#> 3 virgini…              6.59             2.97              5.55            2.03
#> # ℹ 8 more variables: Sepal.Length_sd <dbl>, Sepal.Width_sd <dbl>,
#> # Petal.Length_sd <dbl>, Petal.Width_sd <dbl>, Sepal.Length_iqr <dbl>,
#> # Sepal.Width_iqr <dbl>, Petal.Length_iqr <dbl>, Petal.Width_iqr <dbl>