Aggregate data by categorical variables

Summarizes the non-categorical variables in a dataframe within groups defined by the specified categorical variables, and returns the aggregated result along with the tidyverse code used to generate it.

Usage

aggregate_data(
  data,
  group_vars,
  summaries,
  vars = NULL,
  names = NULL,
  quantiles = c(0.25, 0.75)
)

aggregate_dt(
  data,
  dt,
  dt_comp,
  group_vars = NULL,
  summaries,
  vars = NULL,
  names = NULL,
  quantiles = c(0.25, 0.75)
)

Arguments

data: A dataframe or survey design object to be aggregated.
group_vars: A character vector specifying the variables in data to be used as grouping factors.
summaries: An unnamed character vector or named list of summary functions to calculate for each group. If unnamed, the vector elements should be names of summary functions (e.g., "mean", "sd", "iqr"), which are applied to the variables given in vars, as in the example below. If named, the names should correspond to the summary functions (e.g., "mean", "sd", "iqr") to be applied to each variable.
vars: (Optional) A character vector specifying the names of variables in the dataset for which summary statistics need to be calculated. This argument is ignored if summaries is a named list.
names: (Optional) A character vector or named list providing name templates for the newly created variables. See details for more information.
quantiles: (Optional) A numeric vector specifying the desired quantiles (e.g., c(0.25, 0.5, 0.75)). See details for more information.
dt: A character string representing the name of the date-time variable in the dataset.
dt_comp: A character string specifying the component of the date-time to use for grouping.

Value

An aggregated dataframe containing the summary statistics for each group, along with the tidyverse code used for the aggregation.

Details

The aggregate_data() function accepts any R function that returns a single-value summary (e.g., mean, var, sd, sum, IQR). By default, new variables are named {var}_{fun}, where {var} is the variable name and {fun} is the summary function used. The user can provide custom names using the names argument, either as a vector of the same length as vars, or as a named list where the names correspond to summary functions (e.g., "mean" or "sd").
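The default {var}_{fun} naming can be illustrated with a small base R sketch. This is not the package's implementation, only an approximation of how the column names and values line up; the choice of variables and functions is an assumption made for illustration.

```r
# Base R sketch of the default {var}_{fun} naming (illustrative only).
vars <- c("Sepal.Length", "Sepal.Width")
funs <- list(mean = mean, sd = sd)

# Split the selected columns by the grouping variable.
pieces <- split(iris[vars], iris$Species)

# For each group, apply every function to every variable and
# name the result "<var>_<fun>".
rows <- lapply(pieces, function(piece) {
    out <- list()
    for (fun in names(funs)) {
        for (var in vars) {
            out[[paste(var, fun, sep = "_")]] <-
                funs[[fun]](piece[[var]], na.rm = TRUE)
        }
    }
    as.data.frame(out)
})

# Stack the per-group rows and add the group label as a column.
result <- cbind(Species = names(rows), do.call(rbind, rows))
rownames(result) <- NULL
names(result)
```

The loop order (functions outermost) mirrors the column order in the example output on this page, where all means come before all standard deviations.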

The special summary "missing" can be included, which counts the number of missing values in the variable. The default name for this summary is {var}_missing.
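As a rough illustration of what the "missing" summary reports, the base R sketch below counts missing values per group and applies the {var}_missing naming. The built-in airquality data and the Month grouping variable are assumptions chosen for illustration, and the code is not the package's own implementation.

```r
# Base R sketch: count missing values per group (illustrative only).
vars <- c("Ozone", "Solar.R")

# The data-frame method of aggregate() passes NA values through to FUN
# (unlike the formula method, which drops incomplete rows by default).
missing_counts <- aggregate(
    airquality[vars],
    by = list(Month = airquality$Month),
    FUN = function(x) sum(is.na(x))
)

# Rename the summary columns following the {var}_missing pattern.
names(missing_counts)[-1] <- paste0(vars, "_missing")
missing_counts
```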

If quantiles are requested, the function calculates the specified quantiles (e.g., 25th, 50th, 75th percentiles), creating new variables for each quantile. To customize the names of these variables, use {p} as a placeholder in the names argument, where {p} represents the quantile value. For example, using names = "Q{p}_{var}" will create variables like "Q0.25_Sepal.Length" for the 25th percentile.
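The sketch below shows, in base R, how a template such as "Q{p}_{var}" could expand into one column name per quantile. The helper expand_name() is hypothetical and exists only for this illustration; how the package itself fills in the placeholders may differ.

```r
# Base R sketch of expanding a name template such as "Q{p}_{var}".
# expand_name() is a hypothetical helper, not part of the package.
expand_name <- function(template, var, p) {
    # Substitute the variable name once, then each quantile value in turn.
    out <- gsub("{var}", var, template, fixed = TRUE)
    vapply(
        p,
        function(q) gsub("{p}", format(q), out, fixed = TRUE),
        character(1)
    )
}

probs <- c(0.25, 0.75)
var <- "Sepal.Length"

# One row of quantiles per species, one column per requested quantile.
q <- do.call(
    rbind,
    lapply(split(iris[[var]], iris$Species), quantile, probs = probs)
)
colnames(q) <- expand_name("Q{p}_{var}", var, probs)
q
```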

Functions

aggregate_dt(): Aggregate data by dates and times
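To give a feel for grouping by a date-time component, the base R sketch below derives a year-month key from a date column and summarises within it. The year-month component, the format string, and the airquality-based date column are all assumptions for illustration; the values accepted by dt_comp are not listed on this page, and this is not how aggregate_dt() is implemented.

```r
# Base R sketch: group by one component of a date (illustrative only).
# Build a Date column from airquality's Month and Day (recorded in 1973).
aq <- airquality
aq$date <- as.Date(sprintf("1973-%02d-%02d", aq$Month, aq$Day))

# Extract the year-month component; it plays the role of the grouping key.
aq$date_month <- format(aq$date, "%Y-%m")

# Summarise Temp within each year-month.
monthly <- aggregate(
    aq["Temp"],
    by = list(date_month = aq$date_month),
    FUN = mean
)
names(monthly)[2] <- "Temp_mean"
monthly
```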

Author

Tom Elliott, Owen Jin, Zhaoming Su

Examples

aggregated <-
    aggregate_data(iris,
        group_vars = c("Species"),
        summaries = c("mean", "sd", "iqr")
    )
code(aggregated)
#> iris |>
#>     dplyr::group_by(Species) |>
#>     dplyr::summarise(
#>         Sepal.Length_mean = mean(Sepal.Length,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Width_mean = mean(Sepal.Width,
#>             na.rm = TRUE
#>         ),
#>         Petal.Length_mean = mean(Petal.Length,
#>             na.rm = TRUE
#>         ),
#>         Petal.Width_mean = mean(Petal.Width,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Length_sd = sd(Sepal.Length,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Width_sd = sd(Sepal.Width,
#>             na.rm = TRUE
#>         ),
#>         Petal.Length_sd = sd(Petal.Length,
#>             na.rm = TRUE
#>         ),
#>         Petal.Width_sd = sd(Petal.Width,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Length_iqr = IQR(Sepal.Length,
#>             na.rm = TRUE
#>         ),
#>         Sepal.Width_iqr = IQR(Sepal.Width,
#>             na.rm = TRUE
#>         ),
#>         Petal.Length_iqr = IQR(Petal.Length,
#>             na.rm = TRUE
#>         ),
#>         Petal.Width_iqr = IQR(Petal.Width,
#>             na.rm = TRUE
#>         ),
#>         .groups = "drop"
#>     )
#> 
head(aggregated)
#> # A tibble: 3 × 13
#>   Species  Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
#>   <fct>                <dbl>            <dbl>             <dbl>            <dbl>
#> 1 setosa                5.01             3.43              1.46            0.246
#> 2 versico…              5.94             2.77              4.26            1.33 
#> 3 virgini…              6.59             2.97              5.55            2.03 
#> # ℹ 8 more variables: Sepal.Length_sd <dbl>, Sepal.Width_sd <dbl>,
#> #   Petal.Length_sd <dbl>, Petal.Width_sd <dbl>, Sepal.Length_iqr <dbl>,
#> #   Sepal.Width_iqr <dbl>, Petal.Length_iqr <dbl>, Petal.Width_iqr <dbl>