Professional coffee quality ratings — coffee

Cupping scores for batches of green coffee beans, professionally rated by the Coffee Quality Institute, alongside the growing and processing characteristics of each batch. The dataset is included to demonstrate two multitool workflows: assessing the robustness of a focal effect against arbitrary analytic decisions, and systematically modeling predictors against outcomes across subgroups.

Usage

coffee_quality

Format

A data frame with 1,339 rows and 24 variables:

total_cup_points: Overall quality score (0-100); the sum of the ten sensory ratings.
cupper_points: The cupper's holistic overall rating (0-10); a non-composite outcome.
aroma: Aroma rating (0-10).
flavor: Flavor rating (0-10).
aftertaste: Aftertaste rating (0-10).
acidity: Acidity rating (0-10).
body: Body rating (0-10).
balance: Balance rating (0-10).
uniformity: Cup uniformity rating (0-10).
clean_cup: Clean cup rating (0-10).
sweetness: Sweetness rating (0-10).
species: Coffee species, "Arabica" or "Robusta" (heavily imbalanced toward Arabica).
country_of_origin: Country where the beans were grown.
continent_of_origin: Continent in which the country of origin lies.
variety: Cultivar (e.g., "Bourbon", "Typica", "Caturra"); contains missing values.
processing_method: Post-harvest processing (e.g., "Washed / Wet", "Natural / Dry"); contains missing values.
moisture: Moisture content of the green beans, as a proportion; some entries are 0.
category_one_defects: Count of category-one (primary) green-bean defects.
category_two_defects: Count of category-two (secondary) green-bean defects.
quakers: Count of quakers (unripe beans that fail to roast).
unit_of_measurement: Original unit in which altitude was reported, "m" or "ft"; the source of the unit-conversion errors in the altitude columns.
altitude_low_meters: Lower bound of reported growing altitude, in meters.
altitude_high_meters: Upper bound of reported growing altitude, in meters.
altitude_mean_meters: Mean reported growing altitude, in meters; contains missing values and known unit/entry errors (see Details).

Source

Coffee Quality Institute review pages (January 2018), collected by James LeDoux under the MIT License (https://github.com/jldbc/coffee-quality-database) and distributed via the R for Data Science TidyTuesday project, 2020-07-07 (https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-07-07). See the package's LICENSE.note for the bundled data's copyright and license.

Details

Real-world data-quality issues are deliberately preserved rather than cleaned away, because resolving them is meant to be an explicit decision made inside a pipeline (via add_filters() and add_preprocess()) rather than a hidden one baked into the data. In particular:

altitude_mean_meters retains implausibly large values and meter/foot unit mismatches, so that altitude cleaning becomes a demonstrable fork.
variety and processing_method contain missing values, supporting missing-data and filtering decisions.
species is heavily imbalanced toward Arabica, so it is best used as a restriction decision (Arabica-only vs. all) rather than a balanced subgroup.
One row has a total_cup_points of 0, a clear recording error, retained so that excluding it can itself be shown as a defensible filter.
total_cup_points is the deterministic sum of the ten sensory scores and should not be modeled as an outcome of its own components; use cupper_points as a non-composite outcome instead.
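
The checks behind these points can be sketched in base R. This is a minimal illustration rather than the package's own workflow: the rows below are made up, only the column names come from the Format section, the 3000 m cutoff is borrowed from the Examples, and the choice of which ten scores enter the sum (the nine ratings plus cupper_points) is an assumption.

```r
# Made-up rows for illustration only; column names follow the Format section.
toy <- data.frame(
  aroma = c(7.5, 8, 0), flavor = c(7.5, 8, 0), aftertaste = c(7.25, 7.75, 0),
  acidity = c(7.5, 8, 0), body = c(7.5, 7.75, 0), balance = c(7.5, 8, 0),
  uniformity = c(10, 10, 0), clean_cup = c(10, 10, 0), sweetness = c(10, 10, 0),
  cupper_points = c(7.5, 8, 0),
  total_cup_points = c(82.25, 85.5, 0),
  species = c("Arabica", "Arabica", "Robusta"),
  altitude_mean_meters = c(1500, 190164, NA)
)

# Assumption: the ten summed scores are the nine ratings plus cupper_points.
sensory <- c("aroma", "flavor", "aftertaste", "acidity", "body", "balance",
             "uniformity", "clean_cup", "sweetness", "cupper_points")
sum_ok <- isTRUE(all.equal(toy$total_cup_points,
                           unname(rowSums(toy[sensory]))))

# Candidate exclusions, each kept as a separate, visible flag.
zero_score    <- toy$total_cup_points == 0
plausible_alt <- !is.na(toy$altitude_mean_meters) &
  toy$altitude_mean_meters < 3000
arabica_only  <- toy$species == "Arabica"

# One possible combination; arabica_only could be added as a further restriction.
keep <- !zero_score & plausible_alt
toy[keep, c("species", "cupper_points", "altitude_mean_meters")]
```

Treating a missing altitude as an exclusion, as plausible_alt does here, is itself one of the decisions the section describes, so an alternative flag that keeps missing values would be equally reasonable to compare.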

Examples

# A small robustness blueprint: does growing altitude predict cup quality,
# and how sensitive is that to altitude-cleaning and exclusion choices?
coffee_quality |>
  add_filters(altitude_mean_meters < 3000, category_two_defects < 5) |>
  add_variables("altitude", altitude_low_meters, altitude_mean_meters, altitude_high_meters) |>
  add_model("altitude effect", lm(cupper_points ~ {altitude} + moisture))
#> # A tibble: 8 × 6
#>   type      group         code  model_coefs_fn model_fit_fn model_standardize_fn
#>   <chr>     <chr>         <chr> <chr>          <chr>        <chr>               
#> 1 filters   altitude_mea… alti… NA             NA           NA                  
#> 2 filters   altitude_mea… alti… NA             NA           NA                  
#> 3 filters   category_two… cate… NA             NA           NA                  
#> 4 filters   category_two… cate… NA             NA           NA                  
#> 5 variables altitude      alti… NA             NA           NA                  
#> 6 variables altitude      alti… NA             NA           NA                  
#> 7 variables altitude      alti… NA             NA           NA                  
#> 8 models    altitude eff… lm(c… parameters::p… performance… parameters::standar…
# pipe on to expand_decisions() |> analyze_grid() to run the full grid
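
To make the variable fork concrete, the base-R sketch below fits the same model once per altitude measure by hand. It is not how the package expands or runs a grid internally; the rows are made up, and computing the mean altitude as the midpoint of the low and high bounds is an assumption made only for this toy data.

```r
# Made-up rows for illustration only; not taken from coffee_quality.
toy <- data.frame(
  cupper_points        = c(7.2, 7.5, 7.8, 7.4, 8.0, 7.6, 7.9, 7.3),
  moisture             = c(0.11, 0.12, 0.10, 0.11, 0.12, 0.10, 0.11, 0.12),
  altitude_low_meters  = c(900, 1100, 1400, 1000, 1700, 1200, 1600, 950),
  altitude_high_meters = c(1100, 1300, 1600, 1400, 1900, 1500, 1800, 1050)
)
# Assumption for the toy data: mean altitude is the midpoint of the bounds.
toy$altitude_mean_meters <-
  (toy$altitude_low_meters + toy$altitude_high_meters) / 2

altitude_vars <- c("altitude_low_meters", "altitude_mean_meters",
                   "altitude_high_meters")

# Fit cupper_points ~ <altitude measure> + moisture once per measure.
fits <- lapply(altitude_vars, function(v) {
  lm(reformulate(c(v, "moisture"), response = "cupper_points"), data = toy)
})
names(fits) <- altitude_vars

# Collect the altitude slope from each fit for side-by-side comparison.
slopes <- vapply(altitude_vars,
                 function(v) unname(coef(fits[[v]])[v]),
                 numeric(1))
slopes
```

Comparing the three slopes shows how much the focal estimate moves with the choice of altitude measure alone, before any filtering decisions are layered on.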