bob <- 1:10
bob
[1] 1 2 3 4 5 6 7 8 9 10
Kelly Bodwin
October 31, 2024
Random thought today: There are a lot of ways to “check in” on your intermediate objects in R.
It’s definitely good practice and something I have trouble pushing my students to do. Maybe I need to be more deliberate about how to do it.
So, there’s the classic way of just printing it out. This is fine. I tend to peek at my objects this way, except I do the peeking in the console… I can NOT get my students to adopt a workflow that pops between notebook and console though. Maybe it’s not the best.
I also see this in some folks’ code:
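A minimal sketch of the pattern I mean, assuming it is the semicolon one that shows up in the ggplot example (the name `bob` is just a placeholder):

```r
# Assignment and a bare print sharing one line, separated by a semicolon
bob <- 1:10; bob
```

The second statement is easy to overlook because it trails the assignment instead of getting its own line.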
I especially see it in plotting with ggplot for some reason:
library(tidyverse)
library(palmerpenguins)
p <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(); p
I do not like this at all. Perhaps it’s a bias against semicolons (I thought I left those behind when I gave up on Java). But I don’t like the print statement being hidden on a line with code.
Now, a student taught me this trick, and I think it’s super rad:
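A minimal sketch, assuming the trick in question is the parentheses idiom, where wrapping an assignment in `( )` makes R print the value it just stored (again, `bob` is only a placeholder name):

```r
# Wrapping the assignment in parentheses makes the result visible,
# so the value is both stored in bob and printed
(bob <- 1:10)
```

This works because a plain assignment returns its value invisibly, and the surrounding parentheses turn that visibility back on.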
But it does get a bit inelegant/cumbersome with multiline code and pipelines in my opinion:
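Here is roughly what I have in mind, sketched under the assumption that the trick is the parentheses one. The pipeline and the name `adelie_counts` are made up for illustration:

```r
library(dplyr)
library(palmerpenguins)

# The opening parenthesis sits before the assignment, and the closing
# one trails alone at the very end of the pipeline
(adelie_counts <- penguins %>%
  filter(species == "Adelie") %>%
  count(island)
)
```

The two parentheses end up several lines apart, so it is easy to lose track of one when editing the steps in between.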
Speaking of pipelines, I’m on the fence about the best way to “check in” on progress of a long pipeline. I tend to just highlight part of the pipeline and Cmd+Enter to run that section. But that’s kinda unreproducible and also gets annoying if I’m doing it many times.
Students tend to delete or comment out segments of pipelines, and I do NOT like this; it’s so unwieldy.
magrittr has a cute pipe, %T>%, that means “do this next step, but pass along the original input rather than that step’s result”, which we can use in conjunction with print() to check stuff.
It’s almost perfect, but the necessity of print() and the subtlety of the %T>% pipe (it’s easy to miss) annoy me a bit.
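A sketch of how that could look, assuming the same Adelie filter used elsewhere in this post (the name `adelie_summary` is only there for illustration):

```r
library(dplyr)
library(magrittr)        # provides the tee pipe %T>%
library(palmerpenguins)

adelie_summary <- penguins %>%
  filter(species == "Adelie") %T>%  # tee: the next call runs for its side effect
  print() %>%                       # prints the filtered rows; the same rows flow on
  summarize(n_rows = n())

adelie_summary
```

The filtered tibble is printed mid-pipeline, and summarize() still receives the filtered data rather than whatever print() returned.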
# A tibble: 152 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# A tibble: 1 × 1
n_rows
<int>
1 152
(Honestly, I wish we in the tidyverse sphere used the other magrittr pipes more. Maybe another mini-post one day…)
Finally, you might just use glimpse() in a pipeline, since it invisibly returns the data frame as well as printing a summary, so it can flow through the pipeline:
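A sketch of that, assuming the same Adelie filter as before (`adelie_summary` is just an illustrative name):

```r
library(dplyr)
library(palmerpenguins)

adelie_summary <- penguins %>%
  filter(species == "Adelie") %>%
  glimpse() %>%              # prints a column-wise summary, returns its input invisibly
  summarize(n_rows = n())

adelie_summary
```

No special pipe or print() call is needed here, because glimpse() already hands its input back.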
Rows: 152
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# A tibble: 1 × 1
n_rows
<int>
1 152
My problem here is simply that I don’t love glimpse()… if I’m verifying a pipeline step, I’d rather just see the raw data.
Googling around led me to textreadr::peek(), which seems to be exactly that:
# remotes::install_github("trinker/textreadr")
library(textreadr)
penguins %>%
  filter(species == "Adelie") %>%
  peek() %>%
  summarize(n_rows = n())
Table: [152 x 8]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 Adelie Torgersen <NA> <NA> <NA> <NA> <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
.. ... ... ... ... ... ... ... ...
# A tibble: 1 × 1
n_rows
<int>
1 152
It’s not on CRAN anymore (sadface). Also, tibbles get downgraded to data.frames. But still, I like this a lot.
So, no perfect solution for pipelines that I know of. And all these options will also print their output in a rendered qmd/Rmd - so they have the same issue as print debugging in that you have to remember to go back and remove code when you are finished developing.
I think my personal wishlist would be, in no particular order:
A dplyr::peek() function.
A “print and pass” pipe that could be used in a pipeline without needing a function.
Some kind of interactive tool in Quarto that would let you flag lines to be previewed upon chunk run, without them being printed in a rendered doc.
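To make the first wish a bit more concrete, here is a rough sketch of what such a helper could look like. The `peek()` below is hypothetical and user-defined, not an existing dplyr function, and it would mask textreadr::peek() if that package were also loaded:

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper (not part of dplyr): print the first n rows,
# then hand the untouched input back so the pipeline can continue
peek <- function(.data, n = 10) {
  print(head(.data, n))
  invisible(.data)
}

adelie_summary <- penguins %>%
  filter(species == "Adelie") %>%
  peek() %>%
  summarize(n_rows = n())
```

Because it uses head() and print(), a tibble stays a tibble when it is printed. It still shares the drawback of every option above: the output shows up in a rendered document unless the call is removed.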
Thoughts? Ideas?