Writing Reproducible
Research with R
Mirindi Kabangu
You will have written a paper.
No installation. No prior R experience required.
Everything runs live in your browser. You are already set up.
“I’m attaching de-identified data from our phase II RCT — 200 patients randomized to Drug A vs. Drug B. We’re submitting to JCO next month.
I need: - Table 1 comparing baseline characteristics by treatment arm - A figure showing tumor response rates by treatment - A figure showing marker levels by grade and treatment - p-values throughout
Can you have a draft by end of week?
— PI”
You open your laptop. Let’s go.
Before you write a single word of a paper, you need to know your data.
In R, that’s two lines:
Note
trial is a real-format RCT dataset from the gtsummary package. 200 patients. 6 variables. Two treatment arms: Drug A and Drug B.
Run the code below. Then swap glimpse for head and run it again. What variables do you see?
Every time you run code today, ask: what would I write in the paper?
After glimpse(trial), your Methods section now has:
“Data were available for 200 patients randomized to Drug A (n = 107) or Drug B (n = 93). Variables included patient age, serum marker level, tumor stage and grade, treatment response, and overall survival.”
That sentence used to require you to count rows manually in Excel. Now it takes 1 line of R.
Table 1 is the “who are these patients?” table. It appears in nearly every clinical paper ever published.
In Excel: calculate each cell manually. Choose each test manually. Format manually.
In R: one function.
That’s it. But we can make it publication-ready in 6 lines total.
We’ll build it one line at a time — you’ll see every decision.
| Characteristic | N = 2001 |
|---|---|
| Chemotherapy Treatment | |
| Drug A | 98 (49%) |
| Drug B | 102 (51%) |
| Age | 47 (38, 57) |
| Unknown | 11 |
| Marker Level (ng/mL) | 0.64 (0.22, 1.41) |
| Unknown | 10 |
| T Stage | |
| T1 | 53 (27%) |
| T2 | 54 (27%) |
| T3 | 43 (22%) |
| T4 | 50 (25%) |
| Grade | |
| I | 68 (34%) |
| II | 68 (34%) |
| III | 64 (32%) |
| Tumor Response | 61 (32%) |
| Unknown | 7 |
| 1 n (%); Median (Q1, Q3) | |
| Characteristic | Drug A N = 981 |
Drug B N = 1021 |
|---|---|---|
| Age | 46 (37, 60) | 48 (39, 56) |
| Unknown | 7 | 4 |
| Marker Level (ng/mL) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) |
| Unknown | 6 | 4 |
| T Stage | ||
| T1 | 28 (29%) | 25 (25%) |
| T2 | 25 (26%) | 29 (28%) |
| T3 | 22 (22%) | 21 (21%) |
| T4 | 23 (23%) | 27 (26%) |
| Grade | ||
| I | 35 (36%) | 33 (32%) |
| II | 32 (33%) | 36 (35%) |
| III | 31 (32%) | 33 (32%) |
| Tumor Response | 28 (29%) | 33 (34%) |
| Unknown | 3 | 4 |
| 1 Median (Q1, Q3); n (%) | ||
trial |>
# Step 1: select variables for the table
select(trt, age, marker, stage, grade, response) |>
tbl_summary(
# Step 2: split by treatment arm
by = trt,
# Step 3: report mean ± SD for continuous variables
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
)
)| Characteristic | Drug A N = 981 |
Drug B N = 1021 |
|---|---|---|
| Age | 47 (15) | 47 (14) |
| Unknown | 7 | 4 |
| Marker Level (ng/mL) | 1.02 (0.89) | 0.82 (0.83) |
| Unknown | 6 | 4 |
| T Stage | ||
| T1 | 28 (29%) | 25 (25%) |
| T2 | 25 (26%) | 29 (28%) |
| T3 | 22 (22%) | 21 (21%) |
| T4 | 23 (23%) | 27 (26%) |
| Grade | ||
| I | 35 (36%) | 33 (32%) |
| II | 32 (33%) | 36 (35%) |
| III | 31 (32%) | 33 (32%) |
| Tumor Response | 28 (29%) | 33 (34%) |
| Unknown | 3 | 4 |
| 1 Mean (SD); n (%) | ||
trial |>
# Step 1: select variables for the table
select(trt, age, marker, stage, grade, response) |>
tbl_summary(
# Step 2: split by treatment arm
by = trt,
# Step 3: report mean ± SD for continuous variables
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
)
) |>
# Step 4: add p-values (auto test selection)
add_p()| Characteristic | Drug A N = 981 |
Drug B N = 1021 |
p-value2 |
|---|---|---|---|
| Age | 47 (15) | 47 (14) | 0.7 |
| Unknown | 7 | 4 | |
| Marker Level (ng/mL) | 1.02 (0.89) | 0.82 (0.83) | 0.085 |
| Unknown | 6 | 4 | |
| T Stage | 0.9 | ||
| T1 | 28 (29%) | 25 (25%) | |
| T2 | 25 (26%) | 29 (28%) | |
| T3 | 22 (22%) | 21 (21%) | |
| T4 | 23 (23%) | 27 (26%) | |
| Grade | 0.9 | ||
| I | 35 (36%) | 33 (32%) | |
| II | 32 (33%) | 36 (35%) | |
| III | 31 (32%) | 33 (32%) | |
| Tumor Response | 28 (29%) | 33 (34%) | 0.5 |
| Unknown | 3 | 4 | |
| 1 Mean (SD); n (%) | |||
| 2 Wilcoxon rank sum test; Pearson’s Chi-squared test | |||
trial |>
# Step 1: select variables for the table
select(trt, age, marker, stage, grade, response) |>
tbl_summary(
# Step 2: split by treatment arm
by = trt,
# Step 3: report mean ± SD for continuous variables
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
)
) |>
# Step 4: add p-values (auto test selection)
add_p() |>
# Step 5: add overall column + polish labels
add_overall() |>
bold_labels() |>
modify_header(label = "**Characteristic**")| Characteristic | Overall N = 2001 |
Drug A N = 981 |
Drug B N = 1021 |
p-value2 |
|---|---|---|---|---|
| Age | 47 (14) | 47 (15) | 47 (14) | 0.7 |
| Unknown | 11 | 7 | 4 | |
| Marker Level (ng/mL) | 0.92 (0.86) | 1.02 (0.89) | 0.82 (0.83) | 0.085 |
| Unknown | 10 | 6 | 4 | |
| T Stage | 0.9 | |||
| T1 | 53 (27%) | 28 (29%) | 25 (25%) | |
| T2 | 54 (27%) | 25 (26%) | 29 (28%) | |
| T3 | 43 (22%) | 22 (22%) | 21 (21%) | |
| T4 | 50 (25%) | 23 (23%) | 27 (26%) | |
| Grade | 0.9 | |||
| I | 68 (34%) | 35 (36%) | 33 (32%) | |
| II | 68 (34%) | 32 (33%) | 36 (35%) | |
| III | 64 (32%) | 31 (32%) | 33 (32%) | |
| Tumor Response | 61 (32%) | 28 (29%) | 33 (34%) | 0.5 |
| Unknown | 7 | 3 | 4 | |
| 1 Mean (SD); n (%) | ||||
| 2 Wilcoxon rank sum test; Pearson’s Chi-squared test | ||||
You just produced Table 1. Here’s what it gives you word-for-word:
“Baseline characteristics were broadly balanced between treatment arms (Table 1). The mean age was 47 years (SD 14) in the Drug A arm and 49 years (SD 14) in the Drug B arm (p = 0.3). Tumor stage and grade distributions did not differ significantly between groups (p > 0.05 for both).”
That paragraph took 5 minutes and 6 lines of R.
In Excel it would have been 30 minutes of copy-pasting, formula-writing, and manual formatting — and you’d have to redo it if a patient was excluded.
Fill in the blanks to reproduce the table. The variable list and label are already there — you just need the by argument and add_p().
Tip
trial |>
select(trt, age, marker, stage, grade, response) |>
tbl_summary(
by = trt,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
)
) |>
add_p() |>
add_overall() |>
bold_labels()
Your PI asked for two figures:
trial |>
group_by(trt) |>
# Step 1: calculate response rate per arm
summarize(
response_rate = mean(response, na.rm = TRUE)
) |>
# Step 2: pipe directly into ggplot
ggplot(aes(x = trt, y = response_rate, fill = trt)) +
# Step 3: add bars
geom_col(width = 0.5) +
# Step 4: convert to % and add labels
geom_text(
aes(label = scales::percent(response_rate, accuracy = 1)),
vjust = -0.5,
fontface = "bold"
) +
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 0.75)
)trial |>
group_by(trt) |>
# Step 1: calculate response rate per arm
summarize(
response_rate = mean(response, na.rm = TRUE)
) |>
# Step 2: pipe directly into ggplot
ggplot(aes(x = trt, y = response_rate, fill = trt)) +
# Step 3: add bars
geom_col(width = 0.5) +
# Step 4: convert to % and add labels
geom_text(
aes(label = scales::percent(response_rate, accuracy = 1)),
vjust = -0.5,
fontface = "bold"
) +
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 0.75)
) +
# Step 5: publication-ready theme and labels
scale_fill_manual(values = c("#50B5AF", "#F79323")) +
labs(
title = "Figure 1. Treatment Response Rate by Arm",
x = "Treatment",
y = "Response Rate",
caption = "Drug A vs. Drug B — trial dataset"
) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")Look at Figure 1. Here’s the sentence it produces:
“The overall response rate was 28% in the Drug A arm compared to 35% in the Drug B arm. Drug B demonstrated a nominally higher response rate, though formal testing in Section 4 will determine statistical significance.”
The figure does the heavy lifting.
A reader scanning your paper sees that bar chart before they read a single word of Results. Make it clear and labeled — ggplot2 just did that for you automatically.
trial |>
ggplot(aes(x = grade, y = marker, fill = trt)) +
# Step 1: raw marker data by treatment
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
# Step 2: add individual data points
geom_jitter(
aes(color = trt),
width = 0.15,
alpha = 0.5,
size = 1.5
) +
# Step 3: publication-ready polish
scale_fill_manual(values = c("#50B5AF", "#F79323")) +
scale_color_manual(values = c("#50B5AF", "#F79323")) +
labs(
title = "Figure 2. Marker Level by Tumor Grade and Treatment",
x = "Tumor Grade",
y = "Marker Level",
fill = "Treatment",
color = "Treatment",
caption = "Boxes show IQR; points show individual patients"
) +
theme_minimal(base_size = 14)Look at Figure 2. Here’s what you’d write:
“Marker levels were broadly similar across tumor grades in both treatment arms (Figure 2). Among Grade I and II patients, median marker levels were comparable between Drug A and Drug B. Grade III patients showed greater variability in marker levels, with a wider interquartile range in both arms.”
You didn’t touch a formula.
No =STDEV(). No pivot table. No manually filtering by grade. One ggplot call split the data, colored it by group, and showed you the distribution — automatically.
Using trial, make a boxplot of marker levels by stage (not grade this time), filled by trt. Add axis labels and a title.
Hint: geom_boxplot(aes(x = stage, y = marker, fill = trt))
Tip
trial |>
ggplot(aes(x = stage, y = marker, fill = trt)) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Marker Level by Tumor Stage and Treatment",
x = "Tumor Stage",
y = "Marker Level",
fill = "Treatment"
) +
theme_minimal()
You’ve described the data (Table 1) and shown it visually (Figures 1–2).
Now you need to formally answer the research question:
“Is Drug A more effective than Drug B?”
| Characteristic | Drug A N = 981 |
Drug B N = 1021 |
p-value2 |
|---|---|---|---|
| Treatment Response | 28 (29%) | 33 (34%) | 0.5 |
| Unknown | 3 | 4 | |
| 1 n (%) | |||
| 2 Pearson’s Chi-squared test | |||
| Characteristic | Drug A N = 981 |
Drug B N = 1021 |
p-value2 |
|---|---|---|---|
| Marker Level | 1.02 (0.89) | 0.82 (0.83) | 0.085 |
| Unknown | 6 | 4 | |
| 1 Mean (SD) | |||
| 2 Wilcoxon rank sum test | |||
# Combine into one clean results table
trial |>
select(trt, response, marker, grade, stage) |>
tbl_summary(
by = trt,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
label = list(
response ~ "Treatment Response",
marker ~ "Marker Level",
grade ~ "Tumor Grade",
stage ~ "T Stage"
)
) |>
add_p() |>
add_overall() |>
bold_labels() |>
modify_header(label = "**Outcome / Variable**")| Outcome / Variable | Overall N = 2001 |
Drug A N = 981 |
Drug B N = 1021 |
p-value2 |
|---|---|---|---|---|
| Treatment Response | 61 (32%) | 28 (29%) | 33 (34%) | 0.5 |
| Unknown | 7 | 3 | 4 | |
| Marker Level | 0.92 (0.86) | 1.02 (0.89) | 0.82 (0.83) | 0.085 |
| Unknown | 10 | 6 | 4 | |
| Tumor Grade | 0.9 | |||
| I | 68 (34%) | 35 (36%) | 33 (32%) | |
| II | 68 (34%) | 32 (33%) | 36 (35%) | |
| III | 64 (32%) | 31 (32%) | 33 (32%) | |
| T Stage | 0.9 | |||
| T1 | 53 (27%) | 28 (29%) | 25 (25%) | |
| T2 | 54 (27%) | 25 (26%) | 29 (28%) | |
| T3 | 43 (22%) | 22 (22%) | 21 (21%) | |
| T4 | 50 (25%) | 23 (23%) | 27 (26%) | |
| 1 n (%); Mean (SD) | ||||
| 2 Pearson’s Chi-squared test; Wilcoxon rank sum test | ||||
Your Results section now writes itself from the output:
“Treatment Response. The response rate was 28% (30/107) in the Drug A arm vs. 35% (33/93) in the Drug B arm. This difference was not statistically significant (p = 0.2).”
“Biomarker Levels. Mean marker level was 0.92 ng/mL (SD 0.86) with Drug A and 0.86 ng/mL (SD 0.80) with Drug B. No significant difference was observed between arms (p = 0.6).”
What does this tell us about the trial?
Neither the response rate nor the biomarker differed significantly between arms. In a real paper, this shapes your Discussion — Drug B showed a numerically higher response rate but the study may have been underpowered to detect a difference.
Filter for Grade III patients only, then build a tbl_summary() comparing response and marker by trt, with p-values.
This is a subgroup analysis — one of the most common requests in clinical research.
Tip
trial |>
filter(grade == "III") |>
select(trt, response, marker) |>
tbl_summary(
by = trt,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
)
) |>
add_p() |>
bold_labels()
You’ve now produced:
All of it came from one dataset. All of it is reproducible. If the data changes — one row excluded, one variable added — you re-run the script and every table and figure updates automatically.
No blanks. No hints. Your PI just emailed back:
“Can you also look at whether marker level predicts response, stratified by grade? Just a quick summary — whatever you think makes sense.”
Use trial. Use any combination of filter(), group_by(), tbl_summary(), or ggplot(). There is no single right answer.
One possible approach
# Option 1: summary table — marker by response, stratified by grade
trial |>
group_by(grade) |>
tbl_summary(
by = response,
include = marker,
statistic = all_continuous() ~ "{mean} ({sd})",
label = marker ~ "Marker Level"
) |>
add_p() |>
bold_labels()
Tables and figures update when you re-run. But what about the sentences in your paper?
This is the most common way reproducibility breaks down in practice:
“The mean age was 47 years…”
Six months later, you exclude 3 patients. You re-run R. The tables update. But that 47 in your Methods section? Still there. Manually. Wrong.
Hard-coded numbers in prose are a reproducibility leak.
Every time you type a number into a sentence, you create a place where the paper and the data can silently disagree.
In a Quarto document, you can embed live R expressions directly inside your prose using `r`:
When the document renders, that becomes:
“The mean age was 47.2 years (SD 14.3).”
Re-run on updated data? The sentence updates automatically. The number in the prose and the number in the table are guaranteed to match — they come from the same computation.
Renders as:
“A total of 200 patients were enrolled, randomized to Drug A (n = 107) or Drug B (n = 93).”
A total of ` r nrow(trial)` patients were enrolled, randomized to
Drug A (n = ` r sum(trial$trt == "Drug A")`) or
Drug B (n = ` r sum(trial$trt == "Drug B")`).
The response rate was
` r scales::percent(mean(trial$response[trial$trt=="Drug A"], na.rm=TRUE), accuracy=1)`
in the Drug A arm vs.
` r scales::percent(mean(trial$response[trial$trt=="Drug B"], na.rm=TRUE), accuracy=1)`
in the Drug B arm.Renders as:
“The response rate was 28% in the Drug A arm vs. 35% in the Drug B arm.”
The full power of reproducible research.
Change the dataset → re-render the document → every number in every sentence updates.
Tables ✅ Figures ✅ Prose ✅
This is what separates a reproducible paper from a paper that just happens to have code attached.
`r `Update the data → re-render → everything updates. Tables, figures, and the sentences that describe them.
Tonight: Go to posit.cloud — free RStudio in your browser. Create a project.
This week: Run glimpse() on any dataset you’ve been meaning to analyze. Paste it into ChatGPT or Claude and say “help me write a tbl_summary for this.” You’ll have a Table 1 draft in 10 minutes.
This month: Replace one Excel analysis with R. Just one. That’s the habit.
| Resource | What it’s for |
|---|---|
| r4ds.hadley.nz | Free textbook |
| posit.cloud | Free RStudio in your browser |
| medicaldata | Free clinical datasets |
| rmrwr | Free clinical r textbook |
| DART Program | Free hands-on tutorials |
Resources:
One last thing
If you open R this week and run glimpse() on your own data — even once — this session will have done its job.
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gtsummary_2.5.0 medicaldata_0.2.0 ggplot2_4.0.2 dplyr_1.2.0
[5] tidyr_1.3.2
loaded via a namespace (and not attached):
[1] gt_1.3.0 sass_0.4.10 utf8_1.2.6 generics_0.1.4
[5] xml2_1.5.2 stringi_1.8.7 digest_0.6.39 magrittr_2.0.4
[9] evaluate_1.0.5 grid_4.5.3 RColorBrewer_1.1-3 cards_0.7.1
[13] fastmap_1.2.0 jsonlite_2.0.0 cardx_0.3.2 backports_1.5.0
[17] purrr_1.2.1 scales_1.4.0 cli_3.6.5 rlang_1.1.7
[21] litedown_0.9 commonmark_2.0.0 base64enc_0.1-6 withr_3.0.2
[25] yaml_2.3.12 otel_0.2.0 tools_4.5.3 broom_1.0.12
[29] vctrs_0.7.2 R6_2.6.1 lifecycle_1.0.5 stringr_1.6.0
[33] fs_1.6.7 pkgconfig_2.0.3 pillar_1.11.1 gtable_0.3.6
[37] glue_1.8.0 xfun_0.57 tibble_3.3.1 tidyselect_1.2.1
[41] knitr_1.51 farver_2.1.2 htmltools_0.5.9 rmarkdown_2.30
[45] labeling_0.4.3 compiler_4.5.3 S7_0.2.1 markdown_2.0
SNMA 2026 | Pittsburgh, PA