Creating reports with R and MS Excel: a tutorial using the openxlsx2 package (EN)
Overview
Case study characteristics | |
---|---|
Name: | openxlsx2 tutorial |
Language: | English |
Tools: | R, MS Excel |
Location: | N/A |
Scale: | N/A |
Diseases: | N/A |
Keywords: | R, Excel, Report, Export, Format, openxlsx2, Tutorial |
Technical complexity: | Intermediate |
Methodological complexity: | Intermediate |
Authorship
Original authors: Leonel Lerebours and Alberto Mateo Urdiales
Data source: None (example data will be generated with R)
Instructions
Getting Help
There are several ways to get help:
- Look for the “hints” and solutions (see below)
- Post a question in Applied Epi Community with reference to this case study
Hints and Solutions
Here is what the “helpers” look like:
Click to read a hint
Here you will see a helpful hint!
Click to see the solution
ebola_linelist %>%
  filter(
    age > 25,
    district == "Bolo"
  )
Here is more explanation about why the solution works.
Posting a question in the Community Forum
… description here about posting in Community… TO BE COMPLETED BY APPLIED EPI
You will see these icons throughout the exercises:
Icon | Meaning |
---|---|
 | Observe |
 | Alert! |
 | An informative note |
 | Time for you to code! |
 | Change to another window |
 | Remember this for later |
Terms of Use
- You may use this tutorial, for educational purposes, to learn how to generate reports with R by creating tables and exporting them to MS Excel with the openxlsx2 package, and you may apply the techniques learned to your personal or professional projects. This tutorial may be freely translated, copied, or distributed. No warranty is made or implied for use of the software for any particular purpose.
Feedback & suggestions
- You can write feedback and suggestions on this tutorial at the GitHub issues page
- Alternatively email us at: contact@appliedepi.org
Version and revisions
Version 1
July 27, 2024
Disclaimer
- The main focus of the tutorial is the core functions of the openxlsx2 package, up to version 1.8.
- You must have MS Excel (or an equivalent such as LibreOffice) installed to view the output tables.
- The data for this tutorial will be generated randomly (any resemblance to real data is purely coincidental).
Date | Changes made | Author |
---|---|---|
2024-07-27 | None (first version) | Leonel Lerebours |
Guidance
Objectives of this case study
The goal of this tutorial is to introduce you to using openxlsx2 to export formatted tables to MS Excel.
Previous level of expertise assumed
To follow this case study, the following expertise is recommended:
- Intermediate R skills and at least basic knowledge of dplyr (from the tidyverse), such as the pipe operator and data wrangling. Here is some reference.
- Epidemiological experience (e.g., knowledge of how to design output tables for reporting purposes)
Preparation for the case study
- Install the openxlsx2 package (directly from RStudio or here)
- You must have MS Excel (or an equivalent such as LibreOffice) installed to view the output tables.
Why use MS Excel for reporting?
Excel is one of the most widely used programs for data analysis, visualization, and much else. Its formatting options let users adjust fonts, colors, borders, and alignment to create visually appealing reports with very little training, so it is a common reporting tool in many fields, including epidemiology.
In a way this is understandable, since Excel lets you “interact” with the data shown: for example, you can do quick calculations on a summary table, modify a graph, or compare with previous reports.
If you do, or have done, periodic reporting, for example in epidemiological surveillance, you or a co-worker have probably at some point used Excel or other spreadsheet software such as LibreOffice to present tables and summaries.
However, for all of Excel's perks, it is somewhat hard to automate a report in Excel, even with a template that has a pre-designed format. Creating a table or a graph and editing a spreadsheet is also time consuming, more so if you don’t know how to use macros. If you add up all the time it takes to format borders, resize a column, or change a font size, the total will probably surprise you.
Automating a report in Excel with R using the openxlsx2 package
As stated on the CRAN page of openxlsx2, the main purpose of the package is:
“Simplifies the creation of ‘xlsx’ files by providing a high level interface to writing, styling and editing worksheets.”
In this short tutorial we are going to create and format a summary report from scratch in R, without touching Excel or any other spreadsheet software.
First step: The Data
Before using the functions of openxlsx2, we need to decide which elements we want in the exported report: how many tables, and what type of tables (aggregated data or line lists).
For this tutorial, the scenario is a summary of laboratory output (samples received and confirmed, by year and month).
You can create dummy data as the source. The tool does not matter: it could be Excel or other spreadsheet software using formulas to generate random numbers, or R. Use the following variables:
- Date: from a range you want; I use 3 years.
- Laboratories: a categorical variable with the values from “A” to “E”.
- Number of samples received: a numerical variable (a random number from 0 to 100, for example).
- Number of samples with positive results: a numerical variable smaller than the previous one; I suggest a proportion of the samples received, such as 2% - 5%.
To start right away, let’s create a dataframe object with 1,000 observations using tidyverse with the following code:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(1300) # to replicate the same or similar dataframe

# A dataframe for the example using 1000 observations
bd <- tibble(

  date = sample(seq(as.Date("2022-01-01"), # Random dates
                    as.Date("2024-12-31"), by = "day"),
                size = 1000,
                replace = TRUE),

  laboratories = sample(LETTERS[1:5], # Random labs (A to E)
                        size = 1000,
                        replace = TRUE),

  n_samples = sample(1:40, # Random samples (by day and lab)
                     size = 1000,
                     replace = TRUE)) %>%

  mutate(

    n_confirmed = round(sample(seq( # Random confirmed samples
      from = 0.01,
      to = 0.05,
      by = 0.001),
      size = 1000,
      replace = TRUE) * n_samples, 0),

    pct_confirmed = n_confirmed / n_samples # % positivity

  )
head(bd)
# A tibble: 6 × 5
  date       laboratories n_samples n_confirmed pct_confirmed
  <date>     <chr>            <int>       <dbl>         <dbl>
1 2024-09-04 B                   15           1        0.0667
2 2023-02-27 C                   26           1        0.0385
3 2023-06-02 A                   24           0        0
4 2023-02-19 B                    1           0        0
5 2024-06-15 D                   29           1        0.0345
6 2022-06-24 C                   39           2        0.0513
Second step: transforming the data and creating the summary tables
As you can see, from this simple dummy dataframe we want to know:
- How many samples were reported by month.
- What proportion of samples was confirmed by month.
- What proportion of samples was reported by each laboratory.
- The overall positivity rate by laboratory.
To obtain this information we will create several summary tables.
Here is the code to create three summary tables:
Note: If you created a dummy database with Excel (or you want to use your own data), you have to add a line to import your file. You can use the import() function from the rio package or the read_xlsx() function from the openxlsx2 package.
# install.packages("pacman") # uncomment this line if you have not installed pacman (package manager)

library(pacman)

p_load(tidyverse, # to wrangle data
       janitor,   # to make tables
       openxlsx2) # to build the report and export it in MS Excel format

# table 1: samples by year and month

total_sample_tab <- bd %>%

  mutate(months = month(as.Date(date), label = TRUE),
         years = year(as.Date(date))) %>%

  group_by(years, months) %>%

  reframe(tot_samples = sum(n_samples)) %>%

  rename("year of reporting" = years) %>%

  pivot_wider(names_from = months,
              values_from = tot_samples,
              values_fill = 0) %>%

  adorn_totals(c("col", "row"))


# table 2: positivity of samples by year and month

# main part of table 2
positivity_a <- bd %>%

  mutate(months = month(as.Date(date), label = TRUE),
         years = as.character(year(as.Date(date)))) %>%

  group_by(years, months) %>%

  reframe(tot_samples = sum(n_samples),
          tot_confirmed = sum(n_confirmed),
          pct = tot_confirmed / tot_samples) %>%

  select(years, months, pct) %>%

  pivot_wider(names_from = months,
              values_from = pct,
              values_fill = 0)

# last row of table 2 (positivity by month across all years)
positivity_b <- bd %>%

  mutate(months = month(as.Date(date), label = TRUE),
         years = year(as.Date(date))) %>%

  group_by(months) %>%

  reframe(tot_samples = sum(n_samples),
          tot_confirmed = sum(n_confirmed),
          pct = tot_confirmed / tot_samples,
          years = "Monthly Pos.") %>%

  select(years, months, pct) %>%

  pivot_wider(names_from = months,
              values_from = pct,
              values_fill = 0)

# last column of table 2 (positivity by year across all months)
positivity_c <- bd %>%

  mutate(months = month(as.Date(date), label = TRUE),
         years = as.character(year(as.Date(date)))) %>%

  group_by(years) %>%

  reframe(tot_samples = sum(n_samples),
          tot_confirmed = sum(n_confirmed),
          pct = tot_confirmed / tot_samples) %>%

  select(years, year_pct = pct)

# assemble table 2: main part + last row, then the last column
positivity_tab <- bind_rows(positivity_a, positivity_b) %>%
  left_join(positivity_c, by = "years") %>%
  rename("year of reporting" = years,
         "Yearly pos." = year_pct)


# table 3: samples and positivity by laboratory and year
laboratory_summary <- bd %>%

  mutate(years = year(date)) %>%

  group_by(years, laboratories) %>%

  reframe(total_samples = sum(n_samples),
          total_confirmed = sum(n_confirmed)) %>%

  arrange(laboratories, years) %>%

  adorn_totals("row") %>%

  mutate(positivity_rate = total_confirmed / total_samples) %>%

  rename("Years of reporting" = years,
         "Laboratories" = laboratories,
         "Total Samples" = total_samples,
         "Confirmed Samples" = total_confirmed,
         "% of confirmed samples" = positivity_rate)
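The note before the code above mentions importing your own file instead of generating dummy data. A minimal sketch of that import step follows, assuming the openxlsx2 package is installed. The file here is a hypothetical stand-in written to a temporary path so that the example runs on its own, and the object name `bd_imported` is only illustrative; with real data you would point `read_xlsx()` at your own file and name the result `bd` so the summary code works unchanged.

```r
library(openxlsx2)

# Hypothetical stand-in for your own file: a tiny dataset written to a
# temporary .xlsx (with real data, skip this step and use your file path)
my_file <- tempfile(fileext = ".xlsx")
write_xlsx(
  data.frame(
    laboratories = c("A", "B"),
    n_samples    = c(15, 26),
    n_confirmed  = c(1, 1)
  ),
  file = my_file
)

# Import the first worksheet as a data frame
bd_imported <- read_xlsx(my_file)

# With the rio package, the equivalent call would be rio::import(my_file)
str(bd_imported)
```

Whichever import function you use, check that the column names and types match what the summary code expects before running it.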
Now that we have the tables for the summary report, let’s go over the main functions of openxlsx2:
Main functions
- wb_workbook(): creates a new workbook
- wb_add_worksheet(): adds a worksheet (name, zoom level, and gridlines)
- wb_add_data(): adds a dataframe, a table, a text string, or a single value
- wb_save(): exports the workbook to a file (Excel format)
- wb_open(): really handy for opening the workbook in Excel right away (to see the results of the code)
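To see how these functions fit together, here is a minimal sketch that chains them on a small table. The table `toy_tab`, the sheet name "Summary", and the temporary output path are all hypothetical choices made for illustration; the native pipe `|>` assumes R 4.1 or later (the `%>%` pipe used elsewhere in this tutorial should work the same way).

```r
library(openxlsx2)

# Hypothetical toy table standing in for one of the summary tables
toy_tab <- data.frame(
  laboratory    = c("A", "B", "C"),
  total_samples = c(120, 95, 143)
)

# Temporary output path so the sketch does not write into your project folder
out_file <- tempfile(fileext = ".xlsx")

# Create the workbook, add a sheet without gridlines at 90% zoom,
# and write the table starting at cell A1
wb <- wb_workbook() |>
  wb_add_worksheet(sheet = "Summary", zoom = 90, grid_lines = FALSE) |>
  wb_add_data(sheet = "Summary", x = toy_tab, start_col = 1, start_row = 1)

# Export the workbook to an .xlsx file
wb_save(wb, file = out_file, overwrite = TRUE)

# Open the result in your spreadsheet software (interactive sessions only)
if (interactive()) wb_open(wb)
```

Each `wb_*()` function takes the workbook as its first argument and returns the modified workbook, which is why the calls can be chained with a pipe.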