# How to create an epidemiological report?

For most of the following R code I have used the The Epidemiologists R Handbook. It is a great resource to learn many different skills not solely useful to an epidemiologist. First of all, it provides a good introduction to R markdown and tidy data including a lot of examples how to reshape your data and convert variables in order to readily use it/them for your figures, tables and analysis.

Before we start I need you to install the some packages. Please just copy & paste the following text into the console:

install.packages("utils") install.packages("tidyverse") install.packages("ggplot2") install.packages("tidyquant") install.packages("knitr") install.packages("scales") install.packages("officer") install.packages("flextables") install.packages("systemfonts") install.packages("flexdashboard")

Now we need some data to actually produce a report. As an example we upload the EU Covid-19 case data of the European Centre for Disease Prevention and Control from the following websites: for daily cases in the EU and for cases in the different age groups.

Within the code chunks we can use the hashtag # for commenting on the code we write. I advise you to use English as a working language for this. Simply because the code is also in English. And it is extremely important to comment on the code you write, because this will make it easier to follow what you have been doing in the past.

#these libraries need to be loaded
library(utils)

#read the dataset sheet for the daily case numbers in the EU into “R”. The dataset will be called "covid19".
covid19 &lt;- read.csv(&quot;https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv&quot;, na.strings = &quot;&quot;, fileEncoding = &quot;UTF-8-BOM&quot;)

str(covid19) # check if the data has been loaded correctly
head(covid19) # show the to 6 lines of the data frame

#read the dataset sheet for the cases in the different age groups in the EU into “R”. The dataset will be called &quot;covid19_age&quot;.
covid19_age&lt;-read.csv(&quot;https://opendata.ecdc.europa.eu/covid19/agecasesnational/csv/data.csv&quot;, na.strings = &quot;&quot;, fileEncoding = &quot;UTF-8-BOM&quot;)

str(covid19_age) # check if the data has been loaded correctly
head(covid19_age) # show the to 6 lines of the data frame



## Putting plots into your report

You can easily add high-resolution graphs and tables to your report. With the help of the packages ggplot2 and tidyverse you will be able to produce very nice figures to include in your report. First, we need to reshape part of our data and convert some variables.

# load package
library(tidyverse) #to reshape and convert data

# convert dateRep to class date and rename certain columns
covid19 %
mutate(dateRep = as.Date(dateRep, format = "%d/%m/%Y")) %&gt;% rename(country = "countriesAndTerritories", continent = "continentExp")
covid19$prop.dead&lt;-covid19$deaths/covid19$cases #calculate proportion of deaths per day covid19 as.Date("2021-03-31")) #set the start date for your report to 31st March 2021 str(covid19) #have quick look at your data if everything is there as planned # create a data set only for Germany covid19Germany &lt;- filter(covid19, country==&quot;Germany&quot;)  Now we have two data sets ready to be used for making some figures. First, let us produce a figure for the COVID-19 cases in Germany from March 2021 to last week. #load packages library(ggplot2) #to make nice figures library(tidyquant) #to quickly produce a rolling 7 day average # create the ggplot for the number of cases and deaths in Germany from 31st March to today cols {r Corona data EU, echo=FALSE}...**. Then only the figure is shown in the report: {r covid19 data EU, echo=FALSE} #echo=FALSE will only put the following figure in the report without the code chunk # create the ggplot for the number of cases and deaths for all countries in the EU from 31st March to today ggplot(data = covid19)+ geom_col( mapping = aes(x = dateRep, y = cases, colour="Cases", fill="Cases"), width = 1)+ # for daily counts, set width = 1 to avoid white space between bars geom_col( mapping = aes(x = dateRep, y = deaths, colour="Deaths", fill="Deaths"), width = 1)+ # for daily counts, set width = 1 to avoid white space between bars # create facets for each EU country facet_wrap( # split your plot into several plots to show the epi curves for each EU country ~country, # define for which variable you want to produce facets ncol = 5, # number of columns for the arrangement of each small plot strip.position = "top")+ # define where the name of the country should be added to each small plot scale_fill_manual(name="",values=cols)+ scale_colour_manual(name="",values=cols)+ # labels for x-axis scale_x_date( date_breaks = "2 months", # labels every 2 months date_minor_breaks = "1 month", # gridlines every month date_labels = '%b')+ # labelled by month with the year below theme_minimal()+ labs(x = "Date of report", y = "Number of cases/deaths", title = "Daily cases and death numbers in 2021")  #load package library(scales) #to convert scientific numbers to decimal numbers # case numbers for different age groups in the EU ggplot(data = covid19_age)+ geom_col( mapping = aes(x = year_week, y = new_cases,fill=age_group), width = 1)+ # for daily counts, set width = 1 to avoid white space between bars theme_minimal()+ # rotate labels for x-axis theme(axis.text.x = element_text(angle = 90, size = 5))+ #even though still a bit ugly, but with this year_week numbers will be displayed perpendicular to the x-axis, size defines the font size # make y-axis numbers non-scientific scale_y_continuous(labels=comma)+ #convert scientific numbers on the y-axis to "normal" numbers labs(x = "Week of report", y = "Number of cases", title = "Weekly case numbers per age group in the EU", fill="Age group") covid19_age_incidence_means% group_by(year_week,age_group) %&gt;% summarise(mean_14d_incidence=mean(rate_14_day_per_100k,na.rm=T)) # create a tibble with mean 14 day incidence for each age group across all EU countries ggplot(data = covid19_age_incidence_means)+ geom_col( mapping = aes(x = year_week, y = mean_14d_incidence, fill=age_group), width = 1)+ # create facets facet_wrap( # split your plot into several plots to show the epi curves for each age group ~age_group, # define for which variable you want to produce facets ncol = 3, # number of columns for the arrangement of each small plot strip.position = "top")+ # define where the age group should be added to each small plot theme_minimal()+ # rotate labels for x-axis theme(axis.text.x = element_text(angle = 90, size = 5))+ # make y-axis numbers non-scientific scale_y_continuous(labels=comma)+ labs(x = "Date of report", y = "Number of cases per 100'000 inhabitants", title = "14-day incidence rate per age group in the EU", fill="Age group")  ## Case fatality rate CFR The case fatality rate is one measure to investigate the mortality of an infectious disease. Crudely calculated it is the number of deaths in relation to the number of confirmed cases. The CFR can be calculated for different age groups, populations, geographic regions, nations, time periods etc. and can so serve as a comparative "index" for The crude CFR: $$CFR= \frac{\sum _{global\ deaths}}{\sum _{global\ cases}}$$ To construct a 95% confidence interval for a crude CFR, the following formulae can be used: $$Upper Limit = \frac{100}n*(d + 1.96 * \sqrt{d})$$ $$Lower Limit = \frac{100}n*(d – 1.96 * \sqrt{d})$$$d$= number of deaths upon which the CFR is based$n$= number of confirmed positive cases The infection fatality rate is somewhat harder to calculate because here we need some estimate for the number of mild cases that are most likely undetected or unreported. Seroprevalence studies measuring the number of individuals with antibodies in their have estimated the number of all potential COVID-19 cases during local outbreaks and defined the IFR of COVID-19 to be between 0.6 and 0.7%. The RKI estimates the IFR for Germany at 0.8%. Just to give some perspective, of the ~50’000 individuals that currently infect themselves every day ~400 will die. To calculate the crude CFR for Germany we use a different data set, which we will also use for calculating the R~t7~: #get the data for the weekly cases in the EU weekly_cases&lt;-read.csv(&quot;https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv/data.csv&quot;) str(weekly_cases) #reshape the data set infections&lt;-filter(weekly_cases, indicator == &quot;cases&quot;) #subset for confirmed cases dead&lt;-filter(weekly_cases, indicator == &quot;deaths&quot;) #subset for confirmed deaths weekly_numbers&lt;-full_join(infections, dead, by=c(&quot;country&quot;,&quot;country_code&quot;,&quot;continent&quot;,&quot;population&quot;,&quot;year_week&quot;,&quot;source&quot;)) #merge data sets names(weekly_numbers)&lt;-c(&quot;country&quot;,&quot;country_code&quot;,&quot;continent&quot;,&quot;population&quot;,&quot;indicator.x&quot;,&quot;cases&quot;,&quot;year_week&quot;,&quot;case_rate_14_day&quot;,&quot;cumulative_cases&quot;, &quot;source&quot; ,&quot;indicator.y&quot;,&quot;deaths&quot;,&quot;death_rate_14_day&quot;,&quot;cumulative_deaths&quot;) #rename columns  The crude CFR since the start of the pandemic is roughly r round(sum(weekly_numbers$deaths[which(weekly_numbers$country ==&quot;Germany&quot;)]/sum(weekly_numbers$cases[which(weekly_numbers$country ==&quot;Germany&quot;)]))*100,digits=3)% with a confidence interval of r round((100/sum(weekly_numbers$cases[which(weekly_numbers$country ==&quot;Germany&quot;)]))*(sum(weekly_numbers$deaths[which(weekly_numbers$country ==&quot;Germany&quot;)]) - (1.96*sqrt(sum(weekly_numbers$deaths[which(weekly_numbers$country ==&quot;Germany&quot;)])))),digits=3)% to r round((100/sum(weekly_numbers$cases[which(weekly_numbers$country ==&quot;Germany&quot;)]))*(sum(weekly_numbers$deaths[which(weekly_numbers$country ==&quot;Germany&quot;)]) + (1.96*sqrt(sum(weekly_numbers$deaths[which(weekly_numbers$country ==&quot;Germany&quot;)])))),digits=3)%. The function round() can be used to round numbers and define the number of decimals with, e.g. the code for rounding x to 3 decimals is as follows round(x, digits=3). Probably we are interested in the cases, deaths and the CFR of all EU countries. With the function flextable() from the package with same name we can produce a neat table for comparing case numbers: #produce a tibble with covid19 case counts for EU country including CFR and confidence intervals covid19_counts% group_by(country) %&gt;% summarise(case_count=sum(cases,na.rm=T),death_count=sum(deaths,na.rm=T)) covid19_counts$recovered&lt;-covid19_counts$case_count-covid19_counts$death_count #calculate the number of recovered
covid19_counts$CFR&lt;-round(covid19_counts$death_count/covid19_counts$case_count*100,digits=3) #calculate CFR per country covid19_counts$lower_CI&lt;-round((100/covid19_counts$case_count)*(covid19_counts$death_count - (1.96*sqrt(covid19_counts$death_count))),digits=3) #calculate upper confidence interval covid19_counts$upper_CI%              # table is piped in from above
top = TRUE,                            # New header goes on top of existing header row
values = c("Country",                  # Header values for each column below
"Total confirmed cases",
"Deaths",
"Recovered",
"Mortality",                # This will be the top-level header for this and two next columns
"",
"")) %&gt;%
country = "",
case_count = "",
death_count = "",
recovered = "",
CFR = "CFR %",                       # This goes below the top-level header "Mortality" we defined above
lower_CI = "lower 95% CI",           # This goes below the top-level header "Mortality" we defined above
upper_CI = "upper 95% CI")  %&gt;%      # This goes below the top-level header "Mortality" we defined above
merge_at(i = 1, j = 5:7, part = "header") %&gt;% # Horizontally merge columns 5 to 7 in new header row
border_remove() %&gt;%
theme_booktabs() %&gt;%
vline(part = "all", j = 2, border = border_style) %&gt;%   # draw vertical line at column 2
vline(part = "all", j = 4, border = border_style) %&gt;%   # draw vertical line at column 4
merge_at(i = 1:2, j = 1, part = "header") %&gt;%           # vertically merge column 1
merge_at(i = 1:2, j = 2, part = "header") %&gt;%           # vertically merge column 2
merge_at(i = 1:2, j = 3, part = "header") %&gt;%           # vertically merge column 3
merge_at(i = 1:2, j = 4, part = "header") %&gt;%           # vertically merge column 4
width(j=c(1,2), width = 1.2) %&gt;%                        # define column width for column 1 and 2
width(j=c(3,4), width = 1) %&gt;%                          # define column width for column 3 and 4
width(j=c(5,6,7), width = 0.6) %&gt;%                      # define column width for columns 5 to 7
flextable::align(., align = "center", j = c(2:7), part = "all") %&gt;%  # center all text
bg(., part = "body", bg = "gray95")  %&gt;%                # make background of table body grey
bg(., j=c(1:7), i= ~ country == "Germany", part = "body", bg = "#91c293") %&gt;%   # make background of line with Germany green
bold(i = 1, bold = TRUE, part = "header") %&gt;%           # make header font bold
bold(i = 11, bold = TRUE, part = "body")                # make font bold in line 11 of the table body



If we compare the CFRs from both data sets, we immediately realise that they differ dramatically. Over the complete pandemic the CFR is r round(sum(weekly_numbers$deaths[which(weekly_numbers$country =="Germany")]/sum(weekly_numbers$cases[which(weekly_numbers$country =="Germany")]))*100,digits=3)% with a confidence interval of r round((100/sum(weekly_numbers$cases[which(weekly_numbers$country =="Germany")]))*(sum(weekly_numbers$deaths[which(weekly_numbers$country =="Germany")]) - (1.96*sqrt(sum(weekly_numbers$deaths[which(weekly_numbers$country =="Germany")])))),digits=3)% to r round((100/sum(weekly_numbers$cases[which(weekly_numbers$country =="Germany")]))*(sum(weekly_numbers$deaths[which(weekly_numbers$country =="Germany")]) + (1.96*sqrt(sum(weekly_numbers$deaths[which(weekly_numbers$country =="Germany")])))),digits=3)% while the CFR from March 2021 to last week is much lower, i.e. r covid19_counts$CFR[which(covid19_counts$country=="Germany")]% (95% CI: r covid19_counts$lower_CI[which(covid19_counts$country=="Germany")]%, r covid19_counts$upper_CI[which(covid19_counts$country=="Germany")]% ). This decrease is a potential effect of the public health measures, such as masks, social distancing and vaccines. But such a claim would need to be properly analysed, which is beyond the scope of this course.

## Reproduction number R~0~ and R~t7~

The basic reproduction number is only used in theoretical models and is the product of the average of infection-producing contacts per unit time $\beta$ and the mean infectious period of $\tau$:

$$R_{~0~} = \beta\tau$$

During a pandemic the effective reproduction number constantly changes, especially if health measure are put in place and depends on the local behaviour of the host. Therefore, early in the pandemic the effective reproduction number over seven days was calculated to have some idea how and whether health measures had any effect on the pandemic.

This effective reproduction number can be calculated as follows:

$$R_{t7}= \frac{\sum _{cases\ of\ the\ last\ week}}{\sum _{cases\ of\ the\ week\ before}}$$

For example, the R~t7~ value for week 39 was r round(weekly_numbers$cases[which(weekly_numbers$country =="Germany")][40]/weekly_numbers$cases[which(weekly_numbers$country =="Germany")][39],digits=3) in Germany.

Very important to note here: I am not an epidemiologist, most of these calculations, including this R~t7~ value, are very crude measures. In reality, CFR, IFR, R~t7~ and R0 are not the same for each age group and population. Therefore, proper epidemiologists take such aspects into account and correct or scale for them in their models. However, this is beyond the scope of this course.

How something like this is done correctly you can read up in Yamauchi et al. 2019. I will elaborate on these calculations in the epidemiology lectures. With the present example I just wanted to introduce you to R markdown and how you can easily write a script to produce a pretty shiny report for the COVID-19 case numbers in Germany.

On side note: How you can generate such nice formulas in R markdown is beyond the scope of this course. But if you are interested visit the following website.

## Now it is your turn to practice

Please try some scripting and writing yourself. The best way to get used to R and its packages is to use the code and change parameters in order to see and learn what the code stands for.

Therefore I would like you to do the following excercises:

1. Give the .rmd file a new name and save it. Then change the author name to your name and generate a report for another country.

>output:
html_document:
fig_height: 4
fig_width: 6
toc: yes
toc_depth: 5
toc_float: yes

3. Choose word_document as a report output (you can also try powerpoint and PDF as output format)

4. Generate a report only for 31st Juli to 29th November 2021 removing all the code chunks.