if (!require("socialmixr")) install.packages("socialmixr")

## Loading required package: socialmixr

## 
## Attaching package: 'socialmixr'

## The following object is masked from 'package:utils':
## 
##     cite

library(socialmixr)
data(polymod)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## A table with all participants and their contacts in 'long' format, where
## participants with 0 contacts get a single row for which cont_id is NA.
allpartcont <- left_join(polymod$participants, polymod$contacts)

## Joining with `by = join_by(part_id)`

## A table with all participants and their contacts in 'long' format, where
## participants with 0 contacts are dropped.
partcont <- inner_join(polymod$participants, polymod$contacts)

## Joining with `by = join_by(part_id)`

## Participants without contacts have been dropped:
partcont %>%
    filter(is.na(cont_id)) %>%
    select(part_id:part_occupation, country, hh_size, sday_id:cnt_gender)

Question 1

(a)

Question: How do the distributions of the numbers of contacts per person vary from country to country, and is there a gender difference within each country?

Note: Some participants have blank for gender. You may either omit them or include them in the visualisation; if the latter, clearly indicate that their gender is missing.

Choice of chart

Since we are interested in the distribution of number of contacts from country to country and by gender, box plots are chosen as it effectively visualizes many distributions at once (Wilke (2019) Fundamentals of Data Visualization, Chapters 5).
In addition, the box plot is appropriate in the sense that the target audience is an academic or science popularization publication who should be assumed to be able to interpret the box plot.
The plot uses two position scales and one color scale. The variable “country” is mapped onto the x-axis, the number of contacts per person onto the y-axis, gender is mapped by color to show how the distributions of number of contacts vary across different countries and genders.
The colors hue scale (ggplot2 default) are chosen to clearly distinguish the gender, male and female while no single color stands out relative to the other or gives any impression of a specific order.

library(ggplot2)
library(dplyr)

allpartcont <- left_join(polymod$participants, polymod$contacts)

## Joining with `by = join_by(part_id)`

cnt_per_person <- allpartcont %>%
  group_by(part_id) %>%
  summarise(cnt_counts = ifelse(all(is.na(cont_id)), 0, n())) %>%
  left_join(allpartcont, by = "part_id") %>%
  select(part_id, cnt_counts, part_gender, country) %>%
  distinct(part_id, cnt_counts, part_gender, country) %>%
  filter(!is.na(part_gender) & part_gender != "")

ggplot(cnt_per_person, aes(factor(country), cnt_counts, fill = part_gender)) +
  geom_boxplot() +             
  labs(x = "Country",   
       y = "Number of Contacts per Person", 
       fill = "Gender")

Conclusion:

How do the distributions of the numbers of contacts per person vary from country to country, and is there a gender difference within each country?

The distribution of the number of contacts per person significantly vary from country to country. For example, Italy and Luxembourg show wide distribution with higher median number of contacts whereas, Germany has a narrow distribution with much smaller median number of contacts across the country.
Belgium and Finland, have notably similar distributions whereas the Netherlands, Poland and the UK are each more unique in their individual distribution numbers and shape.
There is no significant gender difference in the distribution of each country. For instance Germany shows almost identical distributions across genders whilst some minimal differences exist between genders in remaining countries.

(b)

Question: How do the means of the numbers of contacts per person vary from country to country and by gender? Also include information to let the viewer quickly assess if the difference may be statistically significant.

Choice of chart

Similarly in this chart, the variable “country” is mapped along the x-axis, means of the number of contacts per person onto the y axis, gender mapped by color and the colors hue scale (ggplot2 default) are used to clearly distinguish the male and female without one color standing out or giving any impression of on order.
Each mean number of contacts is represented by a dot shape to visually emphasize the position of the mean within the confidence interval.
The means within each gender are connected by a line to efficiently show whether the differences are statistically significant from country to country and let the viewer quickly assess its significance.
The confidence interval are shown in an I-shape to effectively check the overlap from country to country and by gender.
For each country, the plotted distance between each gender is set to 0.1 to position the error bars next to each other rather than overplotting in order to easily capture the mean differences by gender.

library(broom)

mean_CIs <- cnt_per_person %>%
  group_by(country, part_gender) %>%
  summarise(
   CI95 = tidy(t.test(cnt_counts, conf.level = 0.95))
  )

## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

dodge <- position_dodge(0.1)

ggplot(mean_CIs, aes(y = CI95$estimate, x = country, col = part_gender, group = part_gender)) +
  geom_errorbar(aes(ymin = CI95$conf.low, ymax = CI95$conf.high), width = .1, position = dodge) +
  geom_point(position = dodge) + 
  geom_line(position = dodge) + 
  labs(x = "Country",
       y = "Mean number of contacts",
       col = "Gender", 
       title = "95% confidence intervals")

Conclusion

How do the means of the numbers of contacts per person vary from country to country and by gender? Also include information to let the viewer quickly assess if the difference may be statistically significant.

In the above plot, the pink line and blue line have similar patterns though there are some slight differences. In addition, the confidence intervals within each country are overlapping. Therefore, gender as a variable does not have a significant impact on the mean value and so the mean of the number of contacts per person do not vary by gender.
Both the pink line and blue line values vary up and down greatly across countries. For example, the mean peaks at Italy and troughs at Germany and displays a large difference showing that the means significantly vary between countries. Furthermore, the fact that there are less overlappings in confidence intervals across countries evidently shows that country as a variable has a statistically significant impact on means of the number of contacts per person.

(c)

Question: What fraction of contacts occurred in which type of setting, breaking down by country?

Target Audience: General public, e.g., illustration to a newspaper article.

Choice of chart

The stacked bars were chosen as it is useful for visualizing the proportion especially when a large number of condition exists or comparing multiple sets of proportions (Wilke 5.3) and in this case each country has six setting types and nine countries exist. Additionally the easy interpretability of stacked bars is appropriate for the target audience, the general public.
The proportion is mapped onto the x-axis as it enhances the proportion comparison than when mapped on the y axis. “Country” is used as the variable for the y-axis to easily compare the proportion across countries and the setting onto color to distinguish the setting.
The categorical variable, “contact setting” does not have an inherent order and the proportion of settings vary across the countries. Thus, no single contact setting should stand out relative to the others. Therefore, the qualitative color scale “Okabe Ito” was used to clearly make distinct each setting type in the plot.
The contact setting breakdown plot below were ordered by countries with more similar proportions grouped together. Poland, positioned first was the most unique, with Italy and Luxembourg notably similar in their breakdown proportions, then the UK and Germany, then Finland, Belgium and the Netherlands.

library(tidyr)
partcontset <- partcont %>%
    rowwise() %>%
    mutate(cnt_nsettings = sum(c_across(cnt_home:cnt_otherplace))) %>%
    mutate(across(cnt_home:cnt_otherplace, function(x) x/cnt_nsettings)) %>%
    #select(-cnt_nsettings) %>%
    pivot_longer(cnt_home:cnt_otherplace, names_to = "cnt_setting", names_prefix = "cnt_",
        values_to = "cnt_setting_frac") %>%
    filter(cnt_setting_frac > 0) %>%
    mutate(cnt_setting = ifelse(cnt_setting == "otherplace", "other", cnt_setting))

frac_setting_country <- partcontset %>%
  group_by(country, cnt_setting) %>%
  summarise(setting_contacts = sum(cnt_setting_frac)) %>%
  
  group_by(country) %>%
  mutate(cnts_in_country = sum(setting_contacts)) %>%
  mutate(country_setting_cnts_frac = setting_contacts / cnts_in_country) %>%
  select(-cnts_in_country, -setting_contacts)

## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

frac_setting_country$country <- factor(frac_setting_country$country, 
                                       levels = c("Netherlands", "Belgium", "Finland", 
                                                  "Germany", "United Kingdom", 
                                                  "Luxembourg", "Italy", "Poland"))

#frac_setting_country
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

ggplot(frac_setting_country, aes(country_setting_cnts_frac, country, fill = cnt_setting)) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(x = "Proportion", y = "Country", fill = "Contact Setting") +
  scale_fill_manual(values = okabe_ito)

Conclusion

Question: What fraction of contacts occurred in which type of setting, breaking down by country?

It can be observed from the visualisations that across countries, the setting of “home” had the greatest fraction of contacts which intuitively makes sense given the significant time and close proximity shared by family members. Additionally, the other major settings in which affected contacts were “work”, “school” and “leisure” with roughly equivalent share of contacts. For all countries, transport contacts were the least frequent.
The Netherlands and Belgium which are neighboring countries also appear to have quite similar proportions of contacts across settings with Finland another country that share similar proportions. In both Netherlands and Finland, contacts through “leisure” were the biggest fraction with “home” being ranked closely second. However what was different was that for the Netherlands, school was the third largest setting, whilst for Belgium it was the fourth largest setting.
Germany and the United Kingdom, also appear to have similar proportionality of contact settings. Both countries had an especially similar ratio of contact settings between work and home with home being clearly the largest fraction of contacts. However some differences were that in the UK leisure was a much smaller source of contact, being the second smallest fraction whilst in Germany it was the second largest source of contacts as a setting.
Luxembourg and Italy had surprisingly similar proportionality of contacts across settings with each having the greatest fraction of contacts occurring in school settings with the rest apart from transport, evenly proportioned.
Poland was more of a unique case in which in which contacts mostly occurred at home and work but evenly between the two, and leisure was ranked low - the second lowest setting amongst settings in Poland and notably the smallest leisure proportion compared to all other countries in the dataset.

Question 2

Now, suppose that we would like to predict the number of contacts a participant has (and hence their propensity to be infected or to infect others) as a function of their various demographics included in the dataset, which include the participant’s age, gender, occupation, country, household size, if they are a child and if so if they are in child care, and others, as well as the day of the week of the diary (particularly weekend).

Notes:

Keep in mind that for small children, some of the answers pertain to their guardian. Refer to the data dictionary for which ones.
Some of the values are stored as numeric codes. Refer to the data dictionary, and recode them for regression.

Target Audience: The assumed audience for the plots for this question is either creating plots for your own exploration, or for sharing with a single colleague with similar technical knowledge.

(a)

Fit either a linear or a log-linear (Poisson) regression model for the number of contacts a participant has (which may be 0) on at least 3 predictors. Print out the summary() of the model.

When considering what variables to choose for the regression model, think about variables you would choose for prediction, e.g., which variables might be available in a census. You can use stepwise selection (? step), or some other reasonable approach. Keep in mind that quantitative predictors (e.g., household size, age) may have nonlinear effects.

if they are a child and if so if they are in child care

Keep in mind that for small children, some of the answers pertain to their guardian. Refer to the data dictionary for which ones.

participant <- allpartcont %>%
  group_by(part_id) %>%
  summarise(numOfcontacts = ifelse(all(is.na(cont_id)), 0, n()))

participant <- left_join(polymod$participants, participant) %>%
  mutate(pertain_guardian = ifelse(type == 3, 1, 0)) %>%
  mutate(is_child_in_care = ifelse(part_age > 18, "Oth", 
                           ifelse(part_age < 18 & child_care == "Y", "Y", "N"))) %>%
  select(numOfcontacts, part_age, part_gender, part_occupation, country, hh_size, is_child_in_care, pertain_guardian, dayofweek) %>%
  mutate(
    part_occupation = as.factor(part_occupation),
    is_child_in_care = as.factor(is_child_in_care),
    pertain_guardian = as.factor(pertain_guardian), 
    dayofweek = as.factor(dayofweek)
  )

## Joining with `by = join_by(part_id)`

participant <- na.omit(participant)

glm_8 <- glm(numOfcontacts ~ ., family = quasipoisson, data = participant)
summary(glm_8)

## 
## Call:
## glm(formula = numOfcontacts ~ ., family = quasipoisson, data = participant)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.5427977  0.2247176  11.316  < 2e-16 ***
## part_age              -0.0023407  0.0008661  -2.702  0.00690 ** 
## part_genderF          -0.2496676  0.2177424  -1.147  0.25158    
## part_genderM          -0.2817630  0.2178113  -1.294  0.19584    
## part_occupation2      -0.4235366  0.0366823 -11.546  < 2e-16 ***
## part_occupation3      -0.3293490  0.0390186  -8.441  < 2e-16 ***
## part_occupation4      -0.3835877  0.0450500  -8.515  < 2e-16 ***
## part_occupation5      -0.0889451  0.0290323  -3.064  0.00219 ** 
## part_occupation6      -0.2106734  0.0445615  -4.728 2.32e-06 ***
## countryFinland        -0.0520783  0.0349440  -1.490  0.13618    
## countryGermany        -0.3191339  0.0355017  -8.989  < 2e-16 ***
## countryItaly           0.4907709  0.0323998  15.147  < 2e-16 ***
## countryLuxembourg      0.3725931  0.0320537  11.624  < 2e-16 ***
## countryNetherlands     0.2075189  0.0482528   4.301 1.73e-05 ***
## countryPoland          0.3205378  0.0324281   9.885  < 2e-16 ***
## countryUnited Kingdom -0.0107515  0.0351176  -0.306  0.75950    
## hh_size                0.0730490  0.0063866  11.438  < 2e-16 ***
## is_child_in_careOth   -0.1575293  0.0338320  -4.656 3.28e-06 ***
## is_child_in_careY      0.3007638  0.0584672   5.144 2.76e-07 ***
## pertain_guardian1     -0.5919230  0.0620027  -9.547  < 2e-16 ***
## dayofweek1             0.3178249  0.0336201   9.453  < 2e-16 ***
## dayofweek2             0.3672660  0.0328238  11.189  < 2e-16 ***
## dayofweek3             0.3718579  0.0331706  11.210  < 2e-16 ***
## dayofweek4             0.3720652  0.0327572  11.358  < 2e-16 ***
## dayofweek5             0.3821826  0.0325235  11.751  < 2e-16 ***
## dayofweek6             0.1438691  0.0353902   4.065 4.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 5.862636)
## 
##     Null deviance: 51612  on 6944  degrees of freedom
## Residual deviance: 36404  on 6919  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

The p-values of gender for both levels are greater than 0.05 which are 0.25158 and 0.19584 respectively. Therefore we do not reject the null hypothesis and conclude that gender is not significant at significance level, 0.05 when predicting number of contacts. Hence the gender feature has been dropped.

glm_7 <- glm(numOfcontacts ~ part_age + part_occupation+country+hh_size+is_child_in_care+pertain_guardian+dayofweek, family = quasipoisson, data = participant)

drop1(glm_7)

All the variables used in glm_7 are significant as deviance has been increased after removing each variable.
Therefore, none of the feature sets are dropped and so final model would be glm_7.

summary(glm_7)

## 
## Call:
## glm(formula = numOfcontacts ~ part_age + part_occupation + country + 
##     hh_size + is_child_in_care + pertain_guardian + dayofweek, 
##     family = quasipoisson, data = participant)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.2779797  0.0560565  40.637  < 2e-16 ***
## part_age              -0.0024068  0.0008653  -2.782  0.00543 ** 
## part_occupation2      -0.4214419  0.0366565 -11.497  < 2e-16 ***
## part_occupation3      -0.3150429  0.0384035  -8.203 2.76e-16 ***
## part_occupation4      -0.3812247  0.0450219  -8.468  < 2e-16 ***
## part_occupation5      -0.0861940  0.0289876  -2.973  0.00295 ** 
## part_occupation6      -0.2088986  0.0445689  -4.687 2.82e-06 ***
## countryFinland        -0.0513752  0.0349330  -1.471  0.14142    
## countryGermany        -0.3187197  0.0354883  -8.981  < 2e-16 ***
## countryItaly           0.4887631  0.0323750  15.097  < 2e-16 ***
## countryLuxembourg      0.3709824  0.0320292  11.583  < 2e-16 ***
## countryNetherlands     0.2054641  0.0482300   4.260 2.07e-05 ***
## countryPoland          0.3184458  0.0324025   9.828  < 2e-16 ***
## countryUnited Kingdom -0.0126070  0.0350952  -0.359  0.71944    
## hh_size                0.0729310  0.0063854  11.421  < 2e-16 ***
## is_child_in_careOth   -0.1543480  0.0337845  -4.569 4.99e-06 ***
## is_child_in_careY      0.2983381  0.0584673   5.103 3.44e-07 ***
## pertain_guardian1     -0.5914634  0.0619962  -9.540  < 2e-16 ***
## dayofweek1             0.3174596  0.0336121   9.445  < 2e-16 ***
## dayofweek2             0.3675987  0.0328057  11.205  < 2e-16 ***
## dayofweek3             0.3717824  0.0331633  11.211  < 2e-16 ***
## dayofweek4             0.3716513  0.0327452  11.350  < 2e-16 ***
## dayofweek5             0.3817912  0.0325157  11.742  < 2e-16 ***
## dayofweek6             0.1440764  0.0353728   4.073 4.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 5.860109)
## 
##     Null deviance: 51612  on 6944  degrees of freedom
## Residual deviance: 36435  on 6921  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

(b)

Create at least 2 diagnostic plots for the regression model and comment on what each plot says about the model fit. For each plot, conclude whether there are any potential violations of linear model assumptions. Note: only the first 2 diagnostic plots will be marked

library(broom)
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.4.1

participantaug <- augment(glm_7, participant)

high_abs_resid <- rank(-abs(participantaug$.std.resid)) <= 3

ggplot(participantaug, aes(x = .fitted, y = .resid,
                   label = ifelse(high_abs_resid, 
                                  seq_along(high_abs_resid), NA))) +
  geom_point() + 
  geom_smooth(se = FALSE) + 
  geom_text_repel() + 
  labs("Residuals vs. Fitted Plot", 
       x = "Fitted values", 
       y = "Residual")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: Removed 6942 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).

In Residuals vs Fitted, while the residuals are mostly centered around zero, however there is a slight convex curve suggesting that the linear model does not fully capture the underlying relation between predictors and response variable such as non-linear effect. Therefore, the linearity assumption may not hold. The quantitative predictor such as household size or age could have a non-linear relationship between the number of contacts and this could be addressed by transforming the feature.

ggplot(participantaug, aes(x = .fitted, y = sqrt(abs(.std.resid)),
                   label = ifelse(high_abs_resid, 
                                  seq_along(high_abs_resid), NA))) +
  geom_point() + 
  geom_smooth(se = FALSE) + 
  geom_text_repel() + 
  labs(title = "Scale-Location Plot", 
       x = "Fitted values", 
       y = expression(sqrt(abs("Std. Residual"))))

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: Removed 6942 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).

In the Scale-Location plot, although the blue line is relatively flat, there is a slightly upward trend suggesting that as the fitted values increase, the variance of the residuals might slightly increase as well. Therefore, this could indicate the violation of the equal variance (homoscedasticity) assumption.

(c)

Create a forest plot of the model coefficients in your regression model. State your conclusions about the relationship between each variable in your regression and the predicted average number of contacts.

library(broom)

tidy(glm_7, conf.int = TRUE) %>% 
  ggplot(aes(y = term, x = estimate, xmin = conf.low, xmax = conf.high)) + 
  geom_point() + 
  geom_errorbar(width = .1) + labs(y = NULL) +
  scale_y_discrete(limits = rev)

Since Finland and UK include 0 within their confidence interval, those countries would be not significant when predicting the number of contacts. Germany shows negative relationship with number of contacts whereas increasing the number of participant from either Italy, Luxembourg, Netherlands, and Poland increases the number of contacts tends to increases as well. Therefore “country” is significant when predicting the number of contacts.
The “dayofweek” shows a positive relationship implying that the specific day would increase number of contacts. “dayofweek6” as can be seen is less impactful in predicting the number of contacts.
The “hh_size” also shows a positive relationship with the number of contacts showing that as the house size increase the participant tends to have more contacts, however the value is close to zero therefore it has a relatively smaller impact compared to other variables.
Even though, the part_age does not include 0 within its confidence interval showing that the variable is significant, it is noticeably closed to 0. This suggest that the “part_age” has negative relationship between the number of contacts and so as participant age getting order the number of contacts decreased, but its effect might be less substantial due to its small estimate.
The plot shows that “is_child_in_careOth” has negative relationship with the number of contacts implying that if the participant is an adult (not child), then the number of contacts tends to decrease. On the other hand, “is_child_in_care_Y” has positive relationship showing that if the participant is child attending day care or school the number of contacts tends to increase.
The “part_occupation” has negative relationships with the number of contacts demonstrating that the types occupation cause less number of contacts. For example, if the participant is belong to part_occupation_2: Skilled non-manual (UK) the number of contacts are decreased. The number of contacts tends to less decreased if the participant belong to part_occupation 6: upper employee (LU/FI) as its absolute coefficient is the smallest among part_occupation.
The “pertain_guardian” have negative relationships showing that if the diary is pertained by guardian the number of contacts tends to decreased.

DATA5002 Assignment 1

Seoyeon Park(z5248219)

Question 1

(a)

(b)

(c)

Question 2

(a)

(b)

(c)