Before actually using a dynamic exercise in a course it should be thoroughly tested. While certain aspects require critical reading by a human, other aspects can be automatically stress-tested in R.
After a dynamic exercise has been developed, thorough testing is recommended before administering the exercise in a student assessment. Therefore, the R/exams package provides the function
stresstest_exercise() to aid teachers in this process. While the teachers still have to assess themselves how sensible/intelligible/appropriate/… the exercise is, the R function can help check:
- whether the data-generating process works without errors or infinite loops,
- how long a random draw from the exercise takes,
- what range the correct results fall into,
- whether there are patterns in the results that students could leverage for guessing the correct answer.
To do so, the function takes the exercise and compiles it hundreds or thousands of times. In each of the iterations the correct solution(s) and the run times are stored in a data frame along with all variables (numeric/integer/character of length 1) created by the exercise. This data frame can then be used to examine undesirable behavior of the exercise. More specifically, for some values of the variables the solutions might become extreme in some way (e.g., very small or very large etc.) or single-choice/multiple-choice answers might not be uniformly distributed.
In our experience such patterns are not rare in practice and our students are very good at picking them up, leveraging them for solving the exercises in exams.
Example: Calculating binomial probabilities
In order to exemplify the work flow, let’s consider a simple exercise about calculating a certain binomial probability:
|According to a recent survey
What is the probability (in percent) that
In R, the correct answer for this exercise can be computed as
100 * pbinom(k - 1, size = n, prob = p/100, lower.tail = FALSE) which we ask for in the exercise (rounded to two digits).
In the folllowing, we illustrate typical problems of parameterizing such an exercise: For an exercise with numeric answer we only need to sample the variables
k. For a single-choice version we also need a certain number of wrong answers (or “distractors”), comprised of typical errors and/or random numbers. For both types of exercises we first show a version that exhibits some undesirable properties and then proceed to an improved version of it.
||First attempt with poor parameter ranges.|
||As in #1 but with some parameter ranges improved.|
||Single-choice version based on #2 but with too many typical wrong solutions and poor random answers.|
||As in #3 but with improved sampling of both typical errors and random solutions.|
If you want to replicate the illustrations below you can easily download all four exercises from within R. Here, we show how to do so for the R/Markdown (
.Rmd) version of the exercise but equivalently you can also use the RLaTeX version (
for(i in paste0(extra, 1:4, ".Rmd")) download.file( paste0("http://www.R-exams.org/assets/posts/2019-07-10-stresstest/", i), i)
Numerical exercise: First attempt
In a first attempt we generate the parameters in the exercise as:
## success probability in percent (= pay with card) p
Let’s stress test this exercise:
By default this generates 100 random draws from the exercise with seeds from 1 to 100. The seeds are also printed in the R console, seperated by slashes. Therefore, it is easy to reproduce errors that might occur when running the stress test, i.e., just
i being the problematic iteration and subsequently run something like
exams2html() or run the code for generating the parameters manually.
Here, no errors occurred but further examination shows that parameters have clearly been too extreme:
The left panel shows the distribution run times which shows that this was fast without any problems. However, the distribution of solutions in the right panel shows that almost all solutions are extremely small. In fact, when accessing the solutions from the object
s1 and summarizing them we see that a large proportion was exactly zero (due to rounding).
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0000 0.0000 0.0200 0.5239 0.3100 4.7800
You might have already noticed the source of the problems when we presented the code above: the values for
p are extremely small (and even include 0) but also
k is rather large. But even if we didn’t notice this in the code directly we could have detected it visually by plotting the solution against the parameter variables from the exercise:
plot(s1, type = "solution")
Remark: In addition to the
solution the object
s1 contains the
runtime, and a
data.frame with all
objects of length 1 that were newly created within the exercise. Sometimes it is useful to explore these in more detail manually.
##  "seeds" "runtime" "objects" "solution"
##  "k" "n" "p" "sol"
Numerical exercise: Improved version
To fix the problems detected above, we increase the range for
p to be between 10 and 30 percent and reduce
k to values from 1 to 3, i.e., employ the following lines in the exercise:
Stress testing this modified exercise yields solutions with a much better spread:
Closer inspection shows that solutions can still become rather small:
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 3.31 20.31 50.08 47.41 72.52 94.80
For fine tuning we might be interested in finding out when the threshold of 5 percent probability is exceeded, depending on the variables
k. This can be done graphically via:
plot(s2, type = "solution", variables = c("p", "k"), threshold = 5)
So maybe we could increase the minimum for
p from 10 to 15 percent. In the following single-choice version of the exercise we will do so in order leave more room for incorrect answers below the correct solution.
Single-choice exercise: First attempt
Building on the numeric example above we now move on to set up a corresponding single-choice exercise. This is supported by more
learning management systems, e.g., for live voting or for written exams that are scanned automatically. Thus, in addition to the
correct solution we simply need to set up a certain number of distractors. Our idea is to use two distractors that are based on
typical errors students may make plus two random distractors.
Additionally, the code makes sure that we really do obtain five distinct numbers. Even if the probability of two distractors coinciding is very small, it might occur eventually. Finally, because the correct solution is always the first element in
ans we can set
exsolution in the meta-information to
10000 but then have to set
TRUE so that the answers are shuffled in each random draw.
So we are quite hopeful that our exercise will do ok in the stress test:
The runtimes are still ok (indicating that the
while loop was rarely used) as are the numeric solutions – and due to shuffling the position of the correct solution
in the answer list is completely random. However, the rank of the solution is not: the correct solution is never the smallest or the largest of the
answer options. The reasons for this surely have to be our typical errors:
plot(s3, type = "solution", variables = c("err1", "err2"))
err1 is always (much) larger than the correct solution and
err2 is always smaller. Of course, we could have realized this from the way we set up those typical errors but we didn’t have to – due to the stress test. Also, whether or not we consider this to be problem is a different matter. Possibly, if a certain error is very common it does not matter whether it is always larger or smaller than the correct solution.
Single-choice exercise: Improved version
Here, we continue to improve the exercise by randomly selecting only one of the two typical errors. Then, sometimes the error is smaller and sometimes larger than the correct solution.
Moreover, we leverage the function
num_to_schoice() (or equivalently
num2schoice()) to sample the random distractors.
Again, this is run in a
while() loop to make sure that potential errors are caught (where
NULL and issues a warning). This has a couple of desirable features:
num_to_schoice()first samples how many random solutions should be to the left and to the right of the correct solution. This makes sure that even if the correct solution is not in the middle of the admissible range (in our case: either close to 0 or to 100), it is not more likely to get greater vs. smaller distractors.
- We can set a minimum distance
deltabetween all answer options (correct solution and distractors) to make sure that answers are not too close.
- Shuffling is also carried out automatically so
exshuffledoes not have to be set.
Finally, the list of possible answers automatically gets LaTeX math markup
$...$ for nicer formatting in certain situations. Therefore, for consistency, the
binomial4 version of the exercise uses math markup for all numbers in the question.
The corresponding stress test looks sataisfactory now:
Of course, there is still at least one number that is either larger or smaller than the correct solution so that the correct solution takes rank 1 or 5 less often than ranks 2 to 4. But this seems to be a reasonable compromise.
When drawing many random versions from a certain exercise template, it is essential to thoroughly check that exercise before using it in real exams. Most importantly, of course, the authors should make sure that the story and setup is sensible, question and solution are carefully formulated, etc. But then the technical aspects should be checked as well. This should include checking whether the exercise can be rendered correctly into PDF via
exams2pdf(...) and into HTML via
exams2html(...) and/or possibly
exams2html(..., converter = "pandoc-mathjax") (to check different forms of math rendering). And when all of this has been checked,
stresstest_exercise() can help to find further problems and/or patterns that can be detected when making many random draws.
The problems in this tutorial could be detected with
n = 100 random draws. But it is possible that some edge cases occur only very rarely so that, dependending on the complexity of the data-generating process, it is often useful to use much higher values of
Note that in our
binomial* exercises it would also have been possible to set up all factorial combinations of input parameters, e.g., by
expand.grid(p = 15:30, n = 6:9, k = 1:3). Then we could have asssessed directly which of these lead to ok solutions and which are too extreme. However, when the data-generating process becomes more complex this might not be so easy. The random exploration as done by
num_to_schoice() is still straightforward, though.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…