Most good empirical software engineering papers that contain a study follow the same structure for its presentation. As far as I know, this structure was not invented by a single researcher, but developed gradually over the course of many publications.
Professional readers expect your case study to follow this structure, too. The audience that really matters for your publication—your thesis supervisor, his PhD advisor or program committee members—all are professional readers.
The goal of this article is to describe this structure: the basic building blocks of thesis chapters or paper sections that make up case study presentations. It is meant as an introduction and thus necessarily skips details. For further reading, this article contains links at the end.
The typical structure comprises these sections:
- Research questions
- Study objects
- Study design
- Procedure
- Results & Interpretation
- Discussion
- Threats to Validity
As a reader, I expect each section to answer a specific set of questions. In the following, I describe the gist of each section, its set of questions and common mistakes.
To make the sections more tangible, I use part of a study from one of our papers. The study investigates inconsistencies in code clones.
Research Questions
This section states the questions that the study aims to answer and their rationale. It should contain:
- What the questions are. In my paper, RQ 1 is “Are clones changed inconsistently?”
- Why the research questions are relevant.
A frequent mistake is missing rationale. In such papers, the motivation behind the research question often remains unclear or unconvincing.
Some background on the example paper: code clones are duplicated pieces of source code in a software system. Clones are typically created by copy & paste. They hinder software maintenance, since changes must often be made to all clone instances. If a clone gets forgotten during such a change, the code becomes inconsistent. This inconsistency can be a bug.
What I wanted to investigate with my study was how big a problem this is in practice. On the one hand, I had seen some instances of inconsistent clones that looked suspiciously like bugs. On the other hand, I had no idea how frequently this occurred and whether it really was problematic in practice. My study goal was to quantify this by analyzing clones (and their inconsistencies) in real systems.
The rationale of the first research question was to understand if inconsistent changes to clones happen at all, and how often. If they are very rare, they probably do not deserve further investigation (which is performed by the later research questions in the paper).
Study Objects
This section outlines the study objects (e.g. software systems), which the study analyzes to answer its research questions. It should contain:
- The names of the study objects and their characteristics (those properties that are relevant for the study). In the example, the study objects were the 5 systems that I searched for clones. The relevant characteristics comprise programming language, size, age, number of developers and a short description of their functionality.
- Why (and maybe how) those objects were chosen. This is relevant, since the choice can influence the validity of the study results. In the example, a large number of study objects (ideally selected at random from a large pool of potential study objects) would increase the generalizability of the study results.
In the clone paper, however, I needed to interview the systems’ developers for later questions. I thus had to rely on our industry contacts to get hold of these developers. This limited my choices and thus potentially affects the generalizability of the results (which is mentioned in the threats to validity section).
A frequent mistake is to not mention why those objects were chosen and what the consequences of the choice are. As a reader, this makes me wonder if the selection was manipulated to better produce the answers the author was looking for.
If a study involves data from industry, the study object names are often anonymized (e.g. replaced by A, B, C, …). As a reader, I don’t care about this, since the names of proprietary industrial systems are meaningless to me anyway. For the authors, however, it makes it much easier to get clearance to publish these results.
Study Design
This section describes how the study, using the information from the study objects, attempts to answer the research questions.
For the clone study, I computed the percentage of inconsistent clones among all clones. For this, I defined two sets:
- C: the set of consistent clone groups. The clones in each of these groups are consistent (i.e. they contain no differences or only small ones, like renamed variables).
- IC: the set of inconsistent clone groups, i.e. clone groups with substantial differences between clones, such as missing statements.
As the answer to the research question, I computed the inconsistent clone ratio as |IC| / (|C| + |IC|), i.e. the share of inconsistent clone groups among all clone groups. Intuitively, it denotes the probability that a clone group in the system contains at least one inconsistency.
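To make the metric concrete, here is a minimal Python sketch of the share of inconsistent clone groups among all detected clone groups. It assumes that C and IC are disjoint and together cover every detected clone group. The `CloneGroup` type, its `is_inconsistent` flag and the sample data are hypothetical and only serve as illustration; they are not taken from the paper or its tooling.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CloneGroup:
    """Hypothetical record for one detected clone group."""
    group_id: str
    # True if the clones in this group differ substantially
    # (e.g. a missing statement), i.e. the group belongs to IC.
    is_inconsistent: bool


def inconsistent_clone_ratio(groups: Iterable[CloneGroup]) -> float:
    """Return |IC| / (|C| + |IC|) for the given clone groups."""
    groups = list(groups)
    if not groups:
        raise ValueError("no clone groups given, ratio is undefined")
    inconsistent = sum(1 for g in groups if g.is_inconsistent)
    return inconsistent / len(groups)


# Made-up example data: two consistent and two inconsistent groups.
sample_groups = [
    CloneGroup("g1", is_inconsistent=False),
    CloneGroup("g2", is_inconsistent=True),
    CloneGroup("g3", is_inconsistent=True),
    CloneGroup("g4", is_inconsistent=False),
]

ratio = inconsistent_clone_ratio(sample_groups)
print(f"{ratio:.0%}")  # prints "50%"
```

Raising an error for an empty input is a design choice of this sketch: a system without detected clones has no meaningful ratio, and silently returning 0 would hide that.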
A common mistake is to interleave study design, procedure and implementation details.
Procedure
This section describes the nitty-gritty details required to implement the study design in reality. In principle, they could also be included directly in the description of the study design. However, it is easier for the reader to first understand the general idea, and then the details.
For the clone study, this section states detection parameters (like minimal clone length and number of allowed differences between clones). It also treats handling of false positives, generated code and overlapping clone groups.
Results & Interpretation
This section describes the results and interprets them with respect to the research questions. Since there is often a lot of data, this section should guide the reader through the results. In studies with large amounts of data, the section is often easier to read if the description of the data is separated from its interpretation.
In the example, the paper presents the results for each study object and then the aggregated ratio. On average, 52% of the clone groups contained inconsistencies. The paper thus answers the question positively: yes, clones are changed inconsistently.
A common mistake is to mix the results with the discussion. This makes it harder for the reader to separate backed-up results from speculation.
Discussion
This section interprets the results beyond the scope of the research questions. It can, e.g., contain implications for software development.
The clone paper concludes (based on the question presented above and further ones) that clones are a threat to program correctness, implying that their proper management deserves more attention.
Threats to Validity
This section lists all threats, i.e. reasons why the study results could be wrong. Ideally, it then treats every single threat and describes what you did to make sure that this threat does not invalidate your study results.
Threats to validity are often classified into internal and external threats.
Internal threats are reasons why the results could be invalid for your study objects. In the example, the parameter values of the clone detector have a strong impact on the detected clones. The section states that we mitigated the threat through a pre-study we performed in order to validate the chosen parameter values.
(To be honest, this is a weak mitigation. What it really says is that we tinkered with the values until they felt good and then did the study. A stronger mitigation would be to also perform the study with different parameter values and investigate whether the general results still hold. Since this distracts from the main study, such back-up studies are often only described in a much abbreviated fashion in the threats section itself.)
External threats are reasons why the results encountered for the study objects might not be transferable to other objects. In the example, the way we chose the study objects (through our personal network) might bias our results. To mitigate this threat, we at least chose systems that had different characteristics, such as programming language, development contractor and age.
The most common mistake is to ignore threats entirely. Much better (but still improvable) is to state a threat without giving a mitigation or an estimation of its severity.
The case study structure described in this article can be used in two different decomposition styles. The most common one is described in this article. It orders by section first and by research question second:
- Research questions
  1.1 RQ 1 …
  1.2 RQ 2 …
- Study objects
  2.1 For RQ 1: …
  2.2 For RQ 2: …
- Study design
  3.1 For RQ 1: …
  3.2 For RQ 2: …
The most frequent alternative, however, is to order by research question first and by section second:
- RQ 1
  1.1 Research question 1 …
  1.2 Study Objects for RQ 1 …
  1.3 Study Design for RQ 1 …
- RQ 2
  2.1 Research question 2 …
  2.2 Study Objects for RQ 2 …
  2.3 Study Design for RQ 2 …
Both decomposition styles have advantages and drawbacks. I use these heuristics to select the decomposition style:
- By study sections: when the study objects are the same and the design and procedure are similar or build upon each other.
This is the case in the clone paper example. Research questions two and three ask whether the inconsistencies between clones are unintentional, and if so, whether they represent a fault. RQ n thus builds upon the results of RQ n-1. Since the study sections share so much, describing them in isolation would create a lot of redundancy. They are thus easier to read all at once. Decomposition by study section facilitates this.
- By research questions: when each study has its own study objects, design and procedure.
In another paper we wrote, the study objects, design and procedure of research questions one and three have nothing in common. Since there is little synergy between them, it is easier to read a complete study, from question to interpretation of results, before reading the next one.
Apart from the above examples, there are mixed cases as well (where some RQs share objects and design, but others in the same paper don’t). For them, simply choose the decomposition style that feels right, but stick to it for the entire study description. Don’t mix decomposition styles, since this confuses the reader.
In my experience, you only really find out whether a style feels right once you have written the study down, often twice, once in each decomposition style. This is tedious, but it pays off, since a suitable decomposition style strongly increases the readability of your study.
- Guidelines for conducting and reporting case study research in software engineering, by Per Runeson & Martin Höst.
- Case Study Research: Design and Methods, by Robert K. Yin.
Thanks to Rainer Koschke and Stefan Wagner for literature suggestions and to Daniela Steidl for reading drafts of this.