Citation

R. Caldas, J. A. Pinera Garcia, M. Schiopu, P. Pelliccione, G. Rodrigues, and T. Berger, “Runtime Verification and Field-based Testing for ROS-based Robotic Systems,” IEEE Transactions on Software Engineering, vol. 50, no. 10, pp. 2544–2567, 2024. DOI: 10.1109/TSE.2024.3444697

BibTeX

@ARTICLE{caldas2024guidelines,
  author={Caldas, Ricardo and Pi{\~n}era Garc{\'i}a, Juan Antonio and Schiopu, Matei and Pelliccione, Patrizio and Rodrigues, Gena{\'i}na and Berger, Thorsten},
  journal={IEEE Transactions on Software Engineering},
  title={Runtime Verification and Field-based Testing for ROS-based Robotic Systems},
  year={2024},
  volume={50},
  number={10},
  pages={2544--2567},
  keywords={Robots;Testing;Runtime;Guidelines;Software;Quality assurance;Robot kinematics;Field-based Testing;Runtime Verification;Robotic Systems;Robot Operating System (ROS);Uncertainty;Guidelines},
  doi={10.1109/TSE.2024.3444697}
}

Authors

Ricardo Caldas (primary contact)

PhD Candidate, Chalmers University of Technology, Interaction Design and Software Engineering, Computer Science and Engineering

Contact Information

Email: ricardo.caldas@gssi.it


Juan Antonio Pinera Garcia

PhD Candidate, Gran Sasso Science Institute


Matei Schiopu

Assistant Researcher, Chalmers University of Technology


Patrizio Pelliccione

Director of the Computer Science area and Professor in Computer Science and Software Engineering at GSSI (Gran Sasso Science Institute), Italy.


Genaína Nunes Rodrigues

Associate Professor of Computer Science, University of Brasilia


Thorsten Berger

Chair of Software Engineering / Professor Ruhr University Bochum, Germany


Evaluation

This section presents the outcomes of our validation with industry and academia experts, conducted through questionnaires available in the replication package section of the website. Our aim is to assess whether the guidelines synthesized from literature and insights from open-source ROS repositories are perceived as useful, clear, and applicable by developers and QA teams testing and verifying ROS code. We formulate three hypotheses per guideline:

  • H1. Overall, the guideline is useful. (Usefulness)
  • H2. The formulation of the guideline is clear. (Clarity)
  • H3. The guideline is applicable to ROS-based systems. (Applicability)

Applicability refers to the extent to which respondents consider that the guideline could be directly applied to the ROS-based systems they have worked with, whereas usefulness pertains to ROS-based systems in general.

Summary of Respondent Profiles

| ID | Time in robotics | ROS experience | Organization | Domain | Role | Individuals |
|----|------------------|----------------|--------------|--------|------|-------------|
| P1 | >10y | Contributed to ROS packages | Academia and Industry | Service Robotics | Developer | 18 |
| P2 | 3-5y | Worked on applications using ROS | Industry | Industrial automation | Developer | 17 |
| P3 | 1-3y | Contributed to ROS packages | Academia | General-purpose | Developer | 6 |
| P4 | 1-3y | Worked on applications using ROS | Academia and Industry | General-purpose | QA | 3 |
| P5 | 3-5y | Worked on applications using ROS | Academia and Independent Groups | Marine Robotics | QA | 11 |

We received 55 questionnaire responses, 33 from developers and 22 from QA teams. The responses were obtained through targeted emails and posts on the ROS Discourse forum. The questionnaire was tailored to test the three hypotheses; respondents were also asked about their experience in robotics, their ROS experience, their organizational background, and the robotics domains they have worked in.

Likert Plot for Questionnaire Answers

Figure 1: Likert plot for the questionnaire answers

The Likert plot provides an overview of the questionnaire responses. While most votes lean towards ‘Agree,’ some guidelines received ‘Strongly Disagree’ votes. We focus on analyzing disagreements, as they may indicate areas for further research.

Figure 2: Boxplots giving a bird’s-eye view of the results

In addition to the Likert plot, we conducted statistical analysis using boxplots to visualize whether the hypotheses hold for each guideline. Overall, developers and QA teams agree that the synthesized guidelines are applicable, useful, and clear.

Statistical Significance

Figure 3: Wilcoxon one-sample test for statistical significance (hypothesis: μ ≥ 0)

We tested the hypotheses for statistical significance using the one-sample Wilcoxon test. The tests confirm that the hypotheses of applicability, clarity, and usefulness hold with statistical significance for all guidelines except CI2, MTA2, and PE2.
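As a rough illustration of the test, the sketch below implements a one-sided, one-sample Wilcoxon signed-rank test with the normal approximation in plain Python. It assumes that Likert answers are coded from -2 (Strongly Disagree) to 2 (Strongly Agree) and tested against 0, in line with the hypothesis stated in Figure 3. The coding, the function name, and the sample answers are assumptions made for illustration; they are not taken from the scripts in the replication package.

```python
import math
from collections import Counter


def wilcoxon_one_sample(scores, mu0=0.0):
    """One-sided one-sample Wilcoxon signed-rank test (H1: location > mu0).

    Uses the normal approximation with a tie correction and no continuity
    correction. Returns (w_plus, z, p_value, n_nonzero).
    """
    # Differences from the hypothesised location; zeros carry no sign and are dropped.
    diffs = [s - mu0 for s in scores if s != mu0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all observations equal mu0")

    # Rank the absolute differences, giving tied values their average rank.
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # Test statistic: sum of the ranks that belong to positive differences.
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)

    # Mean and tie-corrected variance of W+ under the null hypothesis.
    mean = n * (n + 1) / 4
    tie_term = sum(t**3 - t for t in Counter(abs_sorted).values()) / 48
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term

    z = (w_plus - mean) / math.sqrt(var)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail, P(Z >= z)
    return w_plus, z, p_value, n


if __name__ == "__main__":
    # Hypothetical answers for one guideline and one attribute.
    answers = [2, 1, 1, 0, 1, 2, -1, 1, 2, 1]
    w_plus, z, p, n = wilcoxon_one_sample(answers)
    print(f"W+ = {w_plus}, z = {z:.3f}, p = {p:.4f}, n = {n}")
```

With few respondents per guideline, an exact test is usually preferable to the normal approximation; the approximation is used here only to keep the sketch free of external dependencies.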

Practical Significance

Figure 4: Effect size for one sample Wilcoxon test for practical significance

Practical significance was assessed using the effect size calculated from the one-sample Wilcoxon signed-rank test. Guidelines with small effect sizes (CI2, MTA2, and PE2) required further data to strengthen conclusions, whereas moderate and large effect sizes support the practical significance of the remaining guidelines.
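The page does not state which effect-size formula or cut-offs were used. A minimal sketch, assuming the commonly used r = |z| / sqrt(n) with the conventional 0.1, 0.3, and 0.5 cut-offs for small, moderate, and large effects, could look as follows. The function names and the thresholds are assumptions for illustration.

```python
import math


def wilcoxon_effect_size(z, n):
    """Effect size r = |z| / sqrt(n) for a Wilcoxon signed-rank test.

    z is the standardised test statistic and n the number of observations.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return abs(z) / math.sqrt(n)


def label_effect(r, small=0.1, moderate=0.3, large=0.5):
    """Map r to a qualitative label; the cut-offs are assumed conventions."""
    if r >= large:
        return "large"
    if r >= moderate:
        return "moderate"
    if r >= small:
        return "small"
    return "negligible"


if __name__ == "__main__":
    # Hypothetical z statistic and sample size for one guideline.
    r = wilcoxon_effect_size(z=1.2, n=33)
    print(f"r = {r:.2f} ({label_effect(r)})")
```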

Finally, we held follow-up discussions with respondents who disagreed with the usefulness or applicability of CI2, MTA2, and PE2, and revised those guidelines accordingly.

Replication Package

Would you like to replicate our study? This page takes you through the steps, following the methodology we used. We understand that some steps are hard or even impossible to replicate, so we provide the artifacts (protocols, scripts, spreadsheets with results) generated during our study to help you follow along. Please contact us if you have any questions.

Before you start, click here to download the replication package as a .zip.

We followed design science to synthesize the guidelines. We guide you through each cycle of the design science process in Fig. 5.

Figure 5: Activities for synthesizing guidelines according to design science

First Cycle: Survey Elicitation and Terminology

The first cycle is devoted to establishing awareness of the state-of-the-art and building the terminology used in the study.

We elicited important surveys in runtime verification and field-based testing:

  • Afzal, Afsoon, et al. "A study on challenges of testing robotic systems." 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST). IEEE, 2020.
  • Albonico, Michel, et al. "Software engineering research on the Robot Operating System: A systematic mapping study." Journal of Systems and Software 197 (2023): 111574.
  • Bertolino, Antonia, et al. "A survey of field-based testing techniques." ACM Computing Surveys (CSUR) 54.5 (2021): 1-39.
  • Falcone, Yliès, et al. "A taxonomy for classifying runtime verification tools." International Journal on Software Tools for Technology Transfer 23.2 (2021): 255-284.
  • Garousi, Vahid, Michael Felderer, and Feyza Nur Kılıçaslan. "A survey on software testability." Information and Software Technology 108 (2019): 35-64.
  • Luckcuck, Matt, et al. "Formal specification and verification of autonomous robotic systems: A survey." ACM Computing Surveys (CSUR) 52.5 (2019): 1-41.
  • Malavolta, Ivano, et al. "Mining guidelines for architecting robotics software." Journal of Systems and Software 178 (2021): 110969.

From the surveys, we extracted the following terminology. We present it as search strings, since the terminology was intended to be used in a systematic literature review (in Cycle 2):

IEEE Xplore Search String = ("ros" OR "robotic operating system" OR "robot operating system") AND ("runtime verification" OR "run-time verification" OR "runtime assurance" OR "run-time assurance" OR "online verification" OR "on-line verification" OR "runtime monitoring" OR "run-time monitoring" OR "runtime testing" OR "run-time testing" OR "online testing" OR "on-line testing" OR "field-based testing" OR "field testing" OR "in-vivo testing")
Scopus Search String = TITLE-ABS-KEY ( ( ros OR "robotic operating system" OR "robot operating system" ) AND ( "runtime verification" OR "run-time verification" OR "runtime assurance" OR "run-time assurance" OR "online verification" OR "on-line verification" OR "runtime monitoring" OR "run-time monitoring" OR "runtime testing" OR "run-time testing" OR "online testing" OR "on-line testing" OR "field-based testing" OR "field testing" OR "in-vivo testing"))
ACM DL Search String = Fulltext:(("ros" OR "robotic operating system" OR "robot operating system") AND ("runtime verification" OR "run-time verification" OR "runtime assurance" OR "run-time assurance" OR "online verification" OR "on-line verification" OR "runtime monitoring" OR "run-time monitoring" OR "runtime testing" OR "run-time testing" OR "online testing" OR "on-line testing" OR "field-based testing" OR "field testing" OR "in-vivo testing"))
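Because the three strings share the same two term groups and differ only in their database-specific wrapper, they can be generated from a single term list. The sketch below shows one way to do this; the function names are hypothetical, and the output uses plain ASCII hyphens so that it can be pasted into a search box.

```python
# Term groups taken from the search strings above.
ROS_TERMS = ["ros", "robotic operating system", "robot operating system"]

TECHNIQUE_TERMS = [
    "runtime verification", "run-time verification",
    "runtime assurance", "run-time assurance",
    "online verification", "on-line verification",
    "runtime monitoring", "run-time monitoring",
    "runtime testing", "run-time testing",
    "online testing", "on-line testing",
    "field-based testing", "field testing", "in-vivo testing",
]


def quoted_or(terms):
    """Join terms as quoted phrases separated by OR."""
    return " OR ".join(f'"{t}"' for t in terms)


def ieee_query():
    """Boolean query in the form used for IEEE Xplore."""
    return f"({quoted_or(ROS_TERMS)}) AND ({quoted_or(TECHNIQUE_TERMS)})"


def scopus_query():
    """Scopus form: title/abstract/keyword search, with the bare term ros unquoted."""
    ros = " OR ".join([ROS_TERMS[0]] + [f'"{t}"' for t in ROS_TERMS[1:]])
    return f"TITLE-ABS-KEY ( ( {ros} ) AND ( {quoted_or(TECHNIQUE_TERMS)}))"


def acm_query():
    """ACM DL form: the IEEE-style query wrapped in a full-text search."""
    return f"Fulltext:({ieee_query()})"


if __name__ == "__main__":
    for name, build in [("IEEE Xplore", ieee_query),
                        ("Scopus", scopus_query),
                        ("ACM DL", acm_query)]:
        print(f"{name} Search String = {build()}")
```

Keeping the terms in one list means that adding or removing a synonym updates all three queries consistently.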

The terminology was validated internally, through peer reviews and discussions with co-authors to consolidate the scope of our study.

Second Cycle: Literature Review, Guideline Templates, External Validation

The second cycle focused on performing a literature review, synthesizing the guideline template, and performing an external validation of the template.

Systematic Literature Review

We used the search strings to perform a systematic literature review (SLR) and identify the studies that discuss runtime verification and field-based testing for ROS applications. The SLR followed the activities in Fig. 6.

Figure 6: Activities for conducting the systematic literature review

We present precise descriptions for each activity in a protocol.

In short, we ran the search strings in IEEE Xplore, ACM Digital Library, and Scopus.

Then, we used the inclusion and exclusion criteria from Tab. 1 to select the relevant studies.

| ID | Description | Reasoning |
|----|-------------|-----------|
| IC_1 | ROS-based application, including ROS1 and ROS2. | Robot Operating System (ROS) is a must. |
| IC_2 | Explicit description (or reference to a peer-reviewed venue) of the verification, validation, or testing technique. | Papers that do not explicitly describe the employed technique may lead to ambiguous interpretation. |
| EC_1 | Tutorial, artifact, short paper (fewer than 5 pages), keynote, secondary study, roadmap, duplicated study. | Such papers do not provide enough contextual information. |
| EC_2 | Verification, validation, or testing techniques that do not make use of the ROS ecosystem. | Papers targeting V&V of non-ROS applications should be excluded. |
| EC_3 | Verification, validation, or testing techniques that address solely hardware properties. | Papers targeting V&V of hardware should be excluded. |
| EC_4 | Papers that are not in Computer Science or Engineering fields. | Papers from other fields should be excluded since they do not discuss means of engineering robotic systems. |

Table 1: Inclusion Criteria (IC) and Exclusion Criteria (EC)
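The criteria in Tab. 1 amount to a conjunction of checks per candidate study: both inclusion criteria must hold and no exclusion criterion may apply. The sketch below illustrates that logic. The `Candidate` record and its boolean fields are hypothetical stand-ins for the manual judgments recorded during screening, and EC_3 is read as "addresses solely hardware properties", following the Reasoning column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Screening judgments for one candidate study (hypothetical record)."""
    title: str
    ros_based: bool            # IC_1: ROS-based application (ROS1 or ROS2)
    technique_described: bool  # IC_2: technique explicitly described or referenced
    excluded_type: bool        # EC_1: tutorial, artifact, short paper, keynote, ...
    uses_ros_ecosystem: bool   # EC_2 applies when this is False
    hardware_only: bool        # EC_3: addresses solely hardware properties
    cs_or_engineering: bool    # EC_4 applies when this is False


def violated_criteria(c):
    """Return the IDs of the criteria that rule the candidate out."""
    checks = [
        ("IC_1", not c.ros_based),
        ("IC_2", not c.technique_described),
        ("EC_1", c.excluded_type),
        ("EC_2", not c.uses_ros_ecosystem),
        ("EC_3", c.hardware_only),
        ("EC_4", not c.cs_or_engineering),
    ]
    return [criterion for criterion, violated in checks if violated]


def select(candidates):
    """Keep the candidates that satisfy every IC and trigger no EC."""
    return [c for c in candidates if not violated_criteria(c)]


if __name__ == "__main__":
    # Placeholder entries, not studies from the actual review.
    pool = [
        Candidate("Study A", True, True, False, True, False, True),
        Candidate("Study B", True, False, True, True, False, True),
    ]
    for c in pool:
        print(c.title, violated_criteria(c) or "selected")
```

Returning the violated criterion IDs, rather than a bare yes/no, keeps the reason for each exclusion traceable.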

Finally, we selected the runtime verification and field-based testing approaches according to a classification scheme (see the spreadsheet).

Guidelines Template

We then identified the elements a guideline needs and designed a template (see Tab. 2). We used the template to synthesize our initial set of guideline sketches.

| Element | Description |
|---------|-------------|
| ID | Identifier used to facilitate tracing guidelines between groups. |
| Title | The Title summarizes an action that practitioners should follow to mitigate or avoid a recurring problem. |
| Context (WHEN) | The Context is a paragraph placing the guideline among a known set of conditions. This paragraph should delimit the scope in which the guideline is applicable. It should also introduce the conceptual terminology used in the guideline, which is defined by the conditions under which the guideline is valid. |
| Reason (WHY) | The Reason introduces the recurring problem faced by practitioners. It is intended to convey the relevance of the guideline to practitioners. |
| Suggestion (WHAT) | The Suggestion is a sentence or two introducing WHAT practitioners should do to mitigate the recurring problem. |
| Process (HOW) | The Process is a paragraph that carefully guides practitioners through HOW they can practice the guideline. This paragraph should reference tools that may help, concrete examples, or precise explanations by researchers or practitioners who have done something similar. |
| Exemplars | Exemplars are concrete descriptions of papers or artifacts that follow the guideline. An exemplar can be generic, such as an artifact, or rather specific, such as a model problem in testing ROS-based systems in the field. |
| Strengths | Strengths is a list of benefits that practitioners should consider when applying the guideline. |
| Weaknesses | Weaknesses is a list of technological or theoretical barriers that may slow the actual implementation of the guideline, either through undesired side effects or through scenarios in which applying the guideline might not lead to the desired effect. |

Table 2: Template for Guideline Specification
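For readers who want to keep guideline sketches in a machine-readable form, the template in Tab. 2 maps naturally onto a record type. The sketch below is one possible encoding; the class name, the `missing_elements` helper, and the placeholder values are assumptions for illustration and are not part of the replication package.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Guideline:
    """One guideline, with one field per element of the template."""
    id: str
    title: str
    context: str      # WHEN: conditions under which the guideline applies
    reason: str       # WHY: the recurring problem faced by practitioners
    suggestion: str   # WHAT: what practitioners should do
    process: str      # HOW: steps, tools, and references
    exemplars: List[str] = field(default_factory=list)
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)

    def missing_elements(self):
        """Names of the template elements that are still empty."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    # Placeholder sketch; the ID and the texts are not from the catalog.
    sketch = Guideline(
        id="X1",
        title="<action practitioners should follow>",
        context="<when the guideline applies>",
        reason="<recurring problem>",
        suggestion="<what to do>",
        process="",
    )
    print("Still to fill in:", sketch.missing_elements())
```

A completeness check of this kind can help when sketches are passed between groups for review, since every element of the template is expected to be filled before a guideline enters the catalog.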

External validation

Finally, we shared the sketches, through a questionnaire, with members of the Robotics Software Engineering Workshop. Three experts sent us their feedback.

Third Cycle: Repos Mining, Guidelines with Exemplars, Online Survey

The third cycle was devoted to consolidating the guidelines with concrete examples, clustering, organizing the guidelines, and validating with researchers and practitioners.

Consolidating the Guidelines

To consolidate the guidelines, we searched for papers specific to the topic of each guideline and mined open-source repositories. For the specific search, we used Google Scholar with terms extracted from the guideline sketches. The repository mining followed the procedure described in Fig. 7.

Figure 7: Data mining activities

For more details, check the GitHub repository with scripts and the protocol with details explaining the steps in Fig. 7.

This step resulted in the guidelines catalog. Each guideline follows the template, and all of them are documented in a separate guidelines.pdf file.

Online Survey

Finally, we validated the final set of guidelines with practitioners and researchers. To this end, we designed and distributed an online questionnaire.

The questionnaire was split into two versions, one targeting developers and one targeting quality assurance teams.

The results from the third cycle confirmed the guidelines’ relevance and helped us fine-tune them. The analysis that followed up the questionnaires is available in the Evaluation section above.