PI-Link: A Ground-Truth Dataset of Links Between Pull-Requests and Issues in GitHub - LIRMM - Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier
Journal article, IEEE Access, 2023

PI-Link: A Ground-Truth Dataset of Links Between Pull-Requests and Issues in GitHub

Abstract

GitHub hosts Git repositories and provides issue-tracking services that give software developers a better collaboration environment. Issues and Pull-Requests are frequently used in GitHub to discuss and review software requirements (new features, bugs, etc.) and software solutions (source code, test cases, etc.), respectively. The links between Issues and their corresponding Pull-Requests carry valuable information for tracking current development and for documenting knowledge for future development. Given a large number of links, this information can be used to train machine learning models for several purposes, such as feature location, bug prediction and localization, recommendation systems, and documentation generation. To the best of our knowledge, no dataset has been proposed as a ground truth of links between Issues and Pull-Requests. In this paper, we propose PI-Link, a new, significant, and reliable ground-truth dataset composed of 50,369 links that explicitly connect 34,732 Issues with 50,369 Pull-Requests. These links are automatically extracted from all 907,139 Android projects on GitHub created between January 1, 2011 and January 1, 2021. To better organize and store the collected data, we propose a metamodel based on the concepts of Issues and Pull-Requests. Moreover, we analyze the relationships between Issues and their linked Pull-Requests based on four features: their titles, bodies, labels, and comments. The selected features are analyzed in terms of their lengths and their similarities, using three lexical similarity metrics and one semantic similarity metric. The results show promising similarities between Issues and their linked Pull-Requests at both the lexical and semantic levels. In addition, some feature similarities are sensitive to text length, whereas others are sensitive to term frequency.
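To make the two ideas in the abstract concrete, the following is a minimal sketch of how an explicit Pull-Request-to-Issue link could be detected and how a lexical similarity between the two texts could be scored. It is not the authors' extraction pipeline: the abstract does not state how links were identified or which three lexical metrics were used. The sketch assumes links are expressed through GitHub's documented closing keywords (such as "fixes #12"), and it uses Jaccard similarity over word tokens as one common lexical metric. The names `extract_linked_issues`, `tokenize`, and `jaccard_similarity` are hypothetical.

```python
import re

# Assumption: explicit links are written with GitHub's closing keywords
# (close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved).
# Whether PI-Link relies on these keywords is not stated in the abstract.
CLOSING_KEYWORDS = r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)"
LINK_PATTERN = re.compile(rf"\b{CLOSING_KEYWORDS}\s+#(\d+)", re.IGNORECASE)


def extract_linked_issues(pr_text: str) -> list[int]:
    """Return issue numbers referenced with a closing keyword, in order, deduplicated."""
    linked: list[int] = []
    for match in LINK_PATTERN.finditer(pr_text):
        number = int(match.group(1))
        if number not in linked:
            linked.append(number)
    return linked


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty texts
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    pr_body = "Fixes #12 and closes #34. See also #56."
    print(extract_linked_issues(pr_body))
    print(jaccard_similarity("Crash on login screen", "Fix crash on login"))
```

A bare mention such as "See also #56" is deliberately ignored, since only keyword-qualified references count as explicit links in this sketch. Jaccard similarity ignores term frequency, so a frequency-aware metric would be needed to observe the term-frequency sensitivity mentioned in the abstract.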
Origin: Files produced by the author(s)

Dates and versions

lirmm-03980630, version 1 (09-02-2023)

Cite

Zakarea Alshara, Anas Shatnawi, Hamzeh Eyal-Salman, Abdelhak-Djamel Seriai, Maad Shatnawi. PI-Link: A Ground-Truth Dataset of Links Between Pull-Requests and Issues in GitHub. IEEE Access, 2023, 11, pp.697-710. ⟨10.1109/ACCESS.2022.3232982⟩. ⟨lirmm-03980630⟩