Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning - LIRMM - Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier
Journal article, Information, Year: 2022

Abstract

Context: On a social coding platform such as GitHub, contributors frequently use the pull-request mechanism to submit their code changes to the reviewers of a given repository. In general, these code changes either add a new feature or fix an existing bug. However, this mechanism is distributed and allows different contributors to unintentionally submit similar pull-requests that perform similar development activities. Similar pull-requests may be reviewed in parallel by different reviewers, which wastes reviewing time and effort and complicates the collaboration process.
Objective: It is therefore useful to assign similar pull-requests to the same reviewer, who can then decide which pull-request to choose with less time and effort. In this article, we propose to group similar pull-requests into clusters so that each cluster is assigned to the same reviewer or the same reviewing team, which saves reviewing time and effort.
Method: We first extract descriptive textual information from the content of pull-requests. We then use the extracted information to compute similarities among pull-requests. Finally, machine learning algorithms (K-Means clustering and agglomerative hierarchical clustering) are used to group similar pull-requests together.
Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the proposed approach achieves promising results according to two well-known metrics in this area, precision and recall. Furthermore, it helps save reviewer time and effort.
Conclusion: According to the obtained results, the K-Means algorithm achieves average precision and recall of 94% and 91%, respectively, over all considered repositories, while agglomerative hierarchical clustering achieves average precision and recall of 93% and 98%, respectively. Moreover, the proposed approach saves on average between 67% and 91% of reviewing time and effort with the K-Means algorithm, and between 67% and 83% with the agglomerative hierarchical clustering algorithm.
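
The three steps named under Method (extract text, compute similarity, cluster) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the text features, the similarity measure, the linkage criterion, or any threshold, so the TF-IDF weighting, cosine similarity, average linkage, the 0.3 cut-off, and the example pull-request titles below are all assumptions chosen for illustration. Only the agglomerative variant is shown; K-Means is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption), so no weight is ever zero.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per item and repeatedly merges the most similar
    pair of clusters, stopping when the best similarity falls below
    `threshold` (a hypothetical stopping rule, not taken from the article).
    """
    sim = [[cosine(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (x, y) = max(
            (linkage(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical pull-request titles, invented for this example.
pull_requests = [
    "Fix null pointer crash when parsing empty config file",
    "Fix crash on empty config file in parser",
    "Add dark mode theme to settings page",
    "Add dark theme option to the settings page",
    "Update README installation instructions",
]

clusters = agglomerative_clusters(tfidf_vectors(pull_requests), threshold=0.3)
print(clusters)  # indices into pull_requests, grouped by similarity
```

On this toy input the two bug-fix titles and the two dark-theme titles end up in separate clusters and the README change stays alone, so each group could be routed to a single reviewer as the Objective describes.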
Main file
information-13-00073.pdf (654.89 KB)
Origin: Files produced by the author(s)

Dates and versions

lirmm-03586823, version 1 (24-02-2022)

Cite

Hamzeh Eyal-Salman, Zakarea Alshara, Abdelhak-Djamel Seriai. Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning. Information, 2022, 13 (2), pp.73-97. ⟨10.3390/info13020073⟩. ⟨lirmm-03586823⟩