Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning

Hamzeh Eyal-Salman; Zakarea Alshara; Abdelhak-Djamel Seriai

doi:10.3390/info13020073

Journal article, Information, Year: 2022

Hamzeh Eyal-Salman

Role: Author
PersonId: 1126680
ORCID: 0000-0003-3258-7304

Mutah University [Jordan]

Zakarea Alshara

Role: Author
PersonId: 1225401
ORCID: 0000-0002-2727-6985

Jordan University of Science and Technology [Irbid, Jordan]

Abdelhak-Djamel Seriai

Role: Author
PersonId: 170191
IdHAL: abdelhak-djamel-seriai
ORCID: 0000-0003-1961-1410
IdRef: 059927712

Models And Reuse Engineering, Languages

Abstract

Context: On a social coding platform such as GitHub, contributors frequently use the pull-request mechanism to submit their code changes to the reviewers of a given repository. In general, these code changes either add a new feature or fix an existing bug. However, the mechanism is distributed, so different contributors may unintentionally submit similar pull-requests that perform similar development activities. Similar pull-requests may then be reviewed in parallel by different reviewers, which makes reviewing time and effort redundant and complicates the collaboration process. Objective: It is therefore useful to assign similar pull-requests to the same reviewer, who can then decide which pull-request to choose with less time and effort. In this article, we propose to group similar pull-requests into clusters so that each cluster is assigned to the same reviewer or the same reviewing team, which saves reviewing effort and time. Method: We first extract descriptive textual information from the content of pull-requests in order to link similar pull-requests together. We then use the extracted information to compute similarities among pull-requests. Finally, machine learning algorithms (K-Means clustering and agglomerative hierarchical clustering) are used to group similar pull-requests together. Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the proposed approach achieves promising results according to the well-known metrics in this area, precision and recall. Furthermore, it helps to save reviewer time and effort.
Conclusion: According to the obtained results, the K-Means algorithm achieves average precision and recall values of 94% and 91%, respectively, over all considered repositories, while agglomerative hierarchical clustering achieves average precision and recall values of 93% and 98%, respectively. Moreover, the proposed approach saves on average between 67% and 91% of reviewing time and effort with the K-Means algorithm, and between 67% and 83% with the agglomerative hierarchical clustering algorithm.
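The method summarized in the abstract (extract text, compute similarities, cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how text is weighted or compared, so TF-IDF weighting, cosine similarity, average linkage, and the stopping threshold of 0.2 are all assumptions made for illustration, and the pull-request titles are invented. Only the Python standard library is used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once the
    best available similarity falls below `threshold` (hypothetical value).
    """
    clusters = [[i] for i in range(len(vectors))]
    sim = [[cosine(u, v) for v in vectors] for u in vectors]

    def linkage(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical pull-request titles, invented for illustration only.
pull_requests = [
    "Fix null pointer crash in login handler",
    "Fix crash in login handler when user is null",
    "Add dark mode theme to settings page",
    "Add dark theme option to settings page",
]
clusters = agglomerative_clusters(tfidf_vectors(pull_requests), threshold=0.2)
print(sorted(clusters))  # [[0, 1], [2, 3]]
```

Each resulting cluster would then be routed to a single reviewer or reviewing team. A K-Means variant would instead fix the number of clusters in advance and iterate over centroids of the same vectors; choosing that number is a separate design decision not covered by this sketch.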

Keywords

pull-requests; similarity; GitHub; machine learning; code changes; review

Domains

Software engineering [cs.SE]

Main file

information-13-00073.pdf (654.89 KB)

Origin: Files produced by the author(s)

https://hal-lirmm.ccsd.cnrs.fr/lirmm-03586823

Submitted on: Thursday, February 24, 2022, 10:42:03

Last modified on: Thursday, November 7, 2024, 16:14:03

Long-term archiving on: Wednesday, May 25, 2022, 18:20:18

Dates and versions

lirmm-03586823, version 1 (24-02-2022)

License

Attribution

Identifiers

HAL Id: lirmm-03586823, version 1
DOI: 10.3390/info13020073

Cite

Hamzeh Eyal-Salman, Zakarea Alshara, Abdelhak-Djamel Seriai. Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning. Information, 2022, 13 (2), pp.73-97. ⟨10.3390/info13020073⟩. ⟨lirmm-03586823⟩

Collections

CNRS MAREL LIRMM UNIV-MONTPELLIER
