Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning

Details: Written by Hamzeh Eyal Salman; Category: Information Technology; Published: 24 March 2024

Software engineering

Abstract:

Context: On social coding platforms such as GitHub, contributors use the pull-request mechanism to submit code changes to the reviewers of a repository. These changes typically add a new feature or fix an existing bug. Because the mechanism is distributed, different contributors can unintentionally submit similar pull-requests that perform similar development activities. Such pull-requests may then be reviewed in parallel by different reviewers, which wastes reviewing time and effort and complicates collaboration.

Objective: It is therefore useful to assign similar pull-requests to the same reviewer, who can then decide which one to accept with less time and effort. In this article, we propose grouping similar pull-requests into clusters so that each cluster is assigned to the same reviewer or reviewing team.

Method: We first extract descriptive textual information from the content of pull-requests. We then use the extracted information to compute similarities among pull-requests. Finally, machine learning algorithms (K-Means clustering and agglomerative hierarchical clustering) group similar pull-requests together.

Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the approach achieves promising precision and recall, the well-known metrics in this subject, and that it helps save reviewer time and effort.

Conclusion: According to the obtained results, K-Means achieves average precision and recall of 94% and 91%, respectively, over all considered repositories, while agglomerative hierarchical clustering achieves 93% and 98%, respectively. Moreover, the proposed approach saves between 67% and 91% of reviewing time and effort on average with K-Means and between 67% and 83% with agglomerative hierarchical clustering.
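
The three steps named under Method (extract text, compute similarities, cluster) can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say how the text is vectorized or how the clustering is configured, so the TF-IDF weighting, cosine similarity, average linkage, the 0.3 merge threshold, and the sample pull-request titles are all assumptions made for illustration. Only the agglomerative variant is shown; K-Means is omitted.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    # Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def agglomerative_clusters(vectors, threshold):
    # Average-linkage agglomerative clustering: start with one cluster per
    # item and repeatedly merge the most similar pair of clusters until no
    # pair reaches the (assumed) similarity threshold.
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        sim, a, b = max(
            (avg_sim(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if sim < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical pull-request titles, invented for this example.
pull_requests = [
    "Fix null pointer crash in login handler",
    "Fix crash in login handler when user is null",
    "Add dark mode theme to settings page",
    "Add dark mode option to the settings page",
    "Update README installation instructions",
]

clusters = agglomerative_clusters(tfidf_vectors(pull_requests), threshold=0.3)
for cluster in clusters:
    print([pull_requests[i] for i in cluster])
```

Each resulting cluster would then be routed to a single reviewer or reviewing team. Stopping at a similarity threshold rather than at a fixed number of clusters is a design choice of this sketch, made because the number of groups of similar pull-requests is usually not known in advance.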
