Towards Summarizing Code Snippets
Using Pre-Trained Transformers
When comprehending code, a helping hand may come from the natural language comments documenting it which,
unfortunately, are not always present. To support developers in such a scenario, several techniques have been
proposed to automatically generate natural language summaries for a given piece of code. Most recent approaches
exploit deep learning (DL) to automatically document classes or functions, while very little effort
has been devoted to more fine-grained documentation (e.g., documenting code snippets or even a single
statement).
Such a design choice is dictated by the availability of training data: for example, in the
case of Java, it is easy to create datasets composed of <method, javadoc> pairs that can be fed to DL
models to teach them how to summarize a method. Establishing such a comment-to-code link is instead non-trivial
when it comes to inner comments (i.e., comments within a function) documenting a few statements.
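To make the distinction concrete, consider the following Java method (a hypothetical example, not taken from our dataset). The Javadoc can be trivially paired with the method it precedes, yielding a <method, javadoc> instance, whereas the inner comment documents only the statements accumulating the sum, and the statements it refers to must be inferred:

    public class AverageExample {

        /**
         * Returns the average of the given values, or 0 if the array is empty.
         */
        public static double average(int[] values) {
            if (values.length == 0) {
                return 0;
            }
            // Accumulate the sum in a long before dividing, to avoid overflow.
            long sum = 0;
            for (int value : values) {
                sum += value;
            }
            return (double) sum / values.length;
        }
    }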
In this
work, we take all steps needed to train a DL model to automatically document code snippets.
First, we
manually built a dataset featuring 6.6k comments that have been (i) classified based on their type
(e.g., code summary, TODO), and (ii) linked to the code statements they document.
Second, we used such a
dataset to train a multi-task DL model that takes a comment as input and is able to (i) classify whether
it represents a "code summary" or not, and (ii) link it to the code statements it documents. Our trained
model identifies code summaries with 84% accuracy and links them to the documented lines of
code with recall and precision higher than 80%.
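As a minimal sketch of what these two predictions look like for a single comment (the record below is hypothetical and the line indices purely illustrative; the actual model is a fine-tuned pre-trained Transformer, not hand-written logic):

    import java.util.List;

    /** Hypothetical container for the two outputs emitted per analyzed comment. */
    public record CommentPrediction(
            String comment,            // the natural language comment under analysis
            boolean isCodeSummary,     // classification task: "code summary" vs. other comment types
            List<Integer> linkedLines  // linking task: indices of the statements it documents
    ) {
        public static void main(String[] args) {
            // The inner comment from the previous example, classified as a code
            // summary and linked to the statements accumulating the sum.
            CommentPrediction prediction = new CommentPrediction(
                    "Accumulate the sum in a long before dividing, to avoid overflow.",
                    true,
                    List.of(11, 12, 13, 14));
            System.out.println(prediction);
        }
    }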
Third, we ran this model on 10k open-source projects,
automatically identifying code summaries and linking them to the related documented code. This allowed
us to build a large-scale dataset of documented code snippets that we then used to train a new DL
model able to automatically document code snippets. A comparison with state-of-the-art baselines shows
the superiority of the proposed approach which, however, is still far from representing an accurate
solution for snippet summarization.
The GitHub repository containing all the released data and models is publicly available.