Semantic Source Code Segmentation Using Small And Large Language Models

URL

https://arxiv.org/pdf/2507.08992.pdf

Stage

Normal Science

Paradigm framing

The research is situated within the software engineering paradigm that treats source code as a sequence amenable to analysis by language models. More specifically, it works within the sub-paradigm of applying transformer-based models, both small and large, to semantic code comprehension tasks. The paper extends this approach to code segmentation for R as used in the social sciences, a low-resource, domain-specific setting in which segmentation previously relied on less effective syntactic methods.

Highlights

The paper is classified as Normal Science because it does not propose a revolutionary framework; instead, it engages in "puzzle-solving" within an existing one. It accepts the paradigm of using language models for code analysis and applies it to a new problem: semantic segmentation of R code. The work creates a new dataset, systematically compares model architectures (small language models, SLMs, vs. large language models, LLMs) and methodologies (line-by-line vs. range-based), and refines how the paradigm is applied. This incremental, cumulative work, which demonstrates that fine-tuned smaller models can outperform larger ones, exemplifies progress within a stable scientific framework.
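
To make the line-by-line vs. range-based distinction concrete, here is a minimal Python sketch of the two output representations and how one converts to the other. The function names, the boolean "starts a new segment" flag, and the inclusive 1-indexed ranges are all assumptions chosen for illustration; the paper's actual label schema and prompts are not described in this summary.

```python
from typing import List, Tuple


def labels_to_ranges(starts: List[bool]) -> List[Tuple[int, int]]:
    """Convert per-line flags into inclusive, 1-indexed (start, end) ranges.

    starts[i] is True when line i+1 begins a new segment (hypothetical
    line-by-line format). The first line always opens a segment, even if
    its flag is False.
    """
    ranges: List[Tuple[int, int]] = []
    seg_start = None
    for line_no, is_start in enumerate(starts, start=1):
        if is_start or seg_start is None:
            if seg_start is not None:
                # Close the segment that ended on the previous line.
                ranges.append((seg_start, line_no - 1))
            seg_start = line_no
    if seg_start is not None:
        # Close the final open segment at the last line.
        ranges.append((seg_start, len(starts)))
    return ranges


def ranges_to_labels(ranges: List[Tuple[int, int]], n_lines: int) -> List[bool]:
    """Convert (start, end) ranges (hypothetical range-based format) back to per-line flags."""
    starts = [False] * n_lines
    for start, _end in ranges:
        starts[start - 1] = True
    return starts


# Five lines of code split into two segments: lines 1-3 and lines 4-5.
line_flags = [True, False, False, True, False]
segments = labels_to_ranges(line_flags)
```

In the line-by-line form, a model makes one decision per line; in the range-based form, it emits a short list of spans for the whole script. Because the two forms are interconvertible, as above, predictions from either approach can be scored against the same reference segmentation.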
