Are Large Language Models Capable of Domain-Specific Text Summarization?

draft of an unpublished paper

abstract: | Abstractive text summarization and the state-of-the-art models that perform it have gained considerable interest in recent years. These models, however, are usually benchmarked on general-purpose corpora, and their performance on domain-specific text summarization is yet to be determined. This paper presents an overview of representative large language models (LLMs), framed by the research gaps they address, and categorizes them by their usability guidelines and design principles. We also select three open-source text summarization datasets, chosen for their domain complexity, providing a unified framework for assessing LLMs in specialized domains. We evaluate contemporary models on the selected datasets while optimizing each model for its best performance according to its usability guidelines. Our experiments show that PEGASUS-X, a fine-tuned Efficient Transformer supporting a 16K-token context window, outperforms all other LLMs, including GPT-3.5 used via direct inference. Additionally, we observe that increasing the context window yields only marginal performance gains, and our results corroborate that bigger models do perform better. This study serves as a resource for researchers aiming to develop and compare large language models for domain-specific abstractive summarization.

title: Are Large Language Models Capable of Domain-Specific Text Summarization?

Introduction

Abstractive Text Summarization has been an active research area in recent years, and while state-of-the-art models can produce human-competitive summaries, they are most suitable for general-purpose text. Their performance deteriorates when tested on domain-specific text summarization tasks. One common explanation is a shift in dataset distribution: most large language models (LLMs) are pre-trained on general-purpose corpora such as C4 [@raffel2020exploring] and hence do not fully capture the fine-grained linguistic details and concepts of niche areas such as the medical, scientific, or legal domain.

Apart from domain-adaptation capabilities, an additional challenge in abstractive summarization is the large size of the associated documents [@Afzal2023challenges]. Much of the text that needs to be summarized is long, and basic summarization models cannot handle it because of input limits of 512 or 1024 tokens. A simple workaround has been to truncate the input text, but the resulting loss of context hinders model performance. At the time of writing, GPT-3.5 ^1 offers a 16K-token context window, and GPT-4 [@openai2023gpt4] up to a 32K context window. However, both of these models are closed-source and only accessible through an API.

Over the years, several models suitable for abstractive text summarization have been released, each following different design principles and usability guidelines. First came transformer-based Seq2Seq models such as T5 [@Raffel2020t5] and BART [@lewis2019bart], which follow the classic encoder-decoder architecture and are pre-trained on a large corpus before being fine-tuned on a smaller domain-specific dataset. Despite strong performance, these models suffer from the quadratic complexity of the self-attention matrix and are thus limited to handling only 512 or 1024 tokens, respectively. Early attempts to reduce this quadratic complexity appear in the Efficient Transformers [@tay2022efficient] family: the Longformer-Encoder-Decoder [@beltagy2020longformer] and BigBirdPegasus [@zaheer2021big] use a sparse self-attention matrix to scale the input length to 4096 tokens. More recent architectures such as LongT5 [@guo2022longt5] and PEGASUS-X [@phang2022investigating] follow the same approach and extend the input length to 16K tokens while largely preserving model performance.

While there is no denying the above models' abilities, their performance on domain-specific data, and their domain-adaptation capabilities in general, are yet to be evaluated. This paper evaluates one representative model of each class on its domain-specific text summarization capabilities while taking into account its usability guidelines, such as fine-tuning or direct inference. Moreover, given the recent surge in the number of LLMs, we consider several models that differ in model size, context size, and overall architecture. In general, vanilla Seq2Seq models such as BART, BigBirdPegasus, and PEGASUS-X are meant to be fine-tuned on a downstream task, whereas GPT-like models are more suitable for direct inference or in-context learning approaches [@brown2020language].

Additionally, we propose a set of datasets against which we evaluate our models, providing a standard benchmark for model performance on domain-specific summarization. We select these datasets based on their large document sizes and the specificity of the textual domains they represent. We elaborate on this benchmark in [sec-benchmark]{reference-type="autoref" reference="sec-benchmark"}. Through our experiments, we try to answer the following two research questions:

  1. Does allowing more text as input improve the quality of the generated summary for the domain-specific text summarization task?

  2. Are ChatGPT-like LLMs, which are not meant to be fine-tuned, able to perform competitively on a domain-specific summarization task?

Finally, we present a taxonomy in which we categorize text summarization models into standard Encoder-Decoder Transformer models, Efficient Transformers, and GPT-like models (LLMs) with billions of parameters. We compare the performance between these categories by experimenting with some representative models as explained in [sec-methodology]{reference-type="autoref" reference="sec-methodology"}.

Background

Quadratic Complexity of Transformers

Since the introduction of the original Transformer architecture by @vaswani2017attention, its attention mechanism has become a cornerstone for numerous state-of-the-art natural language processing models, as it represents a vast increase in performance and efficiency compared to traditional LSTMs [@10.1162/neco.1997.9.8.1735]. However, despite how successful these models have become, they retain quadratic complexity in the attention module, leading to severe computational challenges when working with the long documents that are pervasive in practice (e.g., books, research articles, and legal documents, among others).
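
To make the scaling issue concrete, the following minimal PyTorch sketch (our own illustration, with single-head attention and no batching) highlights where the quadratic term appears:

```python
import torch

def full_self_attention(x, w_q, w_k, w_v):
    """Vanilla scaled dot-product self-attention (single head, no batching).

    The (n, n) score matrix is the source of the quadratic memory and
    compute cost discussed above.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # each of shape (n, d)
    scores = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)  # shape (n, n): O(n^2)
    return torch.softmax(scores, dim=-1) @ v                   # shape (n, d)

n, d = 1024, 64
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = full_self_attention(x, w_q, w_k, w_v)  # doubling n quadruples the score matrix
```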

Large Language Models

The history of LLMs showcases a steady and remarkable evolution. Their capabilities have significantly expanded over time due to increased model size, larger datasets, and a plethora of algorithmic innovations. The groundbreaking work by @vaswani2017attention presented the Transformer model, whose self-attention mechanism enables models to consider long-range dependencies in text, initiating a new era in natural language processing. These models are trained with the simple objective of predicting the next word given a specific context, which, perhaps surprisingly, is sufficient to elicit impressive reasoning and writing abilities, provided that enough scale is in play.

This realization led to an escalating trend towards larger models. Models like GPT-4 [@openai2023gpt4] and PaLM [@chowdhery2022palm] expanded on the Transformer's capabilities, being trained on enormous text corpora and showcasing impressive performance on a broad set of natural language understanding and generation tasks. They demonstrated remarkable zero-shot and few-shot learning capabilities, leading to a paradigm shift in how we approach task-specific training: foregoing fine-tuned task-specific models and instead relying on a larger, general language model.

Efficient Transformers

On the other hand, the original Transformer architecture has trouble scaling to larger token counts due to the attention mechanism itself. To address this, researchers have proposed a plethora of efficient models that aim to reduce the quadratic complexity of attention to a (near-)linear one. These models can be roughly clustered [@tay2022efficient] by their optimization approaches, which differ quite substantially. Some noteworthy examples include making clever use of memory access patterns with FLASH attention [@dao2022flashattention], explicitly learning attention patterns ([@tay2020sparse; @kitaev2020reformer]), computing a low-rank representation of the attention matrix ([@choromanski2022rethinking; @wang2020linformer]), and computing fixed local and/or global attention patterns ([@zhu2021longshort; @beltagy2020longformer; @zaheer2021big]).

Naturally, these approaches differ in implementation complexity and hardware compute efficiency, making standalone evaluation of their performance troublesome. Regardless, published attempts at benchmarking these optimizations ([@zhang2022cab; @xiong2022simple]) show a key takeaway: local attention modules with fixed or almost fixed attention patterns, which compute attention only against adjacent tokens, have overshadowed some of the more complex attention patterns listed above that attempt to approximate the global attention matrix. This suggests that the information present in neighboring tokens is mostly sufficient to achieve strong performance on downstream tasks.
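
As an illustration of such fixed local attention patterns, the sketch below (our simplification; real models add global tokens and other refinements) builds a block-diagonal attention mask whose number of allowed entries grows linearly with the sequence length:

```python
import torch

def block_local_mask(n: int, block: int) -> torch.Tensor:
    """Boolean mask allowing each token to attend only within its own block,
    yielding n * block allowed entries instead of the n * n of full attention."""
    block_id = torch.arange(n) // block
    return block_id.unsqueeze(0) == block_id.unsqueeze(1)

mask = block_local_mask(n=4096, block=512)
print(mask.sum().item(), "allowed entries vs", 4096 ** 2, "for full attention")
```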

Furthermore, when considering contemporary models, we can effectively verify which optimizations have withstood the test of time by observing which of them persist in the efficient adaptations of previously well-received models such as PEGASUS-X [@phang2022investigating], BART-LS [@xiong2022adapting], and LongT5 [@guo2022longt5].

Not surprisingly, these "proven" optimizations coincide with most of the attention benchmark findings (see, for example, @phang2022investigating and its staggered block-wise attention mechanism similar to the aforementioned fixed attention patterns). Following this conclusion, our model selection, discussed in a later section, attempts to reflect the attention module timeline discussed here.

Transfer Learning

Since training a large language model requires substantial time and hardware resources, transfer learning allows us to reuse pre-trained model weights for specific tasks or domains instead of starting from scratch. This paper explores transfer learning from a domain-adaptation point of view. It can take the form of continued pre-training of the existing weights, fine-tuning a few selected layers for a new task or domain, or in-context learning, which tries to localize the relevant embedding space using the additional context from the prompt. Since we focus on domain-specific language, we further evaluate how model performance differs when the model must summarize documents with a lexical corpus different from that of its pre-training data, compared to the performance observed after fine-tuning. Moreover, recent work ([@hu2021lora; @mao2022unipelt]) has successfully explored more parameter-efficient methods of domain adaptation; we leave these as a future work direction and stick to the traditional approach, with the hyperparameters detailed in [sec:appendix-training-details]{reference-type="autoref" reference="sec:appendix-training-details"}.
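
As a minimal sketch of one transfer-learning variant mentioned above (fine-tuning only a few selected modules), the snippet below freezes a pre-trained encoder and leaves the decoder trainable; the checkpoint name and the choice of frozen modules are illustrative, not the configuration used in our experiments:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")  # illustrative checkpoint

# Reuse the pre-trained encoder weights as-is and adapt only the decoder.
for param in model.get_encoder().parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"fine-tuning {trainable:,} of {total:,} parameters")
```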

Related Work

Benchmarking LLMs is not a novel idea; however, after a thorough literature review, we found existing publications to be either too broad for our intended goal or focused on a parallel aspect. Furthermore, to the best of our knowledge, these models have not been benchmarked on a domain-specific text summarization task, so we intend to evaluate whether they are suited to users whose data is both domain-specific and long. This paper should provide a uniform overview of which models perform best in this scenario. Below, we discuss the publications that inspired our work.

Long Range Arena (LRA) [@tay2020long]. Widely accepted as a significant contribution, particularly given the growing number of efficient transformer models being introduced and the need to assess their performance. Although LRA is extensive, it only covers datasets related to general reasoning tasks, such as the hierarchical mathematical reasoning dataset ListOps [@nangia2018listops] and image classification on the CIFAR-10 dataset [@Krizhevsky2009LearningML]. Additionally, the benchmark only covers encoder-only models. While this is helpful for capturing the models' general ability to understand and generalize, it fails to address their language generation capabilities, which are our main concern.

SCROLLS [@shaham2022scrolls]. This benchmark, focusing on the overall natural language generation capabilities of LLMs, is the most similar to our research. It benchmarks the performance of Efficient Transformers on tasks similar to the ones used in pre-training, such as span corruption from the original T5 model [@Raffel2020t5]. While the SCROLLS paper covers a variety of tasks, we focus only on summarization, as it holds relevance for several industry-related use cases. Additionally, the SCROLLS benchmark evaluates only Efficient Transformers with long-range capabilities, whereas we also include the latest LLMs, which have surged in popularity.

An Examination of Large Language Models [@zhao2023survey]. A survey following the development and significance of large language models (LLMs). Tracing the progression from statistical language models to today's sophisticated LLMs, it aligns with the historical framing of our study. The survey emphasizes the unanticipated emergent capabilities of LLMs, such as in-context learning, which are absent in their smaller counterparts, aligning with our attempt to study how increased size improves summarization performance.

Benchmark {#sec-benchmark}

Datasets

To evaluate the performance of each model and how it varies with context length, we have selected three datasets based on the specificity of their domains and their overall characteristics. Below is a brief summary of each, along with a detailed length analysis in [tab:dataset_size_analysis]{reference-type="autoref" reference="tab:dataset_size_analysis"}.

arXiv [@cohan2018discourseaware]. Based on scientific articles from the arXiv platform, this dataset uses article abstracts as reference summaries, ensuring high-quality human-written targets. In addition, as the articles are long and come from a complex lexical domain, they are an ideal medium for the long-range context evaluation we intend to perform.

PubMed [@cohan2018discourseaware]. Like arXiv, PubMed focuses on the scientific domain, albeit with a much narrower scope, covering only medical publications. We include it in the benchmark despite it sharing the same structure as arXiv because we also aim to evaluate the models' domain-adaptation ability.

GovReport [@huang2021efficient]. Stemming from reports of government meetings, GovReport is an interesting addition to the benchmark as both the summaries and the original texts are significantly longer than in the other datasets, as observed in table 1{reference-type="ref" reference="tab:dataset_size_analysis"}. Moreover, per the authors, GovReport summaries source the relevant bigrams from a larger portion of the original text than the other datasets, further enabling our analysis of the relationship between model performance and encoding length.

::: {#tab:dataset_size_analysis}
  Dataset     # Doc     # W      # Sum W
  ----------- --------- -------- ---------
  arXiv       215,913   6029.9   272.7
  PubMed      133,215   3049.9   204.4
  GovReport   19,466    9409.4   553.4

: Dataset Size Analysis. Where relevant, averages are reported for each dataset. # Doc refers to the number of documents; # W and # Sum W refer to the average number of words in the original texts and summaries, respectively.
:::

Preprocessing and filtering

In order to ensure quality and consistency, we reproduce the SCROLLS [@shaham2022scrolls] preprocessing procedure by removing samples that meet any of the following criteria (a sketch of this filter follows the list):

  1. The summary text is longer than half of the original text.

  2. The original text is a thousand times longer than the summary.

  3. The summary exists verbatim in the original text.
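
A minimal word-level sketch of this filter is shown below; the exact tokenization used for the length ratios and the column names in the commented usage line are implementation choices rather than the authors' code:

```python
def keep_sample(document: str, summary: str) -> bool:
    """Return False if a sample matches any of the three criteria above."""
    doc_len, sum_len = len(document.split()), len(summary.split())
    if sum_len > doc_len / 2:          # 1. summary longer than half the original text
        return False
    if doc_len > 1000 * sum_len:       # 2. original text a thousand times longer
        return False
    if summary.strip() in document:    # 3. summary appears verbatim in the text
        return False
    return True

# e.g. dataset = dataset.filter(lambda ex: keep_sample(ex["article"], ex["abstract"]))
```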

Additionally, and as expected, this removed only a small number of samples given the datasets' inherent quality and the prefiltering performed by their authors. Further details on the number of removed samples can be found in table 2{reference-type="ref" reference="tab:preprocess_stats"}, where we can verify that at most 4% of the samples were removed, a small enough percentage that we argue the datasets' overall characteristics were maintained.

::: {#tab:preprocess_stats}
  Dataset     # Train Samples   # Del   % Del
  ----------- ----------------- ------- -------
  arXiv       203,037           6253    3%
  PubMed      119,924           4439    4%
  GovReport   17,517            63      0.4%

: Preprocessing statistics. We report the number of samples in the training split of each dataset before preprocessing, along with the number and percentage of samples removed.
:::

Models

As motivated in the background and related work sections, and given the large number of tokens in our datasets, we have chosen models able to handle these samples efficiently. Moreover, our selection is intended to reflect the release timeline of these architectures, illustrating progress and keeping the benchmark representative.

With these considerations in mind, we chose BART [@lewis2019bart] as a baseline model and compare it with BigBirdPegasus [@zaheer2021big] and PEGASUS-X [@phang2022investigating], both of which possess long-range capabilities. Additionally, we compare these representative models with state-of-the-art LLMs, including LLaMA [@touvron2023llama] and its derivative Vicuna, ChatGPT with GPT-3.5 [@openai2022gpt3.5] as the backbone, and lastly Falcon [@falcon40b]. Since these models differ substantially in size and architecture, we tried to optimize each model to be the best version of itself while following its usability guidelines. We discuss all these models in their respective subsections below and summarize them in [fig:llms-taxonomy]{reference-type="autoref" reference="fig:llms-taxonomy"}.

::: figure* image{width="\textwidth"} :::

BART

@lewis2019bart is a combination of two ideas and architectures that followed the original transformer proposal. For the encoder, it makes use of a BERT-style [@devlin2019bert] procedure, obtaining embeddings by reconstructing masked-out tokens in the input sentence. Meanwhile, the decoder segment is identical to the GPT-like decoder found in most LLMs.

Furthermore, due to its early popularity as a summarization model for short-form text like news articles in XSUM [@narayan2018dont], we felt it was natural to include it as a baseline for the evaluation of other contemporary models.

BigBirdPegasus

@zaheer2021big appears as a modification of the attention module proposed by @ainslie2020etc with the inclusion of randomness in the attention pattern, allowing select tokens to randomly attend to others. Furthermore, as demonstrated theoretically by the authors, this pattern serves as an approximation to the full attention matrix while preserving linearity with respect to the input size.

Moreover, the model itself is akin to a Pegasus model; the differentiating factor is the special attention module introduced here. We include BigBirdPegasus because it was one of the first models in the Efficient Transformer class to claim state-of-the-art results at the time of publication.

PegasusX

@phang2022investigating perform an extensive investigation of how to best adapt transformer models to long-sequence data. Among other questions, they investigate whether adaptation is more successful when performing additional pretraining on long documents, when using long documents exclusively for pretraining, or when disregarding them entirely until fine-tuning on downstream tasks; they find that these models benefit from further pretraining even when long documents constitute only a relatively small portion of the training samples.

Furthermore, the authors suggest a variation of the local attention pattern discussed earlier: by shifting the block-wise attention by half a block in every other layer, they can effectively introduce dependencies between blocks that would otherwise be self-contained, without increasing implementation complexity. Together with the global tokens, this attention architecture allows the model to perform competitively in both short- and long-sequence summarization.
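
The half-block stagger can be illustrated with the following simplified mask (our reading of the scheme, ignoring global tokens and the exact padding details):

```python
import torch

def staggered_block_mask(n: int, block: int, layer_idx: int) -> torch.Tensor:
    """Block-local mask whose block boundaries are shifted by half a block
    on every other layer, letting neighbouring blocks exchange information
    across layers."""
    shift = block // 2 if layer_idx % 2 else 0
    block_id = (torch.arange(n) + shift) // block
    return block_id.unsqueeze(0) == block_id.unsqueeze(1)

even_layer = staggered_block_mask(4096, 512, layer_idx=0)
odd_layer = staggered_block_mask(4096, 512, layer_idx=1)   # boundaries offset by 256 tokens
```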

GPT-3.5

A major revelation in the current LLM landscape is the instruction fine-tuning approach that led to the explosion in popularity of the ChatGPT^2 platform and its predecessor, InstructGPT [@ouyang2022training]. By leveraging Reinforcement Learning from Human Feedback (RLHF), as introduced in @ziegler2020finetuning, these models can follow arbitrary instructions, making them suitable for a downstream summarization task. Nevertheless, the base model has a significant performance bottleneck in its context length, encoding only up to 4K tokens, although a 16K-token variant is also available.

In this publication, we are using the version based on GPT-3.5, since we have not been given access to the larger and more powerful GPT-4 version. Although the architecture of this model is private and we cannot accurately compare it to models of the same size, we felt that its inclusion in our evaluation suite is natural as it represents the best contemporary capabilities of (assumed) reasonably sized models.

LLaMa and Derivatives

The LLaMa [@touvron2023llama] family of language models was introduced as a competing foundational LLM to the GPT family. We provide evaluation data on the 7 and 13 billion parameter versions to demonstrate how summarization performance varies across model sizes.

Moreover, a direct comparison to GPT-3.5 and the remaining Seq2Seq models would be unfair given the lack of any instruction fine-tuning on the LLaMa models. To this end, we also evaluate Vicuna [@vicuna2023], a model derived from LLaMa by fine-tuning it on data collected from user conversations with the ChatGPT platform, a method that has proven remarkably effective at instruction fine-tuning. Other reasonable options for instruction-fine-tuned LLaMa derivatives would be Alpaca [@alpaca] and WizardLM [@xu2023wizardlm], which are derived from different fine-tuning datasets. We choose Vicuna since it promises better performance on reasoning benchmarks such as MMLU [@hendrycks2021measuring], HellaSwag [@zellers2019hellaswag], and the AI2 Reasoning Challenge [@clark2018think].

Also, as with the model above, LLaMa is limited in context, handling only up to 2K tokens, which severely handicaps it in a long-document summarization setting.

Falcon

Falcon-40B [@falcon40b] is a newer entry into the LLM space. It does not bring breakthrough innovations compared to LLaMa; however, it demonstrates impressive comprehension abilities, even outperforming LLaMa's 65B version on the benchmarks described above.

Their differences come mostly from the training data used. This model has been trained on a portion of the RefinedWeb [@penedo2023refinedweb] dataset augmented with curated text inspired by The Pile [@gao2020pile], while LLaMa uses a dataset which, albeit detailed in the original publication, has not been publicly released.

Finally, for evaluation, we use the instruction fine-tuned versions of Falcon with both 7 and 40 billion parameters, which, like LLaMa, suffer from a limited 2K-token context window.

Metrics

While there has been much discussion about the appropriateness of the ROUGE [@lin-2004-rouge] score for the automatic evaluation of summarization systems ([@10.1162/tacl_a_00373; @graham-2015-evaluating; @ng2015better]), mostly because it is n-gram based and thus does not deal properly with different expressions conveying the same meaning, it is still the most commonly (and often the only) reported metric in new model publications and benchmarks.

This is mostly due to the lack of superior alternatives, with METEOR [@banerjee-lavie-2005-meteor] and BLEU [@papineni-etal-2002-bleu] suffering from the same n-gram-based limitation of failing to capture paraphrases. On the other hand, the recently proposed BERTScore [@zhang2020bertscore] avoids this problem by computing embedding similarity between generated and reference texts.

Nevertheless, according to the findings in @koto-etal-2021-evaluating, the correlation between BERTScore and human evaluation of generated summaries for English text is similar to that of ROUGE. As a result, we have opted to focus on the established ROUGE rather than BERTScore. We report the obtained ROUGE-1, ROUGE-2, and ROUGE-L scores, as well as their geometric mean, similar to the procedure in other publications.
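
For concreteness, the aggregate score can be computed as in the sketch below, using the rouge_score package (one of several ROUGE implementations; the example strings are ours):

```python
from rouge_score import rouge_scorer

reference = "prophylactic thyroidectomy prevents hereditary medullary thyroid cancer"
candidate = "early thyroidectomy can prevent hereditary medullary thyroid cancer"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

r1, r2, rl = (scores[k].fmeasure for k in ("rouge1", "rouge2", "rougeL"))
geometric_mean = (r1 * r2 * rl) ** (1 / 3)   # the aggregate value we report
```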

Experiments {#sec-methodology}

As proposed, we evaluate the above models on the previously described datasets. With respect to the models, we first create a distinction between the models that are meant to be fine-tuned and the ones that are to be used out of the box.

In the section below, we provide technical details and model configurations related to fine-tuning and inference.

Fine-tuning

Given its input size limitation, the vanilla Seq2Seq BART is fine-tuned on its maximum input context of 1024 tokens. The Efficient Transformer BigBirdPegasus is fine-tuned on its maximum input length of 4096 tokens. PEGASUS-X, which supports up to 16,384 tokens, is fine-tuned on both 4096 and 8192 tokens to evaluate the effect of a longer context on the abstractive summarization task. We fine-tuned all Seq2Seq models for a number of epochs dependent on dataset size and convergence level. Further details can be found in [sec:appendix-training-details]{reference-type="autoref" reference="sec:appendix-training-details"}. After fine-tuning, we perform inference and use the corresponding ROUGE scores for the final evaluation.
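
As an illustration of this fine-tuning setup, the sketch below follows the PEGASUS-X / arXiv configuration from the appendix table; the checkpoint, dataset identifier, and column names are assumptions for illustration, and gradient accumulation is used to reach the listed effective batch size:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "google/pegasus-x-base"          # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

raw = load_dataset("ccdv/arxiv-summarization")  # assumed dataset id and columns

def tokenize(batch):
    model_inputs = tokenizer(batch["article"], max_length=8192, truncation=True)
    labels = tokenizer(text_target=batch["abstract"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-x-arxiv",
    learning_rate=8e-4,
    num_train_epochs=4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,   # 2 GPUs x 2 x 16 = effective batch size 64
    bf16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```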

Inference

In order to evaluate the models' performance, we run inference in a Seq2Seq fashion after the fine-tuning procedure for the Efficient Transformer class.

Inference with the LLMs is not trivial, since the usual fine-tuning is too computationally demanding and the usual in-context learning paradigm is not suited to this summarization task: even a single document does not fit within the context window, making it impossible to provide example demonstrations. Given the above reasoning, we evaluate these LLMs by prompting them to summarize the provided content. More details can be found in [sec:appendix-training-details]{reference-type="autoref" reference="sec:appendix-training-details"}.

Results and Discussion

As explained in the experiments section, we distinguish between models that should be fine-tuned and those that produce good results as-is. By fine-tuning BART, BigBirdPegasus, and PEGASUS-X with different configurations, we obtained different versions of these models for our evaluation. We also use the original model weights without any fine-tuning for analysis. For the remaining LLMs, which are meant to be used out of the box, we performed direct inference.

Additionally, we report sample summaries generated by some of the models for the same input text in [[appendix:sample output]](#appendix:sample output){reference-type="autoref" reference="appendix:sample output"}. While we use the ROUGE score as the main indicator of performance, this appendix provides additional insight into the models' behavior beyond what automatic evaluation offers.

We report ROUGE-1, ROUGE-2, and ROUGE-L results, along with the geometric mean of ROUGE-{1,2,L}, for all models evaluated on the three datasets detailed previously. While we discuss the key findings from our experiments in the later part of this section, the results are summarized in [tab:rouge1-score-inference]{reference-type="autoref" reference="tab:rouge1-score-inference"}.

Efficient Transformers remain competitive via fine-tuning: from a bird's eye view, it is clear that the Efficient Transformers, namely BigBird-Pegasus and PEGASUS-X, are the clear winners, as they consistently achieve better ROUGE scores. These are impressive results given the much smaller size and computational requirements of these models compared to state-of-the-art LLMs. Furthermore, as evident in [[appendix:sample output]](#appendix:sample output){reference-type="autoref" reference="appendix:sample output"}, PEGASUS-X and BigBird-Pegasus, the Seq2Seq models fine-tuned on the same domain, produce summaries that are more in line with the technical language of the paper, whereas those generated by LLMs like ChatGPT use simpler wording. However, we cannot neglect the additional effort and cost of fine-tuning on a specific dataset, as models without fine-tuning perform much worse than their fine-tuned counterparts. Nevertheless, for an industrial or production setting, a smaller model such as an Efficient Transformer might be the better choice.

Longer Context Windows have their downsides: for the models that support larger context windows, such as PEGASUS-X and GPT-3.5, scaling up the context window (to 8K for PEGASUS-X and 16K for GPT-3.5) does increase their ROUGE scores, albeit only marginally in most cases. A possible explanation for this phenomenon is that the text relevant for a high-quality summary is not evenly distributed in the source document, so additional context has diminishing returns. Furthermore, given that increasing the context window directly increases training/inference time as well as memory requirements, we argue that, in light of the marginally better ROUGE scores, scaling the input length may not be the ideal choice for resource-constrained environments and particular dataset distributions.

Bigger Models do perform better: while it is well known in the LLM community that bigger models perform better up to a point, we confirm this to be the case in our limited set of experiments. We compare two of the most prominent open-source models, LLaMa (7B vs. 13B) and Falcon (7B vs. 40B), and, as expected, the larger variant performs better in both cases. Additionally, GPT-3.5 outperforms both the Falcon and LLaMa models. While the exact size of GPT-3.5 is unknown, we do know that GPT-3 has 175B parameters and therefore assume the 3.5 variant to be at least larger than Falcon's 40B parameters.

GPT-3.5 outperforms other LLMs: among all the LLMs in our domain-specific text summarization study, GPT-3.5 with a 16K context window performs the best in terms of ROUGE score. Although we evaluated only a portion of the full datasets, the use of random sampling (more details in [sec:appendix-training-details]{reference-type="autoref" reference="sec:appendix-training-details"}) means the reported scores should be indicative of model performance on the full datasets. In conclusion, while the other LLMs are competitive, this model emerges as a strong and versatile option for summarization applications, despite the privacy concerns related to its closed-source nature.

::: table* :::

Limitations {#limitations .unnumbered}

Despite our best attempt to provide an overview of LLMs with regard to their ability to understand domain-specific text, several dimensions of the study could not be explored. A major cause is hardware restrictions. Although we had access to high-quality hardware, its availability was scarce, forcing us to use only one or two GPUs at a time. This limitation prevented us from testing the largest LLMs, which promise the best overall performance on tasks beyond summarization.

Hardware availability was also the reason we could not evaluate performance with the latest domain-adaptation methods, such as adapters [@houlsby2019parameterefficient] and LoRA [@hu2021lora], which make it possible to fine-tune these large models on downstream tasks. Exploring this paradigm would be ideal since the usual LLM in-context learning is impractical for long-document summarization: even a single document is hard to fit within the predefined model context length, so providing additional examples for guidance is impossible.

We would also have liked to include GPT-4 [@openai2023gpt4], currently the most capable LLM, but its API-only access and the large associated costs were prohibitive. Given its 32K maximum context length and human-level comprehension abilities, we expect this model to be very competitive with the fine-tuned Seq2Seq models, all without the need for an expensive training step or for deploying several models for various downstream tasks. This is illustrated by the impressive performance of GPT-3.5 with a 16K context length.

Finally, we note the limited expressiveness of the ROUGE metric, which is not ideal for an abstractive summarization setting. We have mentioned before that it is a poor proxy for human perception of summary quality, which is illustrated by the high ROUGE scores of the standard BART model without any fine-tuning: inspecting the model's outputs, we notice that it often simply repeats the original text, which coincidentally resembles a summary, given that the introduction section usually provides a reasonable overview of the text. In the future, we hope to leverage new metrics that are more in line with what humans perceive as high-quality summaries. Additionally, we wish to study the effectiveness of these automatic evaluation scores by using human evaluation as a baseline.

Ethics Statement {#ethics-statement .unnumbered}

Throughout our experiments, we strictly adhere to the ACL Code of Ethics. Since we used already established open-source benchmark datasets, the concern of privacy does not apply. Furthermore, since no additional data was collected or stored, and no human annotators were used in the experiment, we minimized the risk of prejudice. Through our fine-tuning strategies, no additional bias was introduced into the models, other than what might already be part of the model weights or the benchmark dataset. The goal of the research was to evaluate the text summarization capabilities of existing models. The results and discussions in this paper are meant to further promote research in the area of domain-specific language modeling with an over-arching goal of bridging the gap between academia and application. All training scripts and trained models will be made available to the research community.

Acknowledgements {#acknowledgements .unnumbered}

Training Details {#sec:appendix-training-details}

Training

The fine-tuning procedure was performed on 2 Nvidia A100-80GB GPUs, relying on the HuggingFace Transformers [@wolf-etal-2020-transformers] and Microsoft Deepspeed^3 libraries for distributed training. We plan on releasing the fine-tuned models along with the codebase used in our study.

Moreover, the hyperparameters for the above training runs are described in [table:hyperparameters]{reference-type="autoref" reference="table:hyperparameters"}, and the configuration for Deepspeed Stage 2 can be found in [table:deepspeed]{reference-type="autoref" reference="table:deepspeed"}. In this setting, all values set to auto are filled in automatically by the HuggingFace Trainer from the user-provided values, or from defaults if none are set.

::: table*
  Model / Dataset   Batch Size   Learning Rate   Epochs   Input Tokens   Gen. Tokens   Beam Size
  ----------------- ------------ --------------- -------- -------------- ------------- -----------
  BART
  arXiv             128          8e-4            4        1024           256           1
  PubMed            128          8e-4            4        1024           256           1
  GovReport         128          8e-4            8        1024           1024          1
  BigBirdPegasus
  arXiv             64           8e-4            4        4096           256           1
  PubMed            64           8e-4            4        4096           256           1
  GovReport         64           8e-4            8        4096           1024          1
  PEGASUS-X
  arXiv             64           8e-4            4        4096 / 8192    256           1
  PubMed            64           8e-4            4        4096 / 8192    256           1
  GovReport         64           8e-4            8        4096 / 8192    1024          1
:::

::: table*
  Key                                              Value
  -----------------------------------------------  ----------
  bf16.enabled                                     auto
  optimizer.type                                   AdamW
  optimizer.params.lr                              auto
  optimizer.params.betas                           auto
  optimizer.params.eps                             auto
  optimizer.params.weight_decay                    auto
  scheduler.type                                   WarmupLR
  scheduler.params.warmup_min_lr                   auto
  scheduler.params.warmup_max_lr                   auto
  scheduler.params.warmup_num_steps                auto
  zero_optimization.stage                          2
  zero_optimization.offload_optimizer.device       cpu
  zero_optimization.offload_optimizer.pin_memory   true
  zero_optimization.allgather_partitions           true
  zero_optimization.allgather_bucket_size          2e8
  zero_optimization.overlap_comm                   true
  zero_optimization.reduce_scatter                 true
  zero_optimization.reduce_bucket_size             2e8
  zero_optimization.contiguous_gradients           true
  gradient_accumulation_steps                      auto
  gradient_clipping                                auto
  steps_per_print                                  2000
  train_batch_size                                 auto
  train_micro_batch_size_per_gpu                   auto
  wall_clock_breakdown                             false
  zero_allow_untested_optimizer                    true
:::
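
For reference, such a configuration is typically passed to the HuggingFace trainer as in the brief sketch below (the file name is illustrative); the auto entries are then resolved from the training arguments:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-x-arxiv",
    deepspeed="ds_config_stage2.json",   # hypothetical path to the Stage 2 config above
    bf16=True,
    learning_rate=8e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,      # 2 GPUs x 2 x 16 = effective batch size 64
)
```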

Inference

For inference, we rely on a single Nvidia A100-80GB GPU, which is capable of handling our models in the bfloat16 format. The one exception is Falcon-40B, which required loading the model in 8-bit quantized form using the bitsandbytes [@dettmers2022llmint8] library; we consider possible performance losses due to this approach mostly insignificant, as the obtained ROUGE scores lie in the expected range. The GPT-3.5 model was evaluated using the API made available by OpenAI^4, where we used the latest snapshot available, in this case gpt-3.5-turbo-0613, dated June 13th, 2023.
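
The 8-bit loading step can be sketched as follows (the checkpoint identifier refers to the publicly released instruction-tuned Falcon model; the flags reflect the transformers/bitsandbytes integration available at the time):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,        # bitsandbytes LLM.int8() quantization
    device_map="auto",        # spread layers across the available GPU/CPU memory
    trust_remote_code=True,   # Falcon shipped custom modeling code at release
)
```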

As was the case for the training procedure, we sample a maximum of 256 tokens for the arXiv and PubMed datasets, while scaling to 1024 tokens for the GovReport dataset, as is standard procedure in other contemporary publications.

Regarding the prompt used to perform inference on the open-source LLMs, we show it in [fig:prompt_template]{reference-type="autoref" reference="fig:prompt_template"}. To make sure the "SUMMARY: " marker is included in the context window, we always append it as the final input tokens.
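
A possible implementation of this suffix forcing is sketched below (the budget handling is simplified and does not reserve space for generated tokens; the helper name is ours):

```python
def build_prompt(document: str, tokenizer, max_input_tokens: int) -> str:
    """Truncate the document so the trailing "SUMMARY: " marker always fits
    inside the model's context window."""
    suffix = "\nSUMMARY: "
    budget = max_input_tokens - len(tokenizer(suffix)["input_ids"])
    doc_ids = tokenizer(document, truncation=True, max_length=budget)["input_ids"]
    return tokenizer.decode(doc_ids, skip_special_tokens=True) + suffix
```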

Finally, for the GPT model, we used the system message, "You are an expert at summarization. Proceed to summarize the following text", followed by the maximum portion of the source document that fits in the model's context window. Moreover, for the 16K context-length experiments, we could only evaluate a sample of the overall datasets: we set the PyTorch randomization seed to $1337$ to ensure our experiments are reproducible and select a random 10% slice of the full dataset for analysis.
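
The corresponding evaluation loop can be sketched as follows; the dataset identifier, column name, character-level truncation, and the exact 16K snapshot name are assumptions on our part (the calls use the pre-1.0 OpenAI Python client available at the time):

```python
import openai
from datasets import load_dataset

test = load_dataset("ccdv/pubmed-summarization", split="test")   # illustrative dataset id
test = test.shuffle(seed=1337).select(range(len(test) // 10))    # seeded 10% slice

system_msg = ("You are an expert at summarization. "
              "Proceed to summarize the following text")

for sample in test:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",              # the "-16k" snapshot for 16K experiments
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": sample["article"][:12000]},  # crude truncation
        ],
        max_tokens=256,                          # 1024 for GovReport
    )
    summary = response["choices"][0]["message"]["content"]
```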

::: figure* image{width="\linewidth"} :::

Sample Output {#appendix:sample output}

Summaries for the sample (idx 42) from the test split of the PubMed dataset.

PEGASUS-X fine-tuned on 8192 tokens:

multiple endocrine neoplasia ( men ) type 2a and b syndromes are rare but significant endocrine diseases that are increasingly encountered by pediatricians. mtc is a relatively slow - growing tumor but, if metastasized or relapsed, it becomes very aggressive causing more than 13% of all thyroid cancer - related mortality. mtc is usually diagnosed in the childhood, and the majority of mtc diagnosed in the childhood is hereditary. mtc diagnosed during childhood almost always results from a dominantly inherited or de novo activating mutations in the ret proto - oncogene, which encodes the ret receptor tyrosine kinase. advances in predictive genetic testing for ret mutations have enabled early diagnosis of hereditary men syndromes and prophylactic thyroidectomy in presymptomatic patients to prevent mtc. in this review, we discuss the etiology of pediatric mtc and currently available therapeutic modality for the cancer.

GPT-3.5 direct inference on 4096 tokens:

Thyroid cancer is a common type of cancer that accounts for about 1% of all cancers. Medullary thyroid cancer (MTC) is a type of thyroid cancer that is slow-growing but can become aggressive if it spreads or relapses. It is more common in children and is usually hereditary. MTC is caused by mutations in the RET proto-oncogene. Early diagnosis of hereditary MTC through genetic testing has enabled preventive measures such as thyroidectomy in presymptomatic patients. The main treatment for MTC is surgical removal of the thyroid, and adjuvant treatment options are limited. However, newer targeted therapies such as vandetanib and cabozantinib have shown promise in clinical trials. Prophylactic thyroidectomy at an early age is recommended for children with hereditary MTC to reduce the risk of developing aggressive MTC.

BigBird-Pegasus fine-tuned on 4096 tokens:

thyroid cancer ( mtc ) is the most common endocrine neoplasia. mtc is characterized by the development of a highly penetrant, autosomal dominant, and highly aggressive form of mtc. the most common genetic mutation in mtc is the ret mutation. the men2a subtype is characterized by the early development of an aggressive form of mtc in all affected individuals, typically during the first year of life. the men2b subtype is characterized by the early development of an aggressive form of mtc in all affected individuals, typically during the first year of life. the men2a subtype, accounting for 9095% of the men type 2 cases, is a highly penetrant, autosomal dominant endocrine tumor syndrome characterized by the development of mtc in > 90% of ret mutation carriers. the men2b subtype, accounting for approximately 510% of the men type 2 cases, is characterized by the early development of an aggressive form of mtc in all affected individuals, typically during the first year of life. the men2a subtype, accounting for 9095% of the men type 2 cases, is closely associated with men2a, demanding a genetic screening for men2