Introduction
The task of text simplification (TS) is usually posed as follows: given a source sentence, rewrite it so that it is easier to read, using techniques that range from lexical simplification to reformulating the grammatical structure of the sentence, paraphrasing wherever needed. The purpose of this task is to facilitate the reading and understanding of potentially complicated texts for readers with cognitive disabilities such as aphasia [@article], dyslexia [@10.1145/2461121.2461126] or autism [@evans-etal-2014-evaluation], as well as for second language learners [@10.5555/3016387.3016433].
Furthermore, with governments passing legislation that requires legal and public texts to be available in simplified language [@Maass2020], the field is gaining more and more traction, as public entities are now obliged to translate their entire information base into simplified language.
Given that this is an active area of research, our initial focus was on emulating key publications and, where possible, building on their work. Over the course of the semester, we focused mainly on general rewriting of the provided inputs through the fine-tuning of large language models, on controlling this rewriting through explicit control tokens and, in a narrower scope, on identifying and replacing lexically complex words in a pre- or post-processing step.
We achieved moderate success, getting close to the state of the art (43.05 SARI [@xu-etal-2016-optimizing] on the ASSET dataset [@alva-manchego-etal-2020-asset]). Nonetheless, there is still definite room for improvement, as some topics were left unexplored either due to lack of time or due to the compute power required to train these large models. Throughout the report, and in a dedicated section on future work, we discuss what we consider the bottlenecks of our system to be, and possible ways to tackle them.
Related Work
In this section, we describe prior research in three areas: Sentence Simplification, Controllable Text Generation and Lexical Simplification.
Sentence Simplification
Sentence simplification can be seen as a monolingual machine translation task, where models are trained with aligned pairs of sentences obtained, for example, from Wikipedia articles and their corresponding Simple Wikipedia versions [@zhu-etal-2010; @Wubben-etal-2012].
To this end, early work focused on Statistical Machine Translation models [@zhu-etal-2010], but those have since been surpassed by large neural models. Among several others: @Nisioi1-2017-neural used a vanilla recurrent neural network for text simplification; @zhang-lapata-2017-sentence combined RNNs and reinforcement learning; @https://doi.org/10.48550/arXiv.1810.11193 introduced the transformer architecture for TS and integrated it with a paraphrase dataset [@Pavlick-2016-PPDB]; @Dong-2019-EditNTS presented EditNTS, a reinforcement learning model that learns ADD, DELETE and KEEP operations to simplify text.
Controllable Text Generation
In the last few years, controllable text generation with conditional training of Seq2Seq models has been applied to different NLP tasks such as summarization [@https://doi.org/10.48550/arXiv.1711.05217], politeness in machine translation [@Sennrich-2016], sentence compression [@https://doi.org/10.48550/arXiv.1809.02669; @Mallinson-2018], along with others.
In the text simplification task, @Scarton-2018 pioneered this idea of controllable tokens to generate simplified sentences by embedding a grade level token into a Seq2Seq model. ACCESS [@https://doi.org/10.48550/arXiv.1910.02677] leveraged a similar approach by conditioning the sentences on number of characters, character-level Levenshtein similarity, word frequency and syntactic complexity.
Furthermore, building on ACCESS, @https://doi.org/10.48550/arxiv.2005.00352 presented MUSS, an unsupervised multilingual model that fine-tunes a BART language model instead of the original Transformer architecture. More recently, @sheang-saggion-2021-controllable revisited this idea with the newer T5 model [@https://doi.org/10.48550/arxiv.1910.10683] and an additional token encoding the word-count ratio between sentences, becoming the current state of the art.
Lexical Simplification
Lexical simplification (LS) is the task of identifying complex words and finding the best candidates to replace them. Early studies on Complex Word Identification (CWI) usually identified complex words based on a word frequency threshold [@Biran-2011] or word length [@Bautista-2009], or even attempted to simplify all words [@Thomas-2012]. However, @Horn-2014 showed that this last approach may ignore a large portion of complex words due to its inability to find simpler alternatives.
Moreover, since @Shardlow-2013 showed that the simplify-all approach might distort meaning and that the more resource-intensive threshold-based approach does not necessarily perform better, newer approaches [@gooding-kochmar-2019-complex] perform CWI conditioned on the context a word appears in, in a sequence labelling fashion.
Regarding complex word replacement, earlier rule-based systems usually replace an identified word with its most frequent synonym in WordNet [@Thomas-2012]. More recent approaches use contextualized word vectors [@Qiang-etal-2020], leveraging the powerful neural models currently available.
Datasets
Fortunately, when considering text simplification for English sentences, there exists a large amount of data publicly available, mainly due to the existence of a Simple English version of Wikipedia^1 and educational article sources such as Newsela^2.
In addition, there have been numerous efforts to align these articles, providing the research community with high-quality simplification datasets for English. Among them are WikiLarge [@zhang-lapata-2017-sentence] and WikiAuto [@https://doi.org/10.48550/arxiv.2005.02324], which differ in how they were constructed and in overall size.
Additionally, in order to augment the training data, we made use of a paraphrase dataset called OpusParcus [@https://doi.org/10.48550/arxiv.1809.06142]: it aligns independently authored subtitles for movies and TV shows, resulting in paraphrase pairs for scenes with the same meaning.
Moreover, there are two test datasets that are commonly discussed in the literature: TurkCorpus [@xu-etal-2016-optimizing] and the ASSET [@alva-manchego-etal-2020-asset] dataset. These last two source the same original sentences, but resort to different simplification techniques in the crowdsourced references they provide. While TurkCorpus only allows for rewriting the original sentence, ASSET is less restrictive and makes it possible to delete expressions or split the ideas over several sentences.
On the other hand, for the lexical simplification approach, we utilized the English dataset from the Complex Word Identification 2018 Shared Task [@Yimam-2018], called CWIG3G2 [@yimam-etal-2017-cwig3g2].
Finally, table 1{reference-type="ref" reference="dataset_sizes"} summarizes each dataset's size.
Newsela
Featuring the Newsela dataset in our project would have been an excellent addition, since it consists of educational articles with professional simplifications. However, Newsela is a corporate entity and its data is not publicly available. We requested access but, unfortunately, did not receive a reply.
::: {#dataset_sizes}
  Name         #Train     #Val    #Test
  ------------ ---------- ------- -------
  WikiLarge    296,402    -       -
  WikiAuto     604,000    -       -
  TurkCorpus   -          2,000   359
  ASSET        -          2,000   359
  OpusParcus   ~500,000   -       -
  CWIG3G2      27,299     3,328   4,252

  : Datasets and the respective number of samples in their train, validation and test splits.
:::
Evaluation of our system
In order to evaluate our models, we rely on three evaluation metrics: SARI, FKGL and BLEU. We compute them using EASSE [@https://doi.org/10.48550/arXiv.1908.04567], a simplification evaluation library in Python.
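As a quick illustration, the snippet below shows how these scores can be obtained with EASSE; it is a minimal sketch, and the function names (`corpus_sari`, `corpus_bleu`, `corpus_fkgl`) reflect our understanding of the library's API and should be checked against its documentation.

```python
# Minimal sketch of metric computation with EASSE.
from easse.sari import corpus_sari
from easse.bleu import corpus_bleu
from easse.fkgl import corpus_fkgl

orig_sents = ["The minister abrogated the decree."]
sys_sents = ["The minister ended the rule."]
# One inner list per reference "annotator", aligned with sys_sents.
refs_sents = [["The minister cancelled the decree."],
              ["The minister ended the decree."]]

print("SARI:", corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents,
                           refs_sents=refs_sents))
print("BLEU:", corpus_bleu(sys_sents=sys_sents, refs_sents=refs_sents))
print("FKGL:", corpus_fkgl(sys_sents))
```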
SARI
SARI [@xu-etal-2016-optimizing] compares the system output against both the references and the input sentence. It measures the goodness of the words that are added, deleted and kept by the system, comparing the model output to multiple references and to the original sentence using n-gram precision and recall.
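At a high level, and omitting the averaging over n-gram orders $n = 1, \dots, 4$, the score combines the F1 of added and kept n-grams with the precision of deleted n-grams:

$$\mathrm{SARI} = \frac{F_{\text{add}} + F_{\text{keep}} + P_{\text{del}}}{3}$$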
So far, SARI is the most commonly adopted metric for text simplification in English, and we use it as a reference score of the overall performance.
FKGL
In order to measure the readability of our systems, we use FKGL [@Kincaid1975DerivationON]. It is computed as a linear combination of sentence length and the number of syllables per word.
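For reference, the standard Flesch-Kincaid grade level formula is

$$\mathrm{FKGL} = 0.39\left(\frac{\#\text{words}}{\#\text{sentences}}\right) + 11.8\left(\frac{\#\text{syllables}}{\#\text{words}}\right) - 15.59$$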
Although FKGL does not take grammaticality or meaning preservation into account [@Wubben-2012], it is one of the most widely used evaluation metrics for text simplification in English (only). Due to this limitation, however, it should not be used as the sole evaluation metric.
BLEU
Usually employed to evaluate machine translation systems, BLEU [@papineni-etal-2002-bleu] is an n-gram based metric that is expected to correlate with meaning preservation.
Although it has been reported that BLEU does not necessarily correlate with the quality of simplifications [@https://doi.org/10.48550/arxiv.2104.07560; @sulem-etal-2018-semantic], we still report it in order to compare our models with existing work.
Fine-Tuning Language Models for Simplification
As a baseline, we first trained a basic Seq2Seq model with attention [@https://doi.org/10.48550/arxiv.1409.0473]. However, as it did not produce good or even promising results, possibly due to implementation-related problems, work in this direction was halted. We still report this model in our results under the name Baseline. Furthermore, for our initial experiments we settled on a subset of the larger WikiAuto dataset (circa 50,000 sentences), which we found sufficient for a proof-of-concept demonstration.
As a next step, and drawing heavily on the MUSS paper [@https://doi.org/10.48550/arxiv.2005.00352], we set out to fine-tune a BART [@https://doi.org/10.48550/arxiv.1910.13461] model on the subset of WikiAuto mentioned above. We resorted to the HuggingFace^3 `transformers` package, which provides all the necessary tools to build and fine-tune our models. Specifically, we made use of the `facebook/bart-base` model, which is adequate for our computing resources: training was conducted on an instance of Google Colab Pro^4, as neither of us has a machine capable of training at this model/dataset scale.
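The sketch below outlines this fine-tuning setup with the `transformers` Seq2Seq Trainer API (recent library versions). It is a simplified illustration rather than our exact training script; the toy pair and all hyperparameter values are placeholders.

```python
# Sketch: fine-tuning facebook/bart-base on aligned complex -> simple pairs.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

# Toy aligned pair; in our setup these come from the WikiAuto subset.
pairs = [{"source": "The feline was perched atop the refrigerator.",
          "target": "The cat sat on the fridge."}]
dataset = Dataset.from_list(pairs)

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bart-simplifier",      # placeholder values throughout
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=3e-5,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```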
From there, we obtained a model with reasonably sensible behavior, achieving a 38.59 SARI score on the ASSET dataset (more details and comparisons on all of our model results are discussed in a later section).
In addition to BART, we also experimented with the T5 [@https://doi.org/10.48550/arxiv.1910.10683] model, which should have yielded relatively better results. However, we encountered overfitting issues during fine-tuning and tried some known fixes (freezing layers, dropout, etc.) to no avail.
Ranking candidate sentences
Finally, one other aspect worth optimizing when it comes to these large language models' generation ability is that they can output several candidate sentences. Using beam search over the most likely tokens, the model keeps track of the sentences with the highest likelihood and dynamically drops the weaker ones when a more probable candidate arises. But then, how should we rank these candidates and output a single, simplest sentence?
One simple heuristic is to compute the FKGL metric over the candidates and return the lowest score, which should indicate the simpler sentence overall. However, this completely disregards the semantic meaning in each sentence. In @https://doi.org/10.48550/arxiv.1901.10746, machine translation metrics such as METEOR [@banerjee-lavie-2005-meteor] and smoothed BLEU [@lin-och-2004-automatic] showed the largest correlation with meaning preservation. Furthermore, we would want to encourage grammatical structure simplification, which can be measured through SAMSA [@sulem-etal-2018-semantic]. Given all of these metrics and aspects to optimize, we believe that a harmonic compromise between them could be a plausible solution to the ranking problem.
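As an illustration of this FKGL-based heuristic, the following sketch generates several beam-search candidates and keeps the lowest-scoring one. The `textstat` package is used here purely for readability scoring (our system computes FKGL through EASSE), and the model name is a placeholder for our fine-tuned checkpoint.

```python
# Sketch: rank beam-search candidates by FKGL and return the "simplest" one.
import textstat
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # placeholder
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

def simplify(sentence: str, num_candidates: int = 5) -> str:
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model.generate(**inputs,
                             num_beams=num_candidates,
                             num_return_sequences=num_candidates,
                             max_length=128)
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Lower Flesch-Kincaid grade level ~ easier to read.
    return min(candidates, key=textstat.flesch_kincaid_grade)
```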
On the other hand, we also experimented with another idea: drawing inspiration from @https://doi.org/10.48550/arxiv.2104.07560, we tried using a question generation and answering framework on the source and output sentences in order to evaluate how well the semantic meaning is preserved. For simplification, this requires an adaptation of QuestEval [@https://doi.org/10.48550/arxiv.2103.12693] that uses BERTScore [@https://doi.org/10.48550/arxiv.1904.09675] instead of the plain F1 score of the provided answers. Since the simplification model is encouraged to replace words with synonyms, we expect their respective BERT [@https://doi.org/10.48550/arxiv.1810.04805] embeddings to be similar.
What we found, however, is that this introduced a significant overhead in our model's inference time without much gain in results. Using a threshold on the candidate sentences' BERTScore, we only managed to match our previous SARI result once the threshold was low enough that the output was identical to the one produced by FKGL ranking. Therefore, given the worse efficiency and poor results, we dropped the question generation/answering idea from our final system.
Paraphrase dataset augmentation
Closing the fine-tuning section, we should discuss another experiment we conducted. MUSS [@https://doi.org/10.48550/arxiv.2005.00352], one of the most successful recent papers in the TS field, augments the WikiLarge dataset with paraphrase data, easing the rewriting of phrases and expressions. The authors devised an unsupervised method for aligning sliding windows of text scraped from the web, computing a similarity score between them for alignment. In the end, they constructed an unsupervised paraphrase dataset of circa 1 million samples, something unfeasible for us to reproduce.
Instead of gathering such a dataset ourselves, we tried to use an existing paraphrase dataset to augment WikiLarge with further sentence rewriting examples. To this effect, we chose a subset of OpusParcus [@https://doi.org/10.48550/arxiv.1809.06142], resulting in an augmented training dataset of ~500,000 aligned sentences. After a substantial amount of training time, this idea (which still sounds promising) failed to produce an exciting SARI score, which prompted us not to investigate further due to time constraints.
Encoding Simplification With Explicit Tokens
In addition to general fine-tuning, we wanted to control sentence simplification attributes using explicit control tokens. First, we compute them for every sentence pair in the training set and inject the source-to-target ratios into the source sentence, so that the model learns to condition its output on these ratios. As a measure of simplicity, we compute the following tokens (a sketch of how they are computed and prepended appears after this list):
#Chars `<c_xx>`
: the character-count ratio between the source and target sentences. This control token provides information about sentence compression.

LevSim `<lev_xx>`
: normalized character-level Levenshtein similarity [@Levenshtein-1966] between the source and target. It quantifies how different the target is from the source sentence.

WordRank `<rank_xx>`
: inverse frequency order of all words in the target divided by that of the source. Word frequency is an indicator of word complexity.

DepTree `<dep_xx>`
: maximum depth of the dependency tree of the target divided by that of the source. This token provides information about syntactic complexity.

#Words `<rat_xx>`
: the number of words in the target divided by that of the source, providing information about compression at the word level.
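The snippet below is a hedged sketch of how such ratios can be computed and prepended to a source sentence; the token spelling and the rounding are illustrative rather than our exact preprocessing code.

```python
# Sketch: compute control-token ratios for a (source, target) pair and
# prepend them to the source sentence.
import Levenshtein  # python-Levenshtein

def add_control_tokens(source: str, target: str) -> str:
    ratios = {
        "c":   len(target) / len(source),                  # #Chars
        "lev": Levenshtein.ratio(source, target),          # LevSim
        "rat": len(target.split()) / len(source.split()),  # #Words
        # WordRank and DepTree are computed analogously, using a word
        # frequency table and a dependency parser (e.g. spaCy).
    }
    prefix = " ".join(f"<{name}_{value:.2f}>" for name, value in ratios.items())
    return f"{prefix} {source}"

# Training pairs use ratios computed from the reference target;
# at inference, fixed ratios chosen via hyperparameter search are used instead.
print(add_control_tokens("The minister abrogated the decree.",
                         "The minister ended the rule."))
```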
At inference time, we condition the generation by choosing the values most suitable to the degree of simplification required.
Hyperparameter search
At inference time, we have to choose fixed ratios that maximize simplicity for the target audience (as measured by the SARI score). To this effect, we handcrafted different candidate values for each token and tried out combinations between them. This grid search grows exponentially with the number of tokens, which is something we could not afford to run several times. In the end, after trying out several combinations on the ASSET dataset, we settled on the values shown in table 2{reference-type="ref" reference="token_comb"} (a sketch of the search loop follows the table).
::: {#token_comb}
  #Chars   LevSim    WordRank   DepTree   #Words    #Syl
  -------- --------- ---------- --------- --------- -----------
  c_0.8    lev_0.5   rank_0.8   dep_0.9   rat_0.9   n_syl_1.9

  : Best token combination for the ASSET test set, training on WikiLarge.
:::
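The following sketch shows the shape of this exhaustive search. The value grids are hypothetical, and `evaluate_sari` stands in for a full run of our model on the validation set followed by EASSE's SARI computation.

```python
# Sketch: exhaustive search over handcrafted control-token values.
from itertools import product

GRID = {
    "c":    [0.7, 0.8, 0.9, 0.95],
    "lev":  [0.4, 0.5, 0.6, 0.75],
    "rank": [0.7, 0.8, 0.9, 1.0],
    "dep":  [0.8, 0.9, 1.0],
    "rat":  [0.8, 0.9, 1.0],
}

def grid_search(evaluate_sari):
    """evaluate_sari maps a token-value dict to a SARI score on a validation set."""
    best_score, best_combo = float("-inf"), None
    for values in product(*GRID.values()):  # grows exponentially with #tokens
        combo = dict(zip(GRID.keys(), values))
        score = evaluate_sari(combo)
        if score > best_score:
            best_score, best_combo = score, combo
    return best_combo, best_score
```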
Finding new tokens
In addition to the tokens described in @sheang-saggion-2021-controllable, we explored new token ideas that could possibly increase our overall SARI score (42.94 originally). The following options were tested:
#Splits: the ratio of the number of sentences between source and target. This should provide information about the model's ability to split a long sentence into shorter ones. Result: $\mathbf{42.43}$.
Average word length: the ratio of the average word length between the source and target sentences. It provides information about word compression. Result: $\mathbf{42.475}$.
#Syl: the ratio of the average number of syllables per word. It should help readability. Result: $\mathbf{43.05}$.
More tokens could be tested, such as the number of polysyllables or the proportion of edited, deleted and added words, for example. However, these tokens would control the sentence compression and readability, which are already controlled to some degree.
Finally, since only one of these tokens produced better results, we only introduce the novel `n_syl` token and add it to our system.
Ablation study
In this section, we study the impact that each token has over the generated outputs. In figure [fig:feature_corr]{reference-type="ref" reference="fig:feature_corr"} we present how varying one feature in isolation impacts the others. The displayed results stem from computing these metrics over the ASSET test set outputs and averaging them over all the sentences.
We observe that some tokens are strongly correlated, while others do not appear to have much influence on the majority of the outputs. For example, character compression, word ratio, Levenshtein similarity and dependency tree depth are clearly correlated: shorter sentences exhibit more rewriting and fewer words overall, which also impacts the dependency tree depth. On the other hand, tokens like word complexity and the number of syllables do not seem to have much effect, barely varying the output metric.
Nonetheless, overall, the tokens seem to be mostly behaving as expected, showing more or less a linear increase if we look at each feature's data separately.
Lexical Simplification
Once the simplification model performed to a decent standard, it was time to experiment with explicit lexical simplification, setting up a pipeline that tries to simplify any complex words. This implies two separate models: one to identify which words to replace and one to perform the actual replacement.
Complex Word Identification
Initially, as a baseline for identifying whether a word is complex, we tried a simple heuristic: with access to the Zipf [@zipf_1949] word frequencies computed over a large corpus of text, we go over each word in the input sentence and mark it as complex if its frequency falls below a certain threshold. (Zipf's law is a power-law model of how word frequencies are distributed in a collection of texts.)
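A minimal sketch of this heuristic is shown below, using the `wordfreq` package for Zipf scores; the threshold value is a hypothetical placeholder.

```python
# Baseline CWI heuristic sketch: flag words whose Zipf frequency is low.
from wordfreq import zipf_frequency

ZIPF_THRESHOLD = 3.5  # hypothetical cut-off; lower Zipf score = rarer word

def complex_words(sentence: str) -> list[str]:
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return [w for w in words if w and zipf_frequency(w, "en") < ZIPF_THRESHOLD]

print(complex_words("The physician prescribed an anticoagulant."))
```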
This approach works reasonably well, but it has its shortcomings. For example, it fails to consider a word's context, which was exploited by @gooding-kochmar-2019-complex to improve the same CWI task. Thus, following the paper's footsteps, we decided to model this problem as a sequence classification task: we prefix a sentence (similarly to the token idea described in previous sections) with the word to be labeled and let our model predict whether it is complex or not.
Moreover, we should discuss what it means for a word to be complex. Most of the available research poses CWI as a binary classification task, but complexity is more fine-grained than a binary label: a word is not equally complex in the eyes of a native and a non-native speaker. In this direction, there have been recent efforts to build regression models that capture these different views; this was the objective of the 2021 SemEval Lexical Complexity Prediction Shared Task [@shardlow-etal-2021-semeval].
Similarly, we adapted our code to perform regression on the CWI dataset from @yimam-etal-2017-cwig3g2, which only provides binary complexity annotations. However, the dataset also records how many annotators found each word complex out of the total number of annotators, making it possible to model the probability of a word being complex as the regression target.
For the regressor, we used the DistilBERT [@https://doi.org/10.48550/arxiv.1910.01108] model available in the `transformers` package in a sequence classification setup with a single output class, effectively performing regression. After fine-tuning on the binary CWI dataset, we obtained a model that performed quite well, achieving a Mean Absolute Error of 0.052 on the test portion of the dataset; even when evaluating it visually (see table [table:cwi]{reference-type="ref" reference="table:cwi"}), we can see that it behaves as expected. The benefit of regression is that it allows us to set a threshold that better fits our data and, after some tweaking, we settled on a value of 0.2.
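The sketch below illustrates this setup; with `num_labels=1`, the `transformers` sequence classification head is trained with an MSE loss, i.e. as a regressor. The example word, sentence and pairing scheme are illustrative.

```python
# Sketch of the CWI regressor: DistilBERT with a single output label.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1)  # single label -> regression

# The target word is paired with its sentence, mirroring the prefixing idea above.
word = "remuneration"
sentence = "A fee is the price one pays as remuneration for services."
inputs = tokenizer(word, sentence, return_tensors="pt")
score = model(**inputs).logits.item()  # after fine-tuning: estimated complexity
is_complex = score > 0.2               # threshold chosen on the validation data
```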
::: {#table:cwi}
  Use HTML and CSS ... and only with good reason.

  Stallone also had an ... appearance in the 2003 French film Taxi 3 as a passenger.

  A fee is the price one pays as ... for services, especially the ... paid to a doctor, lawyer, ..., or other member of a learned ... .

  : Example sentences used to visually evaluate the CWI regressor; the highlighted (elided) words are those predicted as complex.
:::
Complex Word Replacement
First, we believe this portion of our system is its biggest bottleneck, unsurprisingly so, since our time-constrained solution is very simple: after identifying which words to replace, we generate $n$ copies of the source sentence, where $n$ is the number of complex words; in each copy we mask one of the complex words and leverage DistilRoBERTa [@https://doi.org/10.48550/arxiv.1910.01108] to predict the most likely token given the context.
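A hedged sketch of this replacement step is given below, using the `transformers` fill-mask pipeline with a DistilRoBERTa checkpoint; the example sentence is illustrative.

```python
# Sketch: mask one identified complex word and let the fill-mask head
# propose the most likely in-context substitute.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilroberta-base")

def replace_word(sentence: str, complex_word: str) -> str:
    masked = sentence.replace(complex_word, fill_mask.tokenizer.mask_token, 1)
    return fill_mask(masked, top_k=1)[0]["sequence"]

print(replace_word("The physician prescribed an anticoagulant.", "physician"))
```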
Now, this solution has some merit, as the most likely word should be one that occurs very frequently in the given context and is therefore simple. On the other hand, the reasoning behind replacing each word individually is damage control, since there is no guarantee that the substitute word is even remotely similar to the original. In particular, when replacing nouns, the most likely scenario is for the model to pick another, more common noun (e.g., neurosurgeon $\rightarrow$ businessman).
Summarizing, this is not the optimal setting for this task, but it was the one we could implement in the limited time available.
Results
Finally, in table [results]{reference-type="ref" reference="results"} we present the results of all developed models, tested on the ASSET and TurkCorpus datasets and trained on WikiLarge or on a subset of WikiAuto. For each model we report the SARI, BLEU and FKGL metrics. Our best results were produced by BART with all the tokens available in the literature, plus the novel `n_syl` token.
We also observe disappointing results for the T5-based models, especially since T5 is the main ingredient of the current state-of-the-art system [@sheang-saggion-2021-controllable]. Another interesting finding is that CWI decreases the SARI score while displaying the best FKGL score, emphasizing increased readability at the expense of meaning preservation.
Furthermore, we only performed the hyperparameter search on the ASSET dataset, so the token results for the TurkCorpus dataset should be taken with a grain of salt and could substantially improve. We also did not test all the features on the WikiAuto-trained model, since it is expected to perform worse than its WikiLarge counterpart.
All in all, we got close to the state of the art, with our best model (BART+Tokens+n_syl) achieving a SARI of 43.05, which should only improve with the suggestions from the Future Work section.
::: {#results}
| Training set | Model                 | ASSET SARI $\uparrow$ | ASSET BLEU $\uparrow$ | ASSET FKGL $\downarrow$ | TurkCorpus SARI $\uparrow$ | TurkCorpus BLEU $\uparrow$ | TurkCorpus FKGL $\downarrow$ |
|:-------------|:----------------------|-------:|-------:|------:|-------:|-------:|-------:|
| WikiLarge    | Baseline              | 23.221 | 0.008  | 5.825 | 17.830 | 0.004  | 35.298 |
| WikiLarge    | BART                  | 38.59  | 84.03  | 7.294 | 38.77  | 73.38  | 7.08   |
| WikiLarge    | T5                    | 36.15  | 89.98  | 8.37  | 38.14  | 80.51  | 7.52   |
| WikiLarge    | T5+Tokens             | 36.89  | 84.05  | 7.92  | 36.33  | 72.72  | 6.71   |
| WikiLarge    | BART+Tokens           | 42.94  | 67.31  | 5.1   | 37.85  | 65.31  | 8.09   |
| WikiLarge    | BART+Tokens+n_syl     | 43.05  | 74.64  | 5.72  | 38.60  | 61.15  | 5.72   |
| WikiLarge    | BART+Tokens+n_syl+CWI | 42.996 | 63.65  | 5.095 | 38.20  | 53.20  | 5.34   |
| WikiAuto     | Baseline              | 25.637 | 0.019  | 7.439 | 15.950 | 0.015  | 15.732 |
| WikiAuto     | BART                  | 39.21  | 85.88  | 6.57  | 36.30  | 79.44  | 7.61   |
| WikiAuto     | T5                    | 37.654 | 85.532 | 6.994 | 37.339 | 80.660 | 7.497  |

: Results of all developed models on the ASSET and TurkCorpus test sets, trained on WikiLarge or on a subset of WikiAuto.
:::
Future work
Finally, having identified some problems, bottlenecks and potential areas of improvement affecting our system's performance, we briefly discuss them here in order to facilitate the continuation of our work.
BRIO-like training procedure
To begin, we came across the BRIO [@https://doi.org/10.48550/arxiv.2203.16804] model, which is currently the state of the art in abstractive summarization. Due to the similarity between the two tasks, we think that replicating BRIO's training procedure could substantially increase our model's performance.
The authors have introduced a novel training procedure that combines contrastive loss and the regular MLE-based cross entropy loss we already utilize. Particularly, the model utilizes the contrastive loss with the assumption that text generation is not a "one correct answer" type of task and thus, assigns probability mass to the supposedly suboptimal candidate outputs.
As previously explained, BART-like models generate candidate sentences in a token-level autoregressive fashion, using beam search to limit the output space and, consequently, being able to produce several candidates. Under the MLE assumption, the first candidate is the most likely, since it is the one closest to the provided reference. This line of thought is flawed, however, since other candidates can be perfectly valid paraphrases, which is our desired outcome.
To combat this, the contrastive loss is based on an arbitrary metric suitable to the task at hand (in our case, SARI). This training procedure makes it possible to reorder the candidate sentences according to their simplicity, directly complementing the token-level generation objective to improve the system's performance.
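For reference, and adapting the notation of the BRIO paper to our setting, the combined objective would take the following shape, where $f(S)$ is the length-normalized log-probability the model assigns to candidate $S$, candidates are sorted by decreasing SARI, and $\lambda_{ij} = (j - i)\lambda$ is a rank-dependent margin:

$$\mathcal{L} = \mathcal{L}_{\text{MLE}} + \gamma \sum_{i}\sum_{j > i} \max\bigl(0,\; f(S_j) - f(S_i) + \lambda_{ij}\bigr)$$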
Furthermore, since we already investigated this ranking-type of approach after the fine-tuning procedure, it is reasonable that incorporation during training will improve our system.
From Single Complex Words to Multi-Word Expressions
In addition to the CWI and replacement, a possible improvement would be to also replace Multi-word Expressions (MWE) as they represent word sets that should be treated as single lexical units.
Since MWEs are most of the time inherently complex, replacing them with a single-word synonym, when possible, could facilitate the TS task and improve our overall system, as shown in both @https://doi.org/10.48550/arxiv.2005.05692 and @gooding-etal-2020-incorporating.
Replacement of Complex Words
As mentioned above, replacing complex words is our model's biggest bottleneck, as it only accounts for the context and not for the word itself. In naturally occurring sentences, it is perfectly ordinary for antonyms, or words with entirely different meanings, to appear in the same context (e.g. agreement and disagreement in legal texts), which makes our approach inherently flawed.
However, the previous research we encountered [@Biran-2011; @Qiang-etal-2020] uses either a rule-based approach or a similar masking approach, suggesting that this is a research direction left to explore further. After some consideration, we think it might be worth exploring, as future work, modelling this problem as an expression paraphrasing task. Of course, this leads to the problem of having no suitable datasets.
Nonetheless, it might be possible to adapt current paraphrase datasets by aligning each sentence pair at the token level, hopefully constructing a dataset of expressions with similar meanings that preserves their context, which may still be useful.
Modelling inter-sentence dependencies
Finally, the vast majority of TS-related research models the problem in a sentence-to-sentence scenario. However, in real-world applications it is often more useful to provide document-level simplification rather than sentence-level simplification.
Consider, for example, a legal text that should still be understood by non-experts but is usually too complex and long: if we were to simplify it at the sentence level, we would never achieve the optimal simplifications that stem from cutting largely irrelevant portions of text. This confirms that it might be useful to approach the TS problem from a different angle.
To this effect, @https://doi.org/10.48550/arXiv.2110.05071 introduced a new dataset, D-Wikipedia, that provides aligned articles instead of aligned sentences, along with a corresponding D-SARI metric, formalizing this TS paradigm. Furthermore, by introducing several baseline models, the authors showed there is merit to this approach.
Finally, we should also relate this version of the problem to abstractive summarization, which is closely related. The additional constraint is that, in document-level simplification, the resulting sentences must be simple, possibly leading to longer outputs in order to express ideas that could otherwise be condensed. Nevertheless, emulating successful summarization papers could serve this new paradigm well.
Acknowledgements {#acknowledgements .unnumbered}
We would like to thank professor Miriam Anschütz for her guidance and feedback during the semester, and also for the opportunity to work on the Text Simplification problem.
Explaining Kept Complex Words
As a side note that was not included in our final system: for words that were flagged for replacement but had no simpler synonym, it could be useful to include their definition in the produced simplification. To this effect, we used PyDictionary, a Python library that relies on WordNet to retrieve word definitions. Including the meaning of the word is done in a post-processing step and has no other effect on the system.
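A minimal sketch of this post-processing step is shown below, assuming PyDictionary's `meaning()` interface; the formatting of the inserted gloss is illustrative.

```python
# Sketch: append a WordNet definition after a complex word with no simpler synonym.
from PyDictionary import PyDictionary

dictionary = PyDictionary()

def append_definition(sentence: str, word: str) -> str:
    meanings = dictionary.meaning(word) or {}   # e.g. {"Noun": ["a colorless ..."]}
    for definitions in meanings.values():
        if definitions:
            return sentence.replace(word, f"{word} ({definitions[0]})", 1)
    return sentence

print(append_definition("... a hole in the methane cloud deck of Neptune.",
                        "methane"))
```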
As expected, adding more text (even potentially complex text) produced a poor SARI result. In table 3{reference-type="ref" reference="complex_word_def"}, we present an example of this definition incorporation procedure.
::: {#complex_word_def}
  The Great Dark Spot is thought to represent a hole in the [methane]{.underline} cloud deck of Neptune.

  The Great Dark Spot is thought to represent a hole in the [methane]{.underline} [(a colorless odorless gas used as a fuel)]{.underline} cloud deck of Neptune.

  : Complex word identification and definition example. The word 'methane' was identified as being complex. Since it does not have a simpler synonym in WordNet, its definition was added to the source sentence.
:::
Demonstration
Finally, to accompany our presentation, we built an app using Python and Streamlit to illustrate our system's capabilities. The app is publicly available on a HuggingFace Spaces page^5 for easy interaction.
Moreover, both the simplification model and the CWI regressor were made publicly available to the HuggingFace community under the names `twigs/bart-text2text-simplifier` and `twigs/cwi-regressor`, respectively.
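A possible way to load the released simplifier is sketched below. The control-token prefix format and values shown are illustrative; the model card documents the exact tokens the checkpoint expects.

```python
# Illustrative usage of the released checkpoint via the text2text pipeline.
from transformers import pipeline

simplifier = pipeline("text2text-generation",
                      model="twigs/bart-text2text-simplifier")

# Example control-token prefix (see Table 2); format is an assumption here.
prompt = ("c_0.8 lev_0.5 rank_0.8 dep_0.9 rat_0.9 n_syl_1.9 "
          "The minister abrogated the decree.")
print(simplifier(prompt, num_beams=5, max_length=128)[0]["generated_text"])
```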