Recent advancements in natural language processing have produced increasingly complex and powerful models capable of generating text that is both fluent and coherent. However, systems built on large language models remain prone to factual inconsistencies when generating text. This research enhances AlignScore, a leading model for evaluating textual consistency. First, we refine AlignScore's methodology by introducing new decomposition techniques for feeding summaries and document context into the model. We show that breaking the document context into overlapping intervals before passing it to AlignScore yields slight performance gains. Finally, we assess the ability of Large Language Models (LLMs) such as GPT to measure factual consistency. Our LLM-based prompting approach significantly outperforms other language model approaches in evaluating factual consistency on the AggreFact SOTA Test benchmark and the HaluEval Summarization dataset.
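
To make the overlapping-interval decomposition concrete, the sketch below shows one plausible way to window a long source document and score a summary against each window, keeping the most supportive score. This is an illustrative assumption, not the authors' implementation: the chunk size, stride, max-aggregation, and the generic `score_fn` callable (standing in for a consistency scorer such as AlignScore) are all hypothetical choices.

```python
# Minimal sketch (assumed, not the paper's code): decompose a document into
# overlapping word-level intervals and score a summary against each interval.

def overlapping_chunks(text, chunk_size=350, stride=175):
    """Yield overlapping windows of `chunk_size` words, advancing by `stride`."""
    words = text.split()
    if len(words) <= chunk_size:
        yield text
        return
    for start in range(0, len(words) - stride, stride):
        yield " ".join(words[start:start + chunk_size])


def chunked_consistency_score(document, summary, score_fn):
    """Score the summary against every chunk and keep the best score.

    `score_fn(context, claim)` is a placeholder for a consistency scorer
    (e.g., an AlignScore call); its exact interface is an assumption here.
    """
    return max(score_fn(chunk, summary) for chunk in overlapping_chunks(document))
```

Taking the maximum over chunks reflects one reasonable design choice: a summary sentence only needs to be supported by some portion of the source, so the best-matching interval determines the score.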
