⛓Chain of Verification (CoVe) — Understanding & Implementation💡

Preface

sourajit roy chowdhury
9 min read · Oct 9, 2023

For those eager to dive straight into the code, it’s available on my GitHub repository. However, I’d advise taking a moment to read through this article for a comprehensive understanding before diving in. The README file provides a thorough guide on how to use the code without any hassle.

Introduction

When dealing with Large Language Models (LLMs), a significant challenge, particularly in factual question answering, is the issue of hallucinations. Hallucinations occur when an answer appears plausible but is factually incorrect. Detecting them through high-level inspection is difficult and often requires a more detailed examination.

To address this challenge, the Meta AI team has introduced a method called Chain of Verification (CoVe), which consists of the following four sequential steps:

  1. Initial Baseline Response Creation: In this step, an initial response to the original question is generated as a starting point.
  2. Verification Question Generation: Verification questions are created to fact-check the baseline response. These questions are designed to scrutinize the accuracy of the initial response.
  3. Execute Verification: The verification questions are independently answered to minimize any potential bias. This step ensures that the verification process is objective and thorough.
  4. Final Refined Answer Generation: Based on the results of the verification process, a final refined answer is generated. This answer is expected to be more accurate and reliable, reducing the likelihood of hallucinations in the response.

The Chain of Verification (CoVe) method is designed to enhance the reliability of answers provided by Large Language Models, particularly in factual question and answering scenarios, by systematically verifying and refining responses to minimize inaccuracies.

In this article, I will try to provide an accessible explanation of the CoVe process along with a starter-level implementation. You can read the paper here.

🖇Chain of Verification 🖇

The concept behind Chain of Verification (CoVe) is grounded in the notion that a response generated by a Large Language Model (LLM) can be used to validate itself. This self-verification process is employed to assess the accuracy of the initial response and refine it for greater precision. Achieving this relies on skillfully crafting and sequencing LLM prompts.

In accordance with the research paper, we will delve into each of the steps involved in creating a coherent chain that enables the LLM to self-verify its responses.

Generate Baseline Response: When presented with an initial query, it is fed directly into the LLM without any special prompting to get an initial response. This step not only serves as the starting point of the CoVe pipeline but also produces the baseline that the rest of the pipeline aims to improve. Since baseline responses like these are often susceptible to hallucinations, the CoVe approach aims to detect and rectify those inaccuracies in the subsequent stages.

Plan Verification: Given the original query and the baseline response as conditions, the model is instructed to produce a set of verification questions designed to assess the accuracy of the factual assertions made in the initial baseline response. It’s important to emphasize that these verification questions are not pre-defined templates; instead, the language model has the flexibility to phrase them in any manner it deems appropriate. However, these verification questions should be constructed in such a way that their answers are helpful for refining the baseline response.

Execute Verification: Once the verification questions have been planned, the next step is to answer them systematically to determine whether any hallucinations are present. This verification process can use engineered techniques or external tools, such as verification via web search; alternatively, you can rely on the LLM itself at every stage of the CoVe process so that it validates its own responses. The authors explored several approaches to verification execution: joint, 2-step, factored, and factor+revise variants.
1. Joint: In this variant, both planning and verification are performed together with a single prompt request to the LLM. This method is not recommended, as the verification answers can themselves be hallucinated and are biased by the baseline response they are generated alongside.
2. 2-Step: This is the opposite of the joint variant: the verification questions are generated in a first prompt, and they are answered in a separate second prompt.
3. Factored: Instead of answering everything in one large response, each verification question is answered in its own separate prompt. This way, the answers cannot simply be copies of the baseline response. The approach also avoids interference between different questions and can handle more verification questions, although it is computationally more expensive. (A minimal sketch of this variant follows the list below.)
4. Factor + Revise: After the answers to the verification questions are obtained, the CoVe pipeline needs to check whether those answers are consistent with the baseline response. This is done as a separate step, by comparing the answers to the baseline response with an additional LLM prompt. This extra step helps the system reason more carefully about the comparison.
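
To make the factored variant concrete, here is a minimal Python sketch. It uses the plain OpenAI client rather than the LangChain chains from the actual implementation, and the model name and prompt wording are arbitrary choices for illustration.

```python
# Minimal sketch of the "factored" execution variant: each verification
# question is answered in its own prompt, without access to the baseline
# response, so the answers cannot simply copy its (possibly hallucinated) claims.
# Assumes the openai>=1.0 client and an OPENAI_API_KEY in the environment;
# the model name is an arbitrary choice for illustration.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """One self-contained LLM call, reused by the later sketches in this article."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def execute_verification(verification_questions: list[str]) -> list[tuple[str, str]]:
    # One independent prompt per verification question (the factored variant).
    return [
        (q, ask(f"Answer the following question concisely and factually:\n{q}"))
        for q in verification_questions
    ]
```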

Implementation (Python 🐍 + Langchain 🔗 + OpenAI 🦾 + Search Tool 🔍)

[Figure: The CoVe pipeline]

The verification process introduced by the authors is benchmarked on a series of questions. These questions are categorized into three main groups (although the authors originally break them into four categories):

1. Wiki Data & Wiki Category List: This category involves questions that expect answers in the form of a list of entities. For instance, questions like “Who are some politicians born in Boston?” or “Name some endemic orchids of Vietnam?” should result in answers that present a list of specific entities.

2. Multi-Span QA: Questions in this category seek multiple independent answers, each sourced from different non-adjacent sections of a text. An example would be: “Who invented the first mechanized printing press and in what year?” The answer is “Johannes Gutenberg, 1450”.

3. Long-form Generation: This category predominantly consists of biographical questions, as highlighted by the authors’ benchmark. However, it isn’t limited to biographies. Any question that requires a detailed or lengthy response falls under this group.

I have implemented the CoVe pipeline in line with the four stages outlined in the original paper. Based on the types of questions mentioned earlier, I’ve established three distinct CoVe chains. Additionally, I’ve incorporated a routing mechanism that directs the original query to the appropriate chain.

Please visit my GitHub repository to use the code and more details on getting started.

Router Mechanism: Upon a user entering their query or question, this mechanism springs into action. It categorizes the user’s question into one of the three previously mentioned categories: Wiki List Question, Multi Span Question, or Long Form Question. Depending on this categorization, the router then directs the question to the appropriate chain, each specifically designed to handle one of the three question types. This classification is achieved using a simple few-shot prompt. You can see the prompt here.
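
A hypothetical few-shot routing prompt along these lines is sketched below; the labels and example questions are illustrative rather than the exact prompt from the repository, and it reuses the ask() helper defined in the earlier sketch.

```python
# Illustrative few-shot routing prompt: classify the query into one of the
# three chain types. Reuses the ask() helper defined in the earlier sketch.
ROUTER_PROMPT = """Classify the question into exactly one category:
WIKI_LIST, MULTI_SPAN, or LONG_FORM.

Question: Who are some politicians born in Boston?
Category: WIKI_LIST

Question: Who invented the first mechanized printing press and in what year?
Category: MULTI_SPAN

Question: Tell me a bio of Albert Einstein.
Category: LONG_FORM

Question: {question}
Category:"""

def route(question: str) -> str:
    return ask(ROUTER_PROMPT.format(question=question)).strip()
```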

Baseline Response: This stage is straightforward and doesn’t require any prompt crafting. At this point, the user’s query is processed by the LLM, resulting in what we refer to as the “baseline response”. This initial response will subsequently be assessed and refined to produce the final answer. You can see the prompts for all of the question types here.
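
As a minimal sketch, this stage is nothing more than a direct call with the raw query (again reusing the ask() helper defined above):

```python
# Baseline response: the user's query goes to the LLM as-is, with no special prompting.
def generate_baseline(question: str) -> str:
    return ask(question)

# Example usage (the query foreshadows the example discussed in the next stage):
baseline = generate_baseline("Name the CEOs of US-based organizations who are of Indian origin.")
```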

Verification Questions Generation: This stage is pivotal, requiring meticulous crafting and optimization of the prompt to ensure that the verification questions align seamlessly with the original query. If these verification questions stray from the primary intent, the purpose of the entire chain could be compromised. To better grasp this, let’s consider an example.
Original Question: Name the CEOs of US-based organizations who are of Indian origin.
Baseline Response: 1. Satya Nadella (CEO of Microsoft), 2. Sundar Pichai (CEO of Google), 3. Mark Zuckerberg (CEO of Meta)
Verification Questions (Set-1): 1. Is Satya Nadella the CEO of Microsoft? 2. Is Sundar Pichai the CEO of Google? 3. Is Mark Zuckerberg the CEO of Meta?
Verification Questions (Set-2): 1. Is Satya Nadella, the CEO of Microsoft, of Indian origin? 2. Is Sundar Pichai, the CEO of Google, of Indian origin? 3. Is Mark Zuckerberg, the CEO of Meta, of Indian origin?
Upon closely examining the two sets of verification questions, we can observe the following:
In Set-1, all three questions will receive a verification answer of “Yes,” and the final refined response will include all three names from the baseline response. This isn’t the desired outcome, since the primary objective of the question is to identify CEOs who are of Indian origin; the questions in Set-1 fail to capture this specific intent.
Conversely, Set-2 is more aligned with our objective. For instance, the third verification question will correctly exclude Mark Zuckerberg because, while he is the CEO of Meta, he is not of Indian origin.
Hence, precise prompt engineering and thorough experimentation are crucial at this stage. For further insight into prompt structuring for various question types, you can refer here.
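
As an illustration of the point above, a hypothetical planning prompt might pass the original question alongside the baseline response so that every constraint in the query (such as “of Indian origin”) is carried into the verification questions. The wording below is illustrative, not the repository’s prompt, and it reuses the ask() helper from the earlier sketch.

```python
# Hypothetical verification-planning prompt. Keeping the ORIGINAL question in
# the prompt pushes the model to verify every condition it contains
# (e.g. "of Indian origin"), not only the surface facts in the baseline answer.
PLAN_PROMPT = """You are fact-checking an answer.

Original question: {question}
Baseline answer: {baseline}

Write one verification question for each factual claim in the baseline answer.
Every verification question must check the claim against ALL conditions stated
in the original question. Return one question per line."""

def plan_verification(question: str, baseline: str) -> list[str]:
    raw = ask(PLAN_PROMPT.format(question=question, baseline=baseline))
    return [line.strip() for line in raw.splitlines() if line.strip()]
```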

Execute Verification Questions: This stage is as crucial as the preceding one. Even with highly accurate verification questions that align with the main objective, the quality of the final refined answer greatly hinges on this phase. While the authors relied solely on the LLM to address the generated verification questions, one has the flexibility to utilize various concepts or external tools for this purpose. In my approach, I employed a free search tool, “duckduckgo-search”, to source the answers. These search results then serve as the reference context for the LLM to address each verification question. Alternatives include more sophisticated search tools, RAG-based systems, databases, or other retrieval tools and mechanisms to answer the verification questions crafted earlier. For further insight into prompt structuring you can refer here.
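
Below is a rough sketch of that search-grounded idea using the duckduckgo_search package; the DDGS API shown reflects recent versions of the package and may differ from the one pinned in the repository.

```python
# Answer a verification question using web-search snippets as grounding context
# instead of relying on the LLM's memory alone. Assumes the duckduckgo_search
# package; its API (DDGS().text) can vary between versions. Reuses ask() from above.
from duckduckgo_search import DDGS

def answer_with_search(question: str, max_results: int = 3) -> str:
    with DDGS() as ddgs:
        hits = ddgs.text(question, max_results=max_results)
        context = "\n".join(hit.get("body", "") for hit in hits)
    prompt = (
        "Using only the context below, answer the question concisely.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return ask(prompt)
```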

Final Refined Answer: This step is relatively straightforward. It involves utilizing all the previous data (original query, baseline response, verification questions, and their respective answers) to formulate a prompt that delivers the final refined answer. See the example prompt for reference.
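
Putting the pieces together, a compact and purely illustrative refinement prompt, plus a small driver that chains the earlier sketches into the four CoVe stages, might look like this:

```python
# Final refinement prompt plus a small driver tying the sketches above into the
# four CoVe stages. Prompt wording is illustrative only, not the repository's.
REFINE_PROMPT = """Original question: {question}
Baseline answer: {baseline}

Verification results:
{verification}

Rewrite the baseline answer so that it agrees with the verification results,
removing any claim they contradict and keeping only what satisfies the
original question."""

def cove(question: str) -> str:
    baseline = generate_baseline(question)
    questions = plan_verification(question, baseline)
    qa_pairs = [(q, answer_with_search(q)) for q in questions]
    verification = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return ask(REFINE_PROMPT.format(
        question=question, baseline=baseline, verification=verification,
    ))
```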

How to improve the overall CoVe pipeline

1️. Prompt Engineering: One of the major ways to improve the performance of any LLM-powered application is through prompt engineering and prompt optimization. You can check all the prompts used in the implementation on my GitHub. Try your own prompt engineering and experiment with your use case.
2️. External Tools: As the final output depends heavily on the answers to the verification questions, you can try different tools depending on the use case. For factual question answering you can use more advanced search tools such as Google Search or SERP APIs. For custom use cases you can always use RAG methods or other retrieval techniques to answer the verification questions.
3️. More Chains: I have implemented three chains according to the three question types (Wiki Data, Multi-Span QA & Long-Form QA) the authors used in their research. Depending on your use case, you can create additional chains to handle other types of questions and increase coverage.
4️. Human In Loop (HIL): HIL is an important component of many LLM-powered applications. In your specific application, the pipeline can be designed to incorporate HIL either for generating proper verification questions or for answering them, further improving the overall CoVe pipeline.

Limitations

Key limitations of the Chain-of-Verification (CoVe) method:

1. Incomplete Removal of Hallucinations: CoVe does not completely eliminate hallucinations in generated content, which means it can still produce incorrect or misleading information.

2. Limited Scope of Hallucination Mitigation: CoVe primarily addresses hallucinations in the form of directly stated factual inaccuracies but may not effectively handle other forms of hallucinations, such as errors in reasoning or opinions.

3. Increased Computational Expense: Generating and executing verification alongside responses in CoVe adds to the computational cost, similar to other reasoning methods like Chain-of-Thought.

4. Upper Bound on Improvement: The effectiveness of CoVe is limited by the overall capabilities of the underlying language model, particularly in its ability to identify and rectify its own mistakes.

Conclusion

The paper presented the Chain-of-Verification (CoVe) method, a strategy designed to make large language models think more critically about their answers and correct themselves if needed. It has been found that these models are better at answering focused verification questions than the original question alone, because the approach breaks verification down into simpler, more manageable questions. It has also been found that preventing the model from revisiting its previous answers helps it avoid repeating mistakes or “hallucinations”. In simple terms, the technique greatly improves the model’s responses just by making it double-check its answers. One potential improvement could be to give CoVe extra tools, such as the ability to pull information from external sources, which could further enhance its performance.

I hope this article offered valuable insight into the Chain-of-Verification (CoVe) method and its implementation. If you found the content informative and think it could be beneficial to others, I’d be grateful if you could like 👍, follow 👉, and share ✔️ this piece. Additionally, please consider giving a star to my GitHub repository. Your support and appreciation make a difference.
