<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>RAG - Inero Software - Software Consulting</title>
	<atom:link href="https://inero-software.com/tag/rag/feed/" rel="self" type="application/rss+xml" />
	<link>https://inero-software.com/tag/rag/</link>
	<description>We unleash innovations using cutting-edge technologies, modern design and AI</description>
	<lastBuildDate>Fri, 14 Feb 2025 14:33:46 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.1</generator>

<image>
	<url>https://inero-software.com/wp-content/uploads/2018/11/inero-logo-favicon.png</url>
	<title>RAG - Inero Software - Software Consulting</title>
	<link>https://inero-software.com/tag/rag/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">153509928</site>	<item>
		<title>Assessing Retrieval-Augmented Generation (RAG) Large Language Models (LLMs) with DeepEval for Complex Tabular Data</title>
		<link>https://inero-software.com/assessing-retrieval-augmented-generation-rag-large-language-models-llms-with-deepeval-for-complex-tabular-data/</link>
		
		<dc:creator><![CDATA[Martyna Mul]]></dc:creator>
		<pubDate>Tue, 04 Feb 2025 10:33:15 +0000</pubDate>
				<category><![CDATA[Company]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI development]]></category>
		<category><![CDATA[AI innovations]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[DeepEval]]></category>
		<category><![CDATA[Large Language Model]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[Retrieval-Augmented Generation]]></category>
		<guid isPermaLink="false">https://inero-software.com/?p=6902</guid>

					<description><![CDATA[<p>This post explores how DeepEval helps systematically assess the effectiveness of both retrieval and generation components, ensuring more reliable machine-generated insights. </p>
<p>The article <a href="https://inero-software.com/assessing-retrieval-augmented-generation-rag-large-language-models-llms-with-deepeval-for-complex-tabular-data/">Assessing Retrieval-Augmented Generation (RAG) Large Language Models (LLMs) with DeepEval for Complex Tabular Data</a> comes from <a href="https://inero-software.com">Inero Software - Software Consulting</a>.</p>
]]></description>
										<content:encoded><![CDATA[		<div data-elementor-type="wp-post" data-elementor-id="6902" class="elementor elementor-6902" data-elementor-post-type="post">
				<div class="elementor-element elementor-element-a77f132 e-flex e-con-boxed e-con e-parent" data-id="a77f132" data-element_type="container">
					<div class="e-con-inner">
		<div class="elementor-element elementor-element-eedef0f e-con-full e-flex e-con e-child" data-id="eedef0f" data-element_type="container">
				</div>
		<div class="elementor-element elementor-element-8bb2c58 e-con-full e-flex e-con e-child" data-id="8bb2c58" data-element_type="container">
		<div class="elementor-element elementor-element-cac0d92 e-con-full e-flex e-con e-child" data-id="cac0d92" data-element_type="container">
				<div class="elementor-element elementor-element-f3a0ecb elementor-widget elementor-widget-html" data-id="f3a0ecb" data-element_type="widget" data-widget_type="html.default">
				<div class="elementor-widget-container">
			 		</div>
				</div>
				<div class="elementor-element elementor-element-33c698c elementor-widget elementor-widget-text-editor" data-id="33c698c" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<h4><span class="TextRun SCXW184211874 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW184211874 BCX0">Retrieval-Augmented Generation (RAG) models are transforming the capabilities of intelligent assistants, enabling more </span><span class="NormalTextRun SCXW184211874 BCX0">accurate</span><span class="NormalTextRun SCXW184211874 BCX0"> and context-aware responses to user queries. Unlike traditional large language models (LLMs), RAG-based systems integrate two essential components: a retrieval mechanism that fetches relevant documents and a generative model that synthesizes responses based on real-time </span><span class="NormalTextRun SCXW184211874 BCX0">data. This post explores how </span><span class="NormalTextRun SCXW184211874 BCX0">DeepEval</span><span class="NormalTextRun SCXW184211874 BCX0"> helps systematically assess the effectiveness of both retrieval and generation components, ensuring more reliable machine-generated insights.</span></span><span class="EOP TrackedChange SCXW184211874 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></h4>						</div>
				</div>
				<div class="elementor-element elementor-element-6e6ea96 elementor-widget elementor-widget-text-editor" data-id="6e6ea96" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span data-contrast="auto">While RAG-enhanced virtual assistants significantly improve answer relevance, evaluating their performance remains a challenge. Since these models rely on both retrieval and text generation, a weak document-fetching step can lead to misleading or incorrect responses, even if the underlying LLM is highly advanced.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-contrast="auto">We’ll</span> <span data-contrast="auto">demonstrate</span><span data-contrast="auto"> this process using our custom AI-driven assistant</span><span data-contrast="auto">, designed to answer complex queries about </span><span data-contrast="auto">maritime economy statistics</span><span data-contrast="auto">, </span><span data-contrast="auto">showcasing</span><span data-contrast="auto"> how </span><span data-contrast="auto">LLM-powered knowledge retrieval enhances data-driven decision-making.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-d0eedd1 elementor-widget elementor-widget-heading" data-id="d0eedd1" data-element_type="widget" data-widget_type="heading.default">
				<div class="elementor-widget-container">
			<h3 class="elementor-heading-title elementor-size-default">SeaStat - Our AI Assistant </h3>		</div>
				</div>
				<div class="elementor-element elementor-element-caacdce elementor-widget elementor-widget-text-editor" data-id="caacdce" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TrackChangeTextInsertion TrackedChange SCXW210561514 BCX0"><span class="TextRun SCXW210561514 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW210561514 BCX0">A great example</span></span></span><span class="TrackChangeTextInsertion TrackedChange SCXW210561514 BCX0"><span class="TextRun SCXW210561514 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW210561514 BCX0"> that we can use to discuss this topic is the </span></span></span><span class="TrackChangeTextInsertion TrackedChange SCXW210561514 BCX0"><span class="TextRun SCXW210561514 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW210561514 BCX0">SeaStat</span></span></span> <span class="TrackChangeTextInsertion TrackedChange SCXW210561514 BCX0"><span class="TextRun SCXW210561514 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW210561514 BCX0">AI Assistant</span></span></span><span class="TrackChangeTextInsertion TrackedChange SCXW210561514 BCX0"><span class="TextRun SCXW210561514 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW210561514 BCX0"> developed by us as part of the Incone60 Green Project (https://www.incone60.eu/). The goal of the project is to improve the competitiveness and sustainable development of small seaports in the South Baltic region.</span></span></span><span class="EOP SCXW210561514 BCX0" data-ccp-props="{&quot;335551550&quot;:6,&quot;335551620&quot;:6}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-42941ab elementor-widget elementor-widget-image" data-id="42941ab" data-element_type="widget" data-widget_type="image.default">
				<div class="elementor-widget-container">
													<img fetchpriority="high" decoding="async" data-attachment-id="6904" data-permalink="https://inero-software.com/assessing-retrieval-augmented-generation-rag-large-language-models-llms-with-deepeval-for-complex-tabular-data/seastat/" data-orig-file="https://inero-software.com/wp-content/uploads/2025/02/SeaStat.png" data-orig-size="517,587" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="SeaStat" data-image-description="" data-image-caption="" data-medium-file="https://inero-software.com/wp-content/uploads/2025/02/SeaStat-264x300.png" data-large-file="https://inero-software.com/wp-content/uploads/2025/02/SeaStat.png" tabindex="0" role="button" width="517" height="587" src="https://inero-software.com/wp-content/uploads/2025/02/SeaStat.png" class="attachment-large size-large wp-image-6904" alt="" srcset="https://inero-software.com/wp-content/uploads/2025/02/SeaStat.png 517w, https://inero-software.com/wp-content/uploads/2025/02/SeaStat-264x300.png 264w" sizes="(max-width: 517px) 100vw, 517px" />													</div>
				</div>
				<div class="elementor-element elementor-element-bf0da8d elementor-widget elementor-widget-text-editor" data-id="bf0da8d" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TextRun SCXW10433028 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun CommentStart CommentHighlightPipeRestRefresh CommentHighlightRest SCXW10433028 BCX0">Duri</span><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0">ng the Incone60 Green Project, w</span><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0">e developed an AI assistant that answers questions about maritime economy data, providing instant access to structured maritime economic insights. This assistant </span><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0">leverages</span><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0"> a </span></span><span class="TextRun SCXW10433028 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0">Retrieval-Augmented Generation (RAG)</span></span><span class="TextRun SCXW10433028 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0"> approach, ensuring that responses are grounded in a structured database covering key aspects such as </span></span><span class="TextRun SCXW10433028 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0">seaports, maritime transport</span><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0">,</span><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0"> shipbuilding, passenger traffic, trade, and the fishing industry</span></span><span class="TextRun SCXW10433028 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun CommentHighlightRest SCXW10433028 BCX0">.</span></span><span class="EOP CommentHighlightPipeRestRefresh SCXW10433028 BCX0" 
data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-85e0e3e elementor-widget elementor-widget-text-editor" data-id="85e0e3e" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span data-contrast="auto">Our AI assistant operates within a RAG pipeline that integrates:</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><ul><li data-leveltext="" data-font="Symbol" data-listid="2" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"><b><span data-contrast="auto">A structured maritime economy database</span></b><span data-contrast="auto">, which includes global and Polish maritime statistics from 2017 to 2020. The data is sourced from publications by Gdynia Maritime University, which aggregate statistics from various government institutes, universities, and port enterprises. The database consists of 50 tables covering key aspects of maritime transport, and there are plans to extend it with additional years of data. 
</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="2" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"><b><span data-contrast="auto">Dynamic SQL generation</span></b><span data-contrast="auto"> to extract relevant information from the database.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="2" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"><b><span data-contrast="auto">A generative LLM</span></b><span data-contrast="auto"> that formulates answers based on the retrieved data.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><p><span data-contrast="auto">Building such an assistant requires several key decisions and parameter optimizations, including:</span><span 
data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><ul><li data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"><span data-contrast="auto">Selecting the most suitable LLM and tuning its parameters (e.g., temperature).</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"><span data-contrast="auto">Designing an effective prompt structure.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="3" 
data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"><span data-contrast="auto">Ensuring the assistant consistently selects the most relevant tables from the dataset.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><p><span data-contrast="auto">This is where </span><b><span data-contrast="auto">automatic testing</span></b><span data-contrast="auto"> becomes crucial. It helps assess system performance, identify weaknesses, and ensure continuous improvement.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p>						</div>
				</div>
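				<p>The three pipeline stages described above can be sketched roughly as follows. This is a minimal illustration, not the SeaStat implementation: the <code>port_cargo</code> table, its values, and the functions <code>generate_sql</code> and <code>generate_answer</code> are hypothetical stand-ins, and both LLM calls are replaced by fixed stubs so the example runs on its own.</p>

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory table (hypothetical schema, invented values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE port_cargo (port TEXT, year INTEGER, cargo_kt REAL)")
    conn.executemany(
        "INSERT INTO port_cargo VALUES (?, ?, ?)",
        [("Port A", 2019, 120.0), ("Port A", 2020, 95.5), ("Port B", 2020, 40.0)],
    )
    return conn


def generate_sql(question):
    """Stand-in for the LLM call that turns a question into SQL."""
    # A real assistant would prompt an LLM with the schema; here the query is fixed.
    return "SELECT port, cargo_kt FROM port_cargo WHERE year = 2020 ORDER BY port"


def generate_answer(question, rows):
    """Stand-in for the LLM call that phrases the retrieved rows as an answer."""
    facts = "; ".join(f"{port}: {cargo} kt" for port, cargo in rows)
    return f"Cargo handled in 2020 - {facts}"


def answer_question(conn, question):
    """Run the retrieve-then-generate flow and keep the rows as retrieval context."""
    sql = generate_sql(question)
    rows = conn.execute(sql).fetchall()
    return generate_answer(question, rows), rows


conn = build_demo_db()
answer, context = answer_question(conn, "How much cargo did each port handle in 2020?")
print(answer)
```

				<p>Returning the retrieved rows alongside the answer is a deliberate choice in this sketch: an evaluation step needs both to tell a retrieval failure from a generation failure.</p>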
				<div class="elementor-element elementor-element-648b32b elementor-widget elementor-widget-heading" data-id="648b32b" data-element_type="widget" data-widget_type="heading.default">
				<div class="elementor-widget-container">
			<h3 class="elementor-heading-title elementor-size-default">LLM-as-a-Judge: Automating RAG Model Evaluation  </h3>		</div>
				</div>
				<div class="elementor-element elementor-element-09073bd elementor-widget elementor-widget-text-editor" data-id="09073bd" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span data-contrast="auto">Evaluating systems that generate non-deterministic, open-ended text outputs can be challenging because there is often no single &#8220;correct&#8221; answer. While human evaluation is accurate, it can be costly and time-consuming.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><b><span data-contrast="auto">LLM-as-a-Judge</span></b><span data-contrast="auto"> is a method that approximates human evaluation by rating the system&#8217;s output based on custom criteria tailored to your specific application. One such testing framework is </span><b><span data-contrast="auto">DeepEval</span></b><span data-contrast="auto">, which provides a set of metrics designed for both retrieval and generation tasks and allows you to create your own rating criteria. </span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p>						</div>
				</div>
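				<p>To make the LLM-as-a-Judge idea concrete, here is a rough sketch of its general shape: build a grading prompt from custom criteria, send it to a judge model, and parse a numeric score from the reply. The prompt wording, the 0 to 10 scale, and the <code>fake_judge</code> function are assumptions made for illustration; this is not how DeepEval works internally, and a real setup would call an actual LLM instead of the stub.</p>

```python
import re


def build_judge_prompt(question, actual, expected, criteria):
    """Assemble a grading prompt from the custom criteria and the answers to compare."""
    steps = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are grading an assistant's answer.\n"
        f"Criteria:\n{steps}\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Actual answer: {actual}\n"
        "Reply with 'Score: N' where N is an integer from 0 to 10."
    )


def parse_score(reply):
    """Extract the score from the judge's reply and normalise it to the range 0..1."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        raise ValueError("judge reply contained no score")
    return min(int(match.group(1)), 10) / 10


def fake_judge(prompt):
    """Hypothetical stand-in for a real LLM call; always returns the same verdict."""
    return "Score: 8"


prompt = build_judge_prompt(
    question="How many passengers did Port A serve in 2020?",
    actual="Port A served about 400 passengers.",
    expected="Port A served 400 passengers in 2020.",
    criteria=["Check factual accuracy against the expected answer.",
              "Penalize missing information."],
)
score = parse_score(fake_judge(prompt))
print(score)
```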
				<div class="elementor-element elementor-element-b4e1ef9 elementor-widget elementor-widget-text-editor" data-id="b4e1ef9" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span data-contrast="auto">Key evaluation metrics are:</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><ul><li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"><b><span data-contrast="auto">G-Eval</span></b><span data-contrast="auto">: A versatile metric that evaluates LLM output based on custom-defined criteria.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"><b><span data-contrast="auto">Answer Relevancy</span></b><span data-contrast="auto">: Measures how well the model’s response addresses the user query.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="4" 
data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"><b><span data-contrast="auto">Faithfulness</span></b><span data-contrast="auto">: Assesses how accurately the response aligns with the provided context, helping to limit hallucination in RAG systems.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="4" data-aria-level="1"><b><span data-contrast="auto">ContextualRecallMetric, ContextualPrecisionMetric, ContextualRelevancyMetric</span></b><span data-contrast="auto">: These metrics are particularly useful for RAG systems, evaluating whether retrieval components return all relevant context while avoiding irrelevant information.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul>						</div>
				</div>
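				<p>The intuition behind the contextual metrics (return everything relevant, avoid what is irrelevant) can be illustrated with a plain set-based calculation over selected tables. This is only an intuition aid under simplifying assumptions: DeepEval's contextual metrics rely on an LLM judge rather than exact matching, and the table names below are hypothetical.</p>

```python
def retrieval_precision_recall(retrieved, relevant):
    """Set-based precision and recall for a list of retrieved items."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set.intersection(relevant_set)
    # Precision: share of retrieved items that were actually needed.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of needed items that were actually retrieved.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


precision, recall = retrieval_precision_recall(
    retrieved=["port_cargo", "passenger_traffic", "fishing_fleet"],
    relevant=["port_cargo", "passenger_traffic"],
)
print(round(precision, 2), recall)
```

				<p>In this toy case every needed table was retrieved (recall of 1.0), but one of the three selected tables was unnecessary, which lowers precision to two thirds.</p>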
				<div class="elementor-element elementor-element-1c250db elementor-widget elementor-widget-heading" data-id="1c250db" data-element_type="widget" data-widget_type="heading.default">
				<div class="elementor-widget-container">
			<h3 class="elementor-heading-title elementor-size-default">Step-by-Step RAG Model Testing with DeepEval  </h3>		</div>
				</div>
				<div class="elementor-element elementor-element-e3db2c9 elementor-widget elementor-widget-text-editor" data-id="e3db2c9" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TrackedChange SCXW136457389 BCX0"><span class="TextRun SCXW136457389 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW136457389 BCX0">To ensure the reliability and accuracy of our Retrieval-Augmented Generation (RAG) model, we follow a structured evaluation approach. </span></span></span><span class="TrackedChange SCXW136457389 BCX0"><span class="TextRun SCXW136457389 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW136457389 BCX0">This process involves dataset creation, response generation, and model evaluation using </span></span></span><span class="TrackedChange SCXW136457389 BCX0"><span class="TextRun SCXW136457389 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW136457389 BCX0">DeepEval</span></span></span><span class="TextRun SCXW136457389 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW136457389 BCX0">, allowing us to systematically assess the effectiveness of both retrieval and generation components.</span></span> <span class="TextRun SCXW136457389 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW136457389 BCX0">Let’s</span><span class="NormalTextRun SCXW136457389 BCX0"> break down each step in detail.</span></span><span class="EOP SCXW136457389 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240,&quot;335559740&quot;:279}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-019ed5f elementor-widget elementor-widget-heading" data-id="019ed5f" data-element_type="widget" data-widget_type="heading.default">
				<div class="elementor-widget-container">
			<h4 class="elementor-heading-title elementor-size-default">1. Dataset Creation </h4>		</div>
				</div>
				<div class="elementor-element elementor-element-741ed12 elementor-widget elementor-widget-text-editor" data-id="741ed12" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TextRun SCXW40841927 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW40841927 BCX0">To evaluate performance, we create a test set consisting of:</span></span><span class="LineBreakBlob BlobObject DragDrop SCXW40841927 BCX0"><span class="SCXW40841927 BCX0"> </span><br class="SCXW40841927 BCX0" /></span><span class="LineBreakBlob BlobObject DragDrop SCXW40841927 BCX0"><span class="SCXW40841927 BCX0"> </span><br class="SCXW40841927 BCX0" /></span><span class="TextRun SCXW40841927 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW40841927 BCX0">&#8211; </span></span><span class="TextRun SCXW40841927 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW40841927 BCX0">Realistic questions</span></span><span class="TextRun SCXW40841927 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW40841927 BCX0"> that users might ask. These can range from simple fact-based queries to more complex, multi-step inquiries that require detailed answers drawn from multiple tables.</span></span><span class="LineBreakBlob BlobObject DragDrop SCXW40841927 BCX0"><span class="SCXW40841927 BCX0"> </span><br class="SCXW40841927 BCX0" /></span><span class="TextRun SCXW40841927 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW40841927 BCX0">&#8211; </span></span><span class="TextRun SCXW40841927 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW40841927 BCX0">Expected ground truth responses</span></span><span class="TextRun SCXW40841927 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW40841927 BCX0"> derived directly from the database.</span></span><span class="LineBreakBlob BlobObject DragDrop SCXW40841927 BCX0"><span class="SCXW40841927 BCX0"> </span><br class="SCXW40841927 BCX0" /></span></p>						</div>
				</div>
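				<p>One way to keep ground truth answers tied to the database is to generate them from queries rather than typing them by hand. The sketch below assumes a hypothetical <code>passenger_traffic</code> table with invented values and a hypothetical <code>GoldenExample</code> record; the real test set and schema may look quite different.</p>

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class GoldenExample:
    """One test item: a user question and the answer the database supports."""
    question: str
    expected_output: str


# Hypothetical table with invented values, standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passenger_traffic (port TEXT, year INTEGER, passengers INTEGER)")
conn.executemany(
    "INSERT INTO passenger_traffic VALUES (?, ?, ?)",
    [("Port A", 2019, 1000), ("Port A", 2020, 400)],
)


def passengers_example(conn, port, year):
    """Build a question and derive its expected answer directly from the table."""
    (count,) = conn.execute(
        "SELECT passengers FROM passenger_traffic WHERE port = ? AND year = ?",
        (port, year),
    ).fetchone()
    return GoldenExample(
        question=f"How many passengers did {port} serve in {year}?",
        expected_output=f"{port} served {count} passengers in {year}.",
    )


test_set = [passengers_example(conn, "Port A", year) for year in (2019, 2020)]
for example in test_set:
    print(example.question, "|", example.expected_output)
```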
				<div class="elementor-element elementor-element-f4f0b50 elementor-widget elementor-widget-heading" data-id="f4f0b50" data-element_type="widget" data-widget_type="heading.default">
				<div class="elementor-widget-container">
			<h4 class="elementor-heading-title elementor-size-default">2. Generating Model Responses </h4>		</div>
				</div>
				<div class="elementor-element elementor-element-587939b elementor-widget elementor-widget-text-editor" data-id="587939b" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TextRun SCXW56801091 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW56801091 BCX0">For each test query, the assistant generates an answer based on the relevant data retrieved from the database.</span></span><span class="EOP SCXW56801091 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p>						</div>
				</div>
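				<p>This step can be sketched as a simple loop that stores, for every question, both the generated answer and the data that was retrieved for it. The function names retrieve and generate below are placeholders for the assistant's own retriever and LLM call, which are not shown in this article; the stubs exist only to make the sketch runnable.</p>

```python
def build_records(questions, retrieve, generate):
    """Collect the answer and the retrieved context for each test question."""
    records = []
    for question in questions:
        context = retrieve(question)          # data pulled from the database
        answer = generate(question, context)  # answer produced from that data
        records.append(
            {
                "input": question,
                "actual_output": answer,
                "retrieval_context": context,
            }
        )
    return records


# Stand-in implementations (assumptions, for demonstration only).
def fake_retrieve(question):
    return ["row for: " + question]


def fake_generate(question, context):
    return "answer based on " + str(len(context)) + " row(s)"


records = build_records(["How much cargo passed in 2019?"], fake_retrieve, fake_generate)
print(records[0]["actual_output"])
```

				<p>Saving the retrieved context alongside the answer matters because the retrieval metrics used in the next step need it.</p>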
				<div class="elementor-element elementor-element-0deb934 elementor-widget elementor-widget-heading" data-id="0deb934" data-element_type="widget" data-widget_type="heading.default">
				<div class="elementor-widget-container">
			<h4 class="elementor-heading-title elementor-size-default">3. Evaluation using DeepEval </h4>		</div>
				</div>
				<div class="elementor-element elementor-element-7c0b685 elementor-widget elementor-widget-text-editor" data-id="7c0b685" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span data-contrast="auto">We are particularly focused on </span><b><span data-contrast="auto">factual correctness</span></b><span data-contrast="auto"> for our assistant, so we use the </span><b><span data-contrast="auto">G-Eval metric</span></b><span data-contrast="auto"> to evaluate this aspect.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-contrast="auto">G-Eval is configured by describing the evaluation criteria as a list of steps, for example:</span><span data-ccp-props="{}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-3d98b09 elementor-widget elementor-widget-text-editor" data-id="3d98b09" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<pre><span data-contrast="auto">from deepeval.metrics import GEval</span> <br /><span data-contrast="auto">from deepeval.test_case import LLMTestCaseParams</span> <br /> <br /><span data-contrast="auto">correctness_metric = GEval(</span> <br /><span data-contrast="auto">    name="Correctness",</span> <br /><span data-contrast="auto">    evaluation_steps=[</span> <br /><span data-contrast="auto">        "Assess whether the actual output is accurate in terms of facts compared to the expected output.",</span> <br /><span data-contrast="auto">        "Penalize missing information."</span> <br /><span data-contrast="auto">    ],</span> <br /><span data-contrast="auto">    evaluation_params=[</span> <br /><span data-contrast="auto">       LLMTestCaseParams.INPUT,</span> <br /><span data-contrast="auto">       LLMTestCaseParams.ACTUAL_OUTPUT,</span> <br /><span data-contrast="auto">       LLMTestCaseParams.EXPECTED_OUTPUT</span> <br /><span data-contrast="auto">    ],</span> <br /><span data-contrast="auto">)</span> </pre>						</div>
				</div>
				<div class="elementor-element elementor-element-63d0764 elementor-widget elementor-widget-text-editor" data-id="63d0764" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TextRun SCXW196212698 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW196212698 BCX0">Additionally, we use several built-in metrics:</span></span><span class="EOP SCXW196212698 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-4344a23 elementor-widget elementor-widget-text-editor" data-id="4344a23" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<pre><span class="TextRun SCXW8241585 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW8241585 BCX0">contextual_precision</span><span class="NormalTextRun SCXW8241585 BCX0"> = </span><span class="NormalTextRun SCXW8241585 BCX0">ContextualPrecisionMetric</span><span class="NormalTextRun SCXW8241585 BCX0">()</span></span><span class="LineBreakBlob BlobObject DragDrop SCXW8241585 BCX0"><span class="SCXW8241585 BCX0"> </span><br class="SCXW8241585 BCX0" /></span><span class="TextRun SCXW8241585 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW8241585 BCX0">contextual_recall = </span><span class="NormalTextRun SCXW8241585 BCX0">ContextualRecallMetric</span><span class="NormalTextRun SCXW8241585 BCX0">()</span></span><span class="LineBreakBlob BlobObject DragDrop SCXW8241585 BCX0"><span class="SCXW8241585 BCX0"> </span><br class="SCXW8241585 BCX0" /></span><span class="TextRun SCXW8241585 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW8241585 BCX0">contextual_relevancy = </span><span class="NormalTextRun SCXW8241585 BCX0">ContextualRelevancyMetric</span><span class="NormalTextRun SCXW8241585 BCX0">()</span></span><span class="LineBreakBlob BlobObject DragDrop SCXW8241585 BCX0"><span class="SCXW8241585 BCX0"> </span><br class="SCXW8241585 BCX0" /></span><span class="TextRun SCXW8241585 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW8241585 BCX0">answer_relevancy = </span><span class="NormalTextRun SCXW8241585 BCX0">AnswerRelevancyMetric</span><span class="NormalTextRun SCXW8241585 BCX0">()</span></span><span class="LineBreakBlob BlobObject DragDrop SCXW8241585 BCX0"><span class="SCXW8241585 BCX0"> </span><br class="SCXW8241585 BCX0" /></span><span class="TextRun SCXW8241585 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW8241585 BCX0">faithfulness = </span><span 
class="NormalTextRun SCXW8241585 BCX0">FaithfulnessMetric</span><span class="NormalTextRun SCXW8241585 BCX0">()</span></span><span class="EOP SCXW8241585 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span></pre>						</div>
				</div>
				<div class="elementor-element elementor-element-5d2ddcf elementor-widget elementor-widget-text-editor" data-id="5d2ddcf" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TextRun SCXW77075170 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW77075170 BCX0">We then define test cases:</span></span><span class="EOP SCXW77075170 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-f16e61d elementor-widget elementor-widget-text-editor" data-id="f16e61d" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<pre><span data-contrast="auto">test_case = LLMTestCase(</span> <br /><span data-contrast="auto">    input="...",  # user prompt</span> <br /><span data-contrast="auto">    actual_output="...",  # model output</span> <br /><span data-contrast="auto">    expected_output="...",  # ground truth response</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span><br /><span data-contrast="auto">    retrieval_context=[...],  # data extracted by the retriever; in our case, rows from the database</span> <br /><span data-contrast="auto">)</span> <br /><span data-ccp-props="{}"> </span></pre>						</div>
				</div>
				<div class="elementor-element elementor-element-466c61d elementor-widget elementor-widget-text-editor" data-id="466c61d" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TextRun SCXW131448305 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW131448305 BCX0">Here is one of the test cases we used to evaluate our SeaStat Assistant:</span></span><span class="EOP SCXW131448305 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-53c56bc elementor-widget elementor-widget-text-editor" data-id="53c56bc" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<pre><span data-contrast="auto">test_case = LLMTestCase(</span> <br /><span data-contrast="auto">    input='Compare cargo traffic in Suez Canal and Panama Canal in 2019',</span> <br /><span data-contrast="auto">    actual_output='In 2019, the cargo traffic data for the Suez Canal and Panama Canal was as follows: Suez Canal - 1031 million tons; Panama Canal - 243059 thousand tons. The Suez Canal had significantly higher cargo traffic compared to the Panama Canal in 2019.',</span> <br /><span data-contrast="auto">    expected_output='In 2019, the Suez Canal handled 1,031 million tons of cargo, whereas the Panama Canal transported only 243 million tons. This indicates that the Suez Canal carried a substantially higher volume of cargo than the Panama Canal that year.',</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span><br /><span data-contrast="auto">    retrieval_context=[</span><span data-ccp-props="{}"> </span><br /><span data-contrast="auto">        {'table': 'Suez_Canal_Cargo_Traffic', 'year': 2019, 'cargo_volume_million_tons': 1031},</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span><br /><span data-contrast="auto">        {'table': 'Panama_Canal_Cargo_Traffic', 'year': 2019, 'direction': 'Atlantic – Pacific', 'cargo_volume_thousand_tons': 156899},</span> <br /><span data-contrast="auto">        {'table': 'Panama_Canal_Cargo_Traffic', 'year': 2019, 'direction': 'Pacific – Atlantic', 'cargo_volume_thousand_tons': 86160}</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span><br /><span data-contrast="auto">    ]</span> <br /><span data-contrast="auto">)</span> </pre>						</div>
				</div>
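				<p>Depending on the DeepEval version, retrieval_context may need to be a list of strings rather than a list of dictionaries; this is an assumption worth checking against the installed release. If so, the database rows can be serialised before the test case is built. The helper name rows_to_context below is hypothetical and JSON is only one possible text format.</p>

```python
import json

# Rows as they appear in the test case above.
rows = [
    {"table": "Suez_Canal_Cargo_Traffic", "year": 2019,
     "cargo_volume_million_tons": 1031},
    {"table": "Panama_Canal_Cargo_Traffic", "year": 2019,
     "direction": "Atlantic – Pacific", "cargo_volume_thousand_tons": 156899},
    {"table": "Panama_Canal_Cargo_Traffic", "year": 2019,
     "direction": "Pacific – Atlantic", "cargo_volume_thousand_tons": 86160},
]


def rows_to_context(db_rows):
    """Turn each database row into one JSON string (hypothetical helper)."""
    return [json.dumps(row, ensure_ascii=False, sort_keys=True) for row in db_rows]


retrieval_context = rows_to_context(rows)
for chunk in retrieval_context:
    print(chunk)
```

				<p>One string per row keeps the ranking of individual rows visible to metrics such as Contextual Precision.</p>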
				<div class="elementor-element elementor-element-2c1ba07 elementor-widget elementor-widget-text-editor" data-id="2c1ba07" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span class="TextRun SCXW81219040 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW81219040 BCX0">Finally, we run the evaluation:</span></span><span class="EOP SCXW81219040 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559731&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-fb89c70 elementor-widget elementor-widget-text-editor" data-id="fb89c70" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<pre><span data-contrast="auto">assert_test(test_case, [correctness_metric, answer_relevancy, contextual_precision, contextual_recall, contextual_relevancy, faithfulness])</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559731&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span></pre>						</div>
				</div>
				<div class="elementor-element elementor-element-9893049 elementor-widget elementor-widget-heading" data-id="9893049" data-element_type="widget" data-widget_type="heading.default">
				<div class="elementor-widget-container">
			<h4 class="elementor-heading-title elementor-size-default">4. Testing results </h4>		</div>
				</div>
				<div class="elementor-element elementor-element-283d669 elementor-widget elementor-widget-text-editor" data-id="283d669" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span data-contrast="auto">DeepEval assigns each metric a score between 0 and 1, accompanied by a descriptive explanation of the rating. Below are the results from a test case evaluating SeaStat&#8217;s response to the prompt:</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><b><span data-contrast="auto">&#8220;Compare cargo traffic in the Suez Canal and Panama Canal in 2019.&#8221;</span></b><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-contrast="auto">Metric interpretations:</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><ul><li data-leveltext="" data-font="Symbol" data-listid="8" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"><b><span data-contrast="auto">Contextual Recall</span></b> <b><span data-contrast="auto">(1.0)</span></b><span data-contrast="auto"> &#8211; The retriever returned all the necessary information: every essential detail from the expected output was present in the retrieval context.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li 
data-leveltext="" data-font="Symbol" data-listid="8" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"><b><span data-contrast="auto">Contextual Relevancy (0.95)</span></b><span data-contrast="auto"> and </span><b><span data-contrast="auto">Contextual Precision (1.0)</span></b><span data-contrast="auto"> &#8211; The retrieved context was highly relevant to the query, showing that the retriever pulled information accurately related to the input.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="9" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"><b><span data-contrast="auto">Faithfulness</span></b> <b><span data-contrast="auto">(1.0)</span></b><span data-contrast="auto"> &#8211; The model’s response remained perfectly factual, strictly adhering to the retrieved information without introducing any hallucinations.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" 
data-listid="9" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"><b><span data-contrast="auto">Answer Relevancy</span></b> <b><span data-contrast="auto">(1.0)</span></b><span data-contrast="auto"> – The model&#8217;s response fully addressed the user query and contained no irrelevant details.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><ul><li data-leveltext="" data-font="Symbol" data-listid="9" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559683&quot;:0,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"><b><span data-contrast="auto">Correctness</span></b> <b><span data-contrast="auto">(0.78)</span></b><span data-contrast="auto"> – The score was lower because the answer gave the Panama Canal figure in thousand tons (243059) while the expected output used million tons (243), so the judge flagged a unit and rounding discrepancy.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0}"> </span></li></ul><p><span data-contrast="auto">By systematically analyzing test cases with DeepEval, we gain valuable insights into where our RAG model excels and where improvements are needed. 
Future optimizations could include refining retrieval strategies, adjusting prompt engineering, or fine-tuning LLM parameters for better factual accuracy.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p>						</div>
				</div>
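				<p>The gap behind the Correctness score can be reproduced with simple arithmetic on the retrieved rows. The sketch below only illustrates the unit conversion; it is not part of the evaluation pipeline, and the variable names are chosen for this example.</p>

```python
# Panama Canal rows from the retrieval context, in thousand tons.
panama_thousand_tons = [156899, 86160]

# Sum of both directions, as reported in the model's answer.
panama_total_thousand = sum(panama_thousand_tons)

# The same value in million tons, the unit used by the expected output.
panama_total_million = panama_total_thousand / 1000

# Suez Canal value from the retrieval context, already in million tons.
suez_million = 1031

print(panama_total_thousand)        # total in thousand tons
print(round(panama_total_million))  # rounded total in million tons
print(suez_million > panama_total_million)
```

				<p>The answer and the ground truth therefore agree once both are expressed in the same unit, which suggests normalising units in the prompt or in post-processing before comparison.</p>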
				<div class="elementor-element elementor-element-0445df6 elementor-widget elementor-widget-text-editor" data-id="0445df6" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<table style="font-weight: 400;" data-tablestyle="MsoTableGrid" data-tablelook="1696" aria-rowcount="7"><tbody><tr aria-rowindex="1"><td data-celllook="0"><p><b><span data-contrast="auto">Test case</span></b><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><b><span data-contrast="auto">Metric</span></b><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><b><span data-contrast="auto">Score</span></b><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><b><span data-contrast="auto">Status</span></b><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><b><span data-contrast="auto">Overall Success Rate</span></b><span data-ccp-props="{}"> </span></p></td></tr><tr aria-rowindex="2"><td colspan="1" rowspan="6" data-celllook="0"><p><span data-contrast="auto">test_case_0</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">Correctness (GEval)</span><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">0.78 (threshold=0.5, evaluation model=gpt-4o, reason=The actual output closely matches the expected output in terms of cargo volumes and comparative conclusion, but the numbers are expressed in different units (thousand tons vs million tons) and slightly differ, which may indicate rounding or conversion discrepancies., error=None)</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">PASSED</span><span data-ccp-props="{}"> </span></p></td><td colspan="1" rowspan="6" data-celllook="0"><p><span 
data-contrast="auto">100%</span><span data-ccp-props="{}"> </span></p></td></tr><tr aria-rowindex="3"><td data-celllook="0"><p><span data-contrast="auto">Answer Relevancy</span><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">1.0 (threshold=0.5, evaluation model=gpt-4o, reason=The score is 1.00 because the response thoroughly addressed the comparison of cargo traffic in the Suez Canal and the Panama Canal in 2019 with no irrelevant details included. It&#8217;s precise and to the point, showcasing a deep understanding of the topic., error=None)</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">PASSED</span><span data-ccp-props="{}"> </span></p></td></tr><tr aria-rowindex="4"><td data-celllook="0"><p><span data-contrast="auto">Contextual Precision</span><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">1.0 (threshold=0.5, evaluation model=gpt-4o, reason=The score is 1.00 because the relevant nodes, offering essential data for comparing cargo traffic in the Suez and Panama Canals in 2019, are perfectly ranked at the top. 
These nodes effectively deliver a comprehensive breakdown of cargo volumes through both canals during that year, ensuring accurate comparisons can be made efficiently., error=None)</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">PASSED</span><span data-ccp-props="{}"> </span></p></td></tr><tr aria-rowindex="5"><td data-celllook="0"><p><span data-contrast="auto">Contextual Recall</span><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">1.0 (threshold=0.5, evaluation model=gpt-4o, reason=The score is 1.00 because every sentence in the expected output aligns perfectly with the data from the nodes in the retrieval context, effectively illustrating the significant difference in cargo volumes handled by both canals. 
Well done on maintaining precise and accurate attention to detail!, error=None)</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">PASSED</span><span data-ccp-props="{}"> </span></p></td></tr><tr aria-rowindex="6"><td data-celllook="0"><p><span data-contrast="auto">Contextual Relevancy</span><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">0.95 (threshold=0.5, evaluation model=gpt-4o, reason=The score is 0.95 because although the context is rich with detailed data on Suez Canal cargo traffic, it lacks specific information on the Panama Canal&#8217;s cargo traffic, necessitating additional data for a complete comparison., error=None)</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">PASSED</span><span data-ccp-props="{}"> </span></p></td></tr><tr aria-rowindex="7"><td data-celllook="0"><p><span data-contrast="auto">Faithfulness</span><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">1.0 (threshold=0.5, evaluation model=gpt-4o, reason=Awesome job! The score is 1.00 because there are no contradictions present, showcasing perfect alignment and faithfulness of the actual output to the retrieval context. 
Keep up the excellent work!, error=None)</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-ccp-props="{}"> </span></p></td><td data-celllook="0"><p><span data-contrast="auto">PASSED</span><span data-ccp-props="{}"> </span></p></td></tr></tbody></table>						</div>
				</div>
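				<p>The Status and Overall Success Rate columns can be reproduced from the scores alone. The sketch below assumes that a metric passes when its score is at least the threshold and that the success rate is the share of passed metrics; DeepEval's internal calculation is not shown in this article, so treat this as an illustration rather than the library's implementation.</p>

```python
# Scores taken from the table above.
scores = {
    "Correctness (GEval)": 0.78,
    "Answer Relevancy": 1.0,
    "Contextual Precision": 1.0,
    "Contextual Recall": 1.0,
    "Contextual Relevancy": 0.95,
    "Faithfulness": 1.0,
}

THRESHOLD = 0.5  # threshold reported for every metric in the table

# Assumed rule: a metric passes when its score reaches the threshold.
status = {
    name: "PASSED" if score >= THRESHOLD else "FAILED"
    for name, score in scores.items()
}

passed = sum(1 for result in status.values() if result == "PASSED")
success_rate = 100 * passed / len(status)

for name, result in status.items():
    print(name, scores[name], result)
print("Overall success rate:", success_rate)
```

				<p>With a threshold of 0.5 even the 0.78 Correctness score passes, so raising the threshold for critical metrics may be worth considering.</p>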
				<div class="elementor-element elementor-element-abdf550 elementor-widget elementor-widget-text-editor" data-id="abdf550" data-element_type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
							<p><span data-contrast="auto">Evaluating Retrieval-Augmented Generation (RAG) models requires a structured approach to ensure both retrieval accuracy and response reliability. </span><span data-contrast="auto">LLM-as-a-Judge</span> <span data-contrast="auto">provides</span><span data-contrast="auto"> an efficient alternative to human evaluation by systematically assessing outputs based on predefined criteria, enabling scalable and cost-effective validation.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-contrast="auto">Using </span><span data-contrast="auto">DeepEval</span><span data-contrast="auto">, we tested our AI-driven </span><span data-contrast="auto">SeaStat</span><span data-contrast="auto"> Assistant</span><span data-contrast="auto"> against key evaluation metrics, including </span><span data-contrast="auto">Correctness (G-Eval), Answer Relevancy, Contextual Precision, Contextual Recall, Contextual Relevancy, and Faithfulness</span><span data-contrast="auto">. The results highlighted </span><span data-contrast="auto">minor discrepancies in numerical representation, missing contextual details, and retrieval precision—insights crucial f</span><span data-contrast="auto">o</span><span data-contrast="auto">r refining model performance.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-contrast="auto">These findings emphasize that </span><span data-contrast="auto">even high-performing RAG models require rigorous evaluation to ensure factual accuracy and prevent misleading outputs</span><span data-contrast="auto">. 
By automating this process, we enable continuous model improvement, ensuring </span><span data-contrast="auto">AI-driven assistants deliver reliable, context-aware insights at scale</span><span data-contrast="auto">.</span> <span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p><p><span data-contrast="auto">AI-powered assistants are undoubtedly a technology that will become an indispensable tool for employees at all levels—from executives and directors to specialists. Their dynamic development allows them to instantly adapt to business needs and evolving expectations.</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559738&quot;:240,&quot;335559739&quot;:240}"> </span></p>						</div>
				</div>
				<div class="elementor-element elementor-element-308ac2d elementor-cta--skin-cover elementor-animated-content elementor-bg-transform elementor-bg-transform-zoom-in elementor-widget elementor-widget-call-to-action" data-id="308ac2d" data-element_type="widget" data-widget_type="call-to-action.default">
				<div class="elementor-widget-container">
					<div class="elementor-cta">
					<div class="elementor-cta__bg-wrapper">
				<div class="elementor-cta__bg elementor-bg" style="background-image: url(https://inero-software.com/wp-content/uploads/2024/12/3-1030x1030.png);" role="img" aria-label="3"></div>
				<div class="elementor-cta__bg-overlay"></div>
			</div>
							<div class="elementor-cta__content">
				
									<h2 class="elementor-cta__title elementor-cta__content-item elementor-content-item elementor-animated-item--grow">
						We create reliable AI assistants					</h2>
				
									<div class="elementor-cta__description elementor-cta__content-item elementor-content-item elementor-animated-item--grow">
						If you're looking for a company to help you implement an AI-based solution, reach out to us. We’d be happy to discuss your idea.					</div>
				
									<div class="elementor-cta__button-wrapper elementor-cta__content-item elementor-content-item elementor-animated-item--grow">
					<a class="elementor-cta__button elementor-button elementor-size-" href="https://inero-software.com/contact-us/">
						Contact Us					</a>
					</div>
							</div>
						</div>
				</div>
				</div>
				</div>
				</div>
		<div class="elementor-element elementor-element-961021e e-con-full e-flex e-con e-child" data-id="961021e" data-element_type="container">
				</div>
					</div>
				</div>
				</div>
		<p>The article <a href="https://inero-software.com/assessing-retrieval-augmented-generation-rag-large-language-models-llms-with-deepeval-for-complex-tabular-data/">Assessing Retrieval-Augmented Generation (RAG) Large Language Models (LLMs) with DeepEval for Complex Tabular Data</a> comes from <a href="https://inero-software.com">Inero Software - Software Consulting</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6902</post-id>	</item>
	</channel>
</rss>
