AI development - Inero Software - Software Consulting

LLM Implementation and Maintenance Costs for Businesses: A Detailed Breakdown

Martyna Mul — Wed, 14 May 2025 06:44:35 +0000

When considering the introduction of artificial intelligence into your company, it’s important to understand the costs involved in implementing and maintaining your own LLM. Expenses go beyond just paying for model usage (e.g., token-based API fees) and include a range of factors — from infrastructure to security. Below, we discuss the types of costs associated with using dedicated LLMs and present example calculations for popular models (such as GPT-4, Claude, Mistral, LLaMA, etc.), including business use case scenarios.

More and more companies are considering the use of large language models (LLMs) in their own products and processes. These “dedicated” models can act as intelligent assistants—answering customer questions, analyzing documents, generating reports, and much more. You can read more about it here.

Types of Costs When Using LLMs

Before starting the implementation, it’s important to understand all the components that contribute to the total cost of using a dedicated model.

Infrastructure:

If you’re using models via a cloud API (OpenAI, Anthropic, Google), you only pay for the tokens used. The infrastructure cost is “hidden” on the provider’s side.

If you choose to self-host a model such as Mistral or LLaMA, you’ll need to maintain a GPU server—either locally or in the cloud. For example, renting an instance with an A100 GPU typically costs $1–2 per hour, which amounts to $750–1,500 per month if the server runs continuously. While such an investment can handle a high volume of queries, it may be underutilized at a smaller scale.

Licensing and Model Fees

Commercial models come with licensing or subscription fees. For example, when using the GPT-4 API from OpenAI or Claude from Anthropic, you pay per token used according to the provider’s pricing (we outline token costs in detail later on). On the other hand, open-source models like LLaMA or Mistral are available for free—there are no licensing or token fees. Meta, for instance, released LLaMA 2 under a license that allows businesses to use it freely. However, “free” doesn’t mean zero cost—you’ll still pay for the infrastructure and electricity needed to run the model (as mentioned earlier). It’s also important to check license restrictions: some open models may have specific usage conditions (e.g., restrictions on certain industries).

Model Adaptation and Customization

For an LLM to perform well in a specific company setting, it often requires customization—such as additional training (fine-tuning) on company-specific data or at least the preparation of tailored prompts (known as prompt engineering). This adaptation process can generate significant costs:

- Model Fine-Tuning: Training a model on your own dataset requires computing power (typically GPUs running for many hours) and expert knowledge. For larger models, this can cost anywhere from several thousand to tens of thousands of dollars—factoring in both infrastructure expenses and specialist time. Even fine-tuning a smaller model (e.g., GPT-3.5) via OpenAI’s API can incur significant costs, as it involves processing hundreds of thousands or even millions of tokens during training—billed according to the provider’s token pricing.

- Prompt Engineering: As an alternative or complement to training, you can craft tailored prompts and instructions for the model. While writing prompts itself doesn’t require paid resources, iteratively testing and refining multiple versions consumes tokens (which adds cost when using a cloud-based model) and takes up team time. This can be viewed as either an operational cost or a competence-related expense—specialist time is needed to optimize the model’s behavior for your specific use case.

Operational Costs

After deploying the model, ongoing operational costs come into play. These include monitoring the model’s performance, maintaining efficiency, logging results, applying updates, and fixing potential issues. If you’re using an API, the main operational cost will be the monthly bill for consumed tokens, along with any premium subscription fees (some providers offer subscription plans with usage limits or preferred pricing). If the model is hosted locally, operational costs typically include:

- Electricity consumption – GPU-based models can consume significant amounts of power, leading to substantial monthly energy costs.
- System administration – Time spent by administrators on server maintenance, backups, and updating software components (e.g., AI libraries).
- Infrastructure scaling – As demand grows, additional machines or cloud instances may be needed, resulting in further expenses.
- High availability – If the LLM assistant needs to operate 24/7 without downtime, you may need to invest in redundant resources (e.g., backup servers) or enter into an SLA agreement with your cloud provider.

Team Expertise

Implementing an LLM requires the right expertise within the IT/Data team. If your company lacks AI experience, it may be necessary to train existing employees or hire new specialists—such as an ML engineer or MLOps expert—which adds recruitment or training costs. Alternatively, some companies choose to work with external consultants or service providers to deploy the model. This also incurs costs, usually one-time project fees, which can be significant. It’s also important to account for the time your team spends integrating the model with existing systems (e.g., connecting it to a database or user-facing application). This is a labor cost that’s often overlooked in smaller projects but can have a major impact in practice.

The categories above show that the total cost of owning a dedicated LLM-based solution goes far beyond just the fee for accessing the model. It’s important to consider all these factors before making a decision. In the next section, we’ll look at specific numbers: how much a single prompt costs for various popular models, and what it would take to maintain a simple LLM assistant in two example business scenarios.

Cost of a Single Prompt in Popular LLM Models

Language models are typically billed based on the number of tokens processed. A token is a small piece of text—it may represent a single word or part of a word (for example, 1,000 tokens roughly equals 750 words of continuous text). API providers list prices per 1,000 or 1 million tokens.

Below is a comparison of the approximate cost to process 1,000 tokens using selected popular LLM models:

LLM Model Comparison

LLM Model	Access / License	Cost per 1000 tokens	Notes
GPT-3.5 Turbo (OpenAI)	Cloud API (chat model available, e.g., in ChatGPT)	$0.0015 (input) $0.0020 (output)	Very low cost – 16k tokens + paid upgrade to 128k Good response quality
GPT-4 (8k)	Cloud API (OpenAI)	$0.08 (input) $0.16 (output)	High quality; high cost
GPT-4 Turbo (128k)	Cloud API (OpenAI)	$0.01 (input) $0.03 (output)	Reliable large context (up to 128k tokens) Cheaper (only slightly more than GPT-3.5)
Claude Instant v1.2	Cloud API (Anthropic)	$0.0008 (input) $0.0024 (output)	Fast, lower-cost Claude model (equivalent to GPT-3.5)
Claude 2 (100k)	Cloud API (Anthropic)	$0.008 (input) $0.024 (output)	High-quality model by Anthropic; context up to 100k tokens
Mistral 7B	Open source (free model)	Token cost: $0	Requires self-hosting Alternative to GPT-3.5 – low hardware requirements (can run with <1M tokens)
LLaMA 2 13B	Open source (free model)	Token cost: $0	Self-hosting required Needs stronger hardware (e.g., 2× 24GB GPU) than 7B, but still accessible for many companies
LLaMA 2 70B	Open source (free model)	Token cost: $0	Requires self-hosting Requires expensive infrastructure (e.g., 8× 80GB GPUs) At this scale, costs may match or even exceed GPT-4

Legend: How Token Costs Are Calculated

- Input tokens – words contained in the user’s prompt.
- Output tokens – words generated by the model in response (completion).

For most commercial providers, the cost is charged separately for input and output tokens. For example:

GPT-4 Turbo:

- 1,000 input tokens: $0.03
- 1,000 output tokens: $0.06

If a dialogue contains a total of 1,000 tokens (e.g., 500 input + 500 output), the cost is approximately $0.045.

For simplicity, you can assume that a full interaction of 1,000 tokens costs about $0.09.

By comparison:

- GPT-3.5 Turbo – a similar 1,000-token dialogue costs only about $0.0035 (i.e., 0.35 cents).
- Open-source models (e.g., Mistral, LLaMA) – token costs are $0, since the models run locally. You only pay for infrastructure-related costs (power consumption, server uptime, etc.).

Open-source models (such as Mistral, LLaMA, etc.) are attractive because they come with no fees for the model itself—you can generate any number of tokens without paying the model provider a cent. However, to run these models, you need to maintain your own infrastructure. At a small scale, the cost of renting a machine for a single query may actually exceed the cost of an individual API call to a model like GPT. On the other hand, at a large scale—with many queries per day—open-source solutions can become significantly more cost-effective. In summary, cost-effectiveness depends on the use case, which we’ll explore in the next section.

Example Costs of Implementing an LLM Assistant (100 Queries per Day)

Let’s now consider a practical scenario: your company wants to implement a simple LLM-based virtual assistant that performs one of the following tasks:

- Document analysis – e.g., the assistant reads offers or contracts and extracts key information such as clauses, deadlines, and amounts.
- Customer inquiry handling – e.g., the assistant replies to customer emails with questions about pricing, product availability, technical support, etc.

Let’s assume that:

- The assistant will handle approximately 100 interactions per day.
- Each interaction consists of a prompt and a response, totaling around 2,000 tokens (e.g., 1,000 tokens in the prompt—roughly 750 words or several paragraphs—and 1,000 tokens in the response, or about 750 generated words). This token size covers fairly complex queries and detailed replies.
- On a monthly basis, the assistant will process around 6 million tokens (3,000 interactions × 2,000 tokens = 6,000,000 tokens).

We want to compare the monthly operating costs of such an assistant depending on the choice of model and deployment approach. We’ll present two variants:

- API Variant (Closed Model): We use a commercial model via an API (e.g., OpenAI GPT or Anthropic Claude). We don’t maintain our own servers—costs are limited to token usage, billed according to the provider’s pricing.
- Self-Hosted Variant (Open-Source Model): We use an open-source model (e.g., Mistral or LLaMA) deployed on our own servers. Costs include infrastructure needed to support approximately 100 queries per day—such as cloud GPU instance rental or hardware amortization, plus electricity.

Below is a table comparing estimated monthly costs for several example models under both deployment variants, assuming 6 million tokens per month:

Monthly LLM Cost Comparison

Model (variant)	Estimated Monthly Cost	Comment
GPT-3.5 Turbo (API)	approx. $18 (USD)	Very low cost for this quality level. Estimate: approx. $0.0027/1k tokens → $12 for generating 4M tokens + $6 for prompts → ~$18/month total.
GPT-4 (8k) (API)	approx. $270	Much higher cost for better quality. Example: 8M tokens → cost: 8M × $0.08/1k (input) + $0.16/1k (output) → $270–$540 monthly.
GPT-4 Turbo (128k) (API)	approx. $18	Slightly more expensive than GPT-3.5 due to cheaper input/output token pricing. May even deliver better quality than GPT-4 (8k).
Claude Instant (API)	approx. $20–25	Comparable to GPT-3.5 in cost. Estimate: approx. $0.0021/1k tokens (input+output) → ~$18–25 for 8M tokens (plus potential flat fees).
Claude 2 (API)	approx. $150–200	Cheaper than GPT-4, but still several times more expensive than GPT-3.5. Estimate: $0.032/1k tokens → ~$192 for 8M tokens.
Mistral 7B (open source, self-hosted, 1x GPU)	approx. $300	Cost mainly for maintaining server/GPU. Assumption: 1x 24GB GPU instance – model generates ~30–60 tokens/sec, power usage 100–150W. Actual cost depends on location and usage (electricity + server = ~$300–400/month).
LLaMA 2 70B (open source, self-hosted, multi-GPU)	approx. $1,000+	High cost due to powerful GPU requirements. Typically requires at least 8×80GB GPUs (~$10k–12k hardware + high power consumption). Costs vary based on setup model (on-prem / cloud / GPU provider).
Local model (e.g., LLaMA 13B, GPTQ, Mistral 7B – CPU)	approx. $300–500	Cost includes operation of local server. May be slower than GPT-3.5, but offers more privacy and control. For CPU instance (e.g., 12 cores, 64 GB RAM), monthly cost is mainly for electricity and maintenance.

From the above comparison, several key takeaways can be drawn:

Small-scale usage (100 queries/day) favors API solutions

With relatively low query volume, using a commercial API (OpenAI, Anthropic) is highly cost-effective—especially with lower-priced models like GPT-3.5 or Claude Instant, where monthly costs can be as low as a few dozen dollars. For higher-end models, monthly costs may rise to several hundred dollars. Still, at this scale, running your own GPU server at $300+ per month would be less economical than relying on cloud-based APIs.

Large-scale usage (thousands of queries) changes the equation

If your assistant becomes successful and the number of queries increases by 10x or even 100x, the monthly API bill could grow to thousands or even tens of thousands of dollars. In such cases, investing in an open-source, self-hosted model starts to make financial sense. With a high enough query volume, the per-request cost of running the model locally becomes lower than the API cost—since the purchased or rented hardware is being used more efficiently. In extreme cases of massive scale, some organizations may even consider training their own model from scratch—but this is typically reserved for the largest players with very substantial budgets.

Use Case Matters (Quality vs. Cost Efficiency)

Choosing the right model shouldn’t be based solely on cost—it also depends on the quality of output required for your use case. In a document analysis scenario, precision in extracting information is the top priority. A lower-cost or open-source model may be sufficient here, especially if fine-tuned to the task. A model with 7B–13B parameters can offer adequate performance at a much lower cost. Moreover, when processing sensitive documents (e.g., contracts), running the model locally ensures that the content never leaves your organization—an invaluable benefit from a legal and data privacy standpoint. On the other hand, in customer inquiry handling, where natural language quality, politeness, and contextual understanding are critical, GPT-4 can significantly outperform smaller models. In this case, a company may find it worthwhile to pay more for superior customer experience.

Hidden Costs Around the Project

It’s important to note that the above calculations cover only the technical costs—such as token usage or infrastructure. In practice, there are also “soft” costs to consider, including staff time for preparing the implementation, integrating the model with systems like a CRM or knowledge base, testing, and ongoing iterations and improvements. For example, if the assistant needs to retrieve data from a company’s internal document repository, those documents often need to be organized or cleaned before they can be effectively used by the model.

Cost Example: AI Assistant for Analyzing Emails and PDF Documents

Here we also present the cost breakdown of our assistant based on Google’s Gemini model, which we described [here]. Its task is to automatically analyze incoming emails to identify insurance policies and extract key data from attached PDF documents—such as policy number, insured party address, or payment confirmation.

Average Token Count per Email:

- Input: 3,500 tokens
- Output: 220 tokens

Analyzing 100 emails with attachments using the Gemini 2.0 Flash model costs approximately $1.50.

Summary

Can We Afford Our Own “ChatGPT” in the Company? As we’ve seen, the answer is: it depends—primarily on the scale of usage and quality requirements. The key lies in selecting a model and deployment method that aligns with your specific needs. An iterative approach is often the most practical: start with a lower-cost model or API, evaluate the results, and scale up to a more powerful model or self-hosted solution as the project matures. Regardless of the path you choose, careful planning and cost monitoring across all categories is essential. We hope this comparison helps you make informed decisions and prepare a realistic budget for implementing a dedicated LLM in your organization.

If you’re considering implementing an assistant in your company, it’s worth finding answers to the following questions:

- Do I need high-quality responses (e.g., GPT-4), or is an approximate answer sufficient (e.g., Claude Haiku, Gemini Flash)?
- Am I processing sensitive data (e.g., customer documents)?
- Do I have an IT team capable of hosting a model in-house?
- What is the expected number of queries per day/month?
- Is it more cost-effective to maintain my own infrastructure, or should I pay for API access?

For small to medium-scale applications, the cost of using a dedicated LLM can be quite reasonable. Thanks to cloud-based services, it’s possible to get started for just a few dozen dollars per month with models like GPT-3.5 or Claude Instant—an excellent option for experimentation and early prototypes. If you need top-tier performance, such as what GPT-4 offers, you’ll need to account for higher costs. However, even a few hundred dollars per month can be justified if the business value is significant—for example, by automating tasks that would otherwise require many hours of manual work.

On the other hand, for large companies planning intensive AI use, costs can grow exponentially—making it worth considering open-source options and greater investment in in-house infrastructure. Open models like LLaMA or Mistral offer freedom from per-token fees, but shift the cost burden to hardware and staffing. They become cost-effective when operating at scale or when full control over data is a top priority.

Looking to Bring AI Tools into Your Company?

We offer comprehensive technology support in the field of artificial intelligence and AI agents. Tell us about your idea!

Artykuł LLM Implementation and Maintenance Costs for Businesses: A Detailed Breakdown pochodzi z serwisu Inero Software - Software Consulting.

Chatbot, Agent or AI Assistant? Find Out Which Solution Is Best for Your Business

Marta Kuprasz — Thu, 08 May 2025 08:57:21 +0000

Artificial intelligence and Large Language Models are buzzwords heard in nearly every industry. Many companies are wondering how to use them safely and which solution will be the most effective. There are plenty of options—and they’re often hard to tell apart. In this article, we break them down in a clear and easy-to-understand way.

AI can take on many roles in a company—as a chatbot, assistant, agent, data analysis tool, content generator, or knowledge search engine. So how can you choose the solution that best fits your employees’ needs? It helps to understand what each option has to offer.

Chatbot – answers questions, provides explanations, and handles requests

This is the most common use of AI in areas such as customer service and sales. An AI chatbot based on a large language model, such as ChatGPT, can hold natural conversations, understand the context of inquiries, and deliver accurate answers—24/7, in multiple languages, and without human involvement.

These solutions are typically implemented on websites, in messaging platforms (like Messenger or WhatsApp), or within helpdesk systems, where they assist with answering questions, tracking orders, or providing product information. As a result, they significantly automate customer service, reduce operational costs, and improve customer satisfaction ratings.

For the purposes of this article, we define a chatbot as an AI interface primarily intended for external users—in other words, it operates “outside the company.” This definition distinguishes it from AI agents, which perform more complex tasks within internal processes by integrating with systems, databases, or APIs.

https://www.incone60.eu/seastat

AI Agent – a tool designed to carry out specific tasks

Unlike a chatbot, which interacts with external users, an AI agent operates within the organization and supports employees by automating specific business processes. It’s not a one-size-fits-all tool—it’s built with a clearly defined purpose in mind, such as document processing, data analysis, or integration with ERP systems.

Thanks to large language models like Gemini or Claude, an AI agent can understand context, make decisions, and trigger specific actions—without human input. It can run in the background, process data from multiple sources, manage files, or handle email inboxes. Each AI agent is tailored to the company’s individual needs and specific tasks. Only then can it offer real value instead of becoming just another generic tool.

Want to see how this works in practice?

Check out our case study: Meet your personal AI agent-a case study for a freight forwarding company – where we describe how we built an agent integrated with an email inbox.

AI Assistant – supports users in daily work by operating contextually and “in the background”

Unlike a chatbot that answers questions or an agent that automates a specific process, an AI assistant is a tool that works alongside employees in real time—it understands context, suggests next steps, and makes tasks easier within familiar applications.

It’s typically integrated into a specific work environment, such as a word processor, spreadsheet, CRM, or project management tool. The assistant doesn’t replace the user—it actively supports them in making decisions, writing, analyzing data, or planning.

AI assistants like GitHub Copilot, Notion AI, or Google’s Workspace assistant show how this technology can genuinely boost team productivity and reduce time spent on routine tasks. From a business perspective, a well-designed assistant can improve work quality, reduce errors, and make onboarding new employees easier.

Other Business Applications of Large Language Models

The possibilities go far beyond chatbots, assistants, or agents. These models can take on specialized roles, supporting tasks such as document processing, data analysis, or content creation. They’re increasingly used to automatically summarize reports, extract information from unstructured sources (like emails, PDFs, or scanned forms), or answer natural-language questions based on internal documentation.

LLMs can also assist marketing teams by generating suggestions for ad copy, product descriptions, or sales messages tailored to the company’s style. In analytics departments, they provide faster access to data—generating database queries, interpreting results, and presenting insights in a way that’s easy for non-technical users to understand. These applications often don’t require building a new tool from scratch, but rather integrating the AI model into existing company systems. This way, the technology supports specific tasks—right where it’s needed.

AI Models and Data Security

Business owners and managers still approach AI tools with caution, mainly because they’re unsure how to ensure the security and confidentiality of processed data. We’ve explored these topics in previous publications that are worth reviewing.

In the article “AI User Privacy: An Analysis of Platform Policies”, we outlined the data privacy and model training policies followed by major AI providers such as OpenAI, Google Gemini, Microsoft’s Azure OpenAI, and Anthropic’s Claude.

For those considering an on-premise solution, we recommend the blog post “Top Lightweight LLMs for Local Deployment” There, we reviewed several top open-source lightweight LLMs and explained how to run them on a local Windows machine—even with limited GPU resources.

Choosing the right AI tool for your company depends primarily on the goal it’s meant to achieve. A chatbot works best where quick and accessible customer service is key. An AI agent can automate repetitive internal processes and improve information flow between systems. An AI assistant provides day-to-day support for employees—offering suggestions, summaries, or preparing data for further use.

Large language models also allow integration with existing processes—without the need to build a dedicated tool from scratch. However, implementing AI-based technology requires a well-thought-out decision, taking into account both efficiency and data security. If you’re looking to adopt AI in your company and need an experienced partner to guide you through the process, get in touch with us.

Bring AI into Your Business

We provide professional consulting and end-to-end implementation of tools based on large language models.

Artykuł Chatbot, Agent or AI Assistant? Find Out Which Solution Is Best for Your Business pochodzi z serwisu Inero Software - Software Consulting.

AI User Privacy: An Analysis of Platform Policies

Martyna Mul — Wed, 30 Apr 2025 08:35:35 +0000

Ever wondered where your data goes when you interact with AI cloud platforms? Or is it used to train future models? In this article, we’ll break down the data privacy policies of top AI platforms. You will also learn what to do to ensure your data is not used for training Large Language Models (LLM).

Major AI cloud providers have become increasingly transparent about their data usage policies – especially when it comes to training models. While most platforms, particularly those offering enterprise-level services, do not use your inputs and outputs for training by default, the fine print matters. Understanding how these services handle your data – and how you can maintain control – is essential.

In this article, we’ll break down the data privacy and model training policies of top AI platforms, including OpenAI, Google Gemini, Microsoft’s Azure OpenAI and Anthropic’s Claude. You’ll learn:

- How AI platforms use your data and whether your data is used to train models by default
- How to prevent AI from using your data opt, if needed
- Where your data is stored (data residency), and
- What compliance measures (like GDPR) apply

Adopting AI isn’t just about prompt engineering or model performance. It’s also about knowing where your data goes—and how to ensure it stays under your control.

Here’s what you need to know:

OpenAI – Data Usage and Privacy

OpenAI treats your data differently based on how you interact with its services:

ChatGPT App (Web/Mobile)

When you chat with ChatGPT, your conversations may be used to train AI models – unless you manually opt out. To prevent your data from being used:

- Go to Settings → Data Controls → Improve the model for everyone and toggle it off.
- Even with the opt-out, OpenAI stores chats for 30 days for abuse monitoring before deletion.

OpenAI API and ChatGPT Enterprise

If you’re a developer or a business using OpenAI’s API or ChatGPT Enterprise, there’s no need to opt out. By default, OpenAI does not use API or Enterprise data to train its models, and your data stays private. You don’t need to do anything to opt out – it’s already protected. You can choose to share data to help improve the model, but only if you want to.

Data Residency

OpenAI’s servers are mostly based in the United States, and currently, if you’re using the API directly, you can’t choose where your data is stored. That means your data is processed within OpenAI’s own infrastructure – protected by strong security, but not necessarily hosted in your country.

However, there’s some progress for enterprise users. OpenAI recently introduced an option for eligible enterprise API customers that allows data to be stored in Europe, provided there’s a specific agreement in place.

If regional data residency is important for your business – say, for GDPR or internal compliance – you might want to consider using Azure OpenAI, which hosts OpenAI’s models on Microsoft’s cloud. With Azure, you can choose a region like Western Europe or Asia, and all data processing and storage will stay within that geography.

We’ll dive into Azure more in the next section – but in short: OpenAI handles your data securely, but for strict control over where it lives, a partner cloud service like Azure may be a better fit.

Google (Gemini) – Google’s Approach to Your Data

Google’s foray into generative AI includes Gemini, a next-generation model that powers products like Google Gemini (the chatbot) and various enterprise AI offerings on Google Cloud. Here’s how they handle your data:

Gemini App

By default, Google does save your Gemini chat history to your account (much like search history) and may use it to improve their service. However, Google provides a “Gemini Activity” setting to control this.

To manage this:

- Visit Gemini Activity settings.

- Pause Gemini Activity to stop saving chats and prevent them from being used in AI model training data sources.

- You can also delete existing conversation history.

Turning off Gemini Activity means your new chats won’t be used to improve their machine learning services, nor will they be seen by human reviewers, unless you explicitly submit them as feedback. This gives regular users a way to opt out, similar to ChatGPT’s opt-out toggle.

To stop saving your conversations, go to the Activity tab and toggle Gemini Apps Activity. You can also delete your past conversations.

API and Vertex AI

If you’re using Google Cloud’s Vertex AI platform:

- Your prompts and outputs are not used to train AI models without explicit permission.

- Data may be cached briefly (up to 24 hours) for performance but remains within your selected geographic region.

- Businesses can opt for a zero-retention policy for maximum privacy.

Data residency

Data residency is a strong point for Google: you can choose which geographic region your AI service runs in (e.g. EU or US data centers), and Google will process and store data in that region to meet any data localization requirements.

Microsoft Azure OpenAI – Enterprise Data Protection by Design

Training Policy

Microsoft’s Azure OpenAI Service lets companies use OpenAI’s models through the trusted Azure cloud platform. Privacy is a major selling point here. Microsoft is very explicit: any data you send into Azure OpenAI is not used to train the underlying models or improve Microsoft’s or OpenAI’s services .

Microsoft’s Azure OpenAI Service essentially hosts OpenAI’s models (GPT-4, GPT-3.5, etc.) within the Microsoft Azure cloud. Microsoft has specifically designed this service for enterprises that require strong privacy protections. Key aspects are:

- Any data you input into Azure OpenAI – prompts, completions (model outputs), embeddings, fine-tuning data – is not used to train the AI models.

- Your inputs and outputs “are NOT available to other customers, are NOT available to OpenAI, and are NOT used to improve OpenAI models”.

- Microsoft only retains data as needed to provide the service and monitor for misuse. In fact, prompts and outputs on Azure are stored only temporarily (up to 30 days) by default, and solely for abuse detection purposes. After 30 days, those prompts are deleted. If even this temporary storage is a concern (say, for ultra-sensitive data), Microsoft offers a process called “modified abuse monitoring” where you can request that even the 30-day storage be bypassed, meaning no prompts are retained at all. Typically, you’d need approval for this exception, but it’s an option for high-security scenarios.

Data Residency

Because it’s on Azure, you also benefit from easily choosing the region and complying with data residency requirements. When setting up Azure OpenAI, you deploy the service to an Azure region (for example, East US, West Europe, Southeast Asia, etc.). All processing and data storage for inference will occur within that region or its geographical boundary. So, if you deploy in Western Europe, your data isn’t leaving Europe – crucial for GDPR compliance. Azure itself meets numerous compliance standards (SOC 2, ISO 27001, etc.), and these extend to Azure OpenAI as an Azure service.

Anthropic (Claude) – A Privacy-First AI Assistant

Training Policy

Anthropic, the company behind the Claude AI assistant (Claude 2 and newer versions), has emphasized a privacy-conscious approach from the outset. Anthropic adopts an opt-in approach:

- By default, Anthropic does not use your conversations or data to train its models. This applies to both their commercial offerings (Claude for Work, Anthropic API) and consumer products (Claude Free, Claude Pro) – your prompts and Claude’s responses aren’t automatically used for model training.

- They only use data if you deliberately opt-in, such as by providing explicit feedback. For instance, if you click a thumbs-up/down in a Claude interface or send data to their feedback channels, you’re essentially saying “you can learn from this”.

For enterprise clients, Anthropic offers Claude Team/Enterprise, which not only guarantees no training on your data but also provides admin controls. One such feature is custom data retention settings. By default, Anthropic’s systems might retain your inputs/outputs indefinitely for your account (though not for training). However, Claude Enterprise admins can set a retention policy – for example, you might set it to delete all conversation data after 30 days, 60 days, etc., with 30 days being the current minimum. These controls aim to support compliance with regulations like GDPR.

Data Residency

Anthropic is a newer player, and currently, when you use their API directly, you don’t explicitly choose a data region – it’s likely hosted in the US by Anthropic (or possibly through cloud providers like AWS in the US region). However, Anthropic models are also available through partners, which can help with data residency. For example, Anthropic’s Claude is offered via Amazon Bedrock (AWS’s AI service) and via Google Cloud Vertex AI. If you use Claude through one of these platforms, you can take advantage of AWS’s or Google’s region controls.

Conclusion

Understanding the data collection practices of LLM providers is crucial for AI compliance, customer trust, and corporate governance. Whether you’re focused on compliance, customer trust, or internal data governance, these insights help you make informed decisions. Choose providers that align with your privacy values – and always review your settings.

Here’s a comparison of major platforms:

Provider	Default Data Training	Web App Setting	Data Residency Options	GDPR/CCPA Compliance	Privacy Policy
OpenAI	No (API)	Opt-out available	No; (unless used via Azure Microsoft)	Yes	Consumer privacy
Google	No (Cloud + Gemini)	No training by default	Broad region control	Yes	Enterprise privacy, Gemini privacy, Vertex AI
Azure	No	N/A	Full regional control	Yes	Azure, OpenAI privacy
Anthropic	No	No training by default	No (unless used via partners)	Yes	API users, Claude.ai users

For maximum privacy and control, local deployment (on-premises models) is always an alternative. This avoids cloud storage concerns entirely. You can read more about local deployment here.

Let's talk about AI agents

Ready to bring AI into your business? Let us help you get started.

Artykuł AI User Privacy: An Analysis of Platform Policies pochodzi z serwisu Inero Software - Software Consulting.

Top Lightweight LLMs for Local Deployment

Martyna Mul — Thu, 17 Apr 2025 09:50:46 +0000

Running large language models (LLMs) on your own hardware has become increasingly feasible thanks to lightweight LLMs—models with relatively small parameter counts that deliver strong performance without requiring server-grade GPUs. In this post, we’ll explore several top open-source lightweight LLMs and how to run them on a local Windows PC—whether CPU-only or with a limited GPU—for document processing tasks. We also include a benchmark comparing the models in terms of accuracy and inference speed, helping you choose the right model for your local environment and use case.

What Are Lightweight LLMs (and Why Run Them Locally)?

“Lightweight” LLMs are models typically in the range of ~1–8 billion parameters – far smaller than GPT-3 class models – often optimized to run on a single GPU or even CPU. They are usually released as open models with freely available weights. These models trade some raw power for efficiency, but recent research and clever engineering (better data, distilled training, efficient attention mechanisms, etc.) have dramatically improved their capabilities. Many can now match or beat much larger models on specific benchmarks.

Local deployment of such models is valuable for several reasons:

- Privacy & Security: All data stays on your machine, which is crucial for confidential documents like insurance contracts. You’re not sending sensitive text to a third-party API.

- Cost Savings: Once downloaded, local models run for free – no API usage fees or cloud compute bills. This can make a big difference if you process large volumes of documents regularly.

- Latency & Offline Access: Local inference eliminates network latency. Responses can be near-instant on a GPU, and you can operate entirely offline. This is useful for on-site workflows or when internet access is restricted.

- Customization: With local models you have full control – you can adjust parameters, prompts, or fine-tune models to better fit your domain (e.g. insurance data) without vendor limits.

In short, lightweight LLMs put AI capabilities directly in your hands, on hardware you own. Next, we’ll compare some of the leading open models that are well-suited for local document processing.

Comparing Top Lightweight LLMs

Lightweight open-source large language models (LLMs) are becoming a practical choice for organizations looking to run AI workloads locally. They offer a strong balance between performance, speed, and resource requirements—making them ideal for document summarization, extraction, and classification without relying on cloud infrastructure.

We’ll focus on the following open-source models (each with downloadable checkpoints) that have a good reputation for quality relative to their size:

- Llama 3.1 – 8B parameters (Meta AI)
- StableLM Zephyr – 3B parameters (Stability AI)

- Llama 3.2 – 1B/3B parameters (Meta AI)

- Mistral – 7B parameters (Mistral AI)

- Gemma 3 – 1B and 4B variants (Google DeepMind)

- DeepSeek R1 – 1.5B and 7B variants (DeepSeek AI)

- Phi-4 Mini – 3.8B parameters (Microsoft)

- TinyLlama – 1.1B parameters (community project)

These models range from very small (under 1 GB on disk) to mid-sized (~5 GB). All can be run in inference mode on a 16 GB GPU (often even in half-precision or 4-bit quantized form) and many are workable on CPU with enough RAM and patience. Table 1 summarizes their characteristics:

Model	Size on Disk (quantized)	Max Context	Licence
Llama 3.1 (8B)	4.9GB	128k tokens	Open-source
StableLM Zephyr (3B)	1.6GB	4k tokens	Only non-commercial use
Llama 3.2 (3B)	2.0GB	128k tokens	Open-source
Mistral (7B)	4.1GB	32k tokens	Open-source (Apache 2.0)
Gemma 3 (4B)	3.3GB	128k tokens	Open-source
Gemma 3 (1B)	0.8GB	32k tokens	Open-source
DeepSeek R1 (7B)	4.7GB	128k tokens	Open-source (MIT licence)
DeepSeek R1 (1.5B)	1.1GB	128k tokens	Open-source (MIT licence)
Phi-4 Mini (3.8B)	2.5GB	128k tokens	Open-source
TinyLlama (1.1B)	0.6GB	2k tokens	Open-source

Table 1: Lightweight LLMs for local use – model sizes and maximum context window.

Notes: “Max Context” is the maximum sequence length (tokens) the model can process in one go.

Next, let’s look at each model’s pros and cons, especially in the context of document tasks:

- Llama 3.1 (8B): Powerful general-purpose model; moderate size and strong multilingual capabilities. Heavy for CPU-only systems; requires chunking for long documents.

- StableLM Zephyr (3B): Ultra-lightweight, good for basic QA/extraction. Limited by small parameter count and commercial license restrictions.

- Llama 3.2 (3B): Excellent summarization and retrieval; long context support (128k tokens). Smaller size affects complex reasoning accuracy.

- Mistral (7B): Best overall performer for its size; highly efficient inference. Ideal for detailed summarization tasks.

- Gemma 3 (4B/1B): Offers multimodal capabilities and extensive multilingual support. The 4B model balances capability and speed; the 1B model best suited for simple tasks.

- DeepSeek R1 (7B/1.5B): Balanced efficiency and comprehension for general NLP tasks; limited complex reasoning compared to Mistral.

- Phi-4 Mini (3.8B): Exceptional reasoning, math, and logical capabilities; perfect for analytical document processing. English-focused.

- TinyLlama (1.1B): Extremely lightweight; suitable for basic text extraction/classification tasks. Limited contextual understanding.

The models reviewed above cover a wide range of sizes and capabilities. Larger variants like Llama 3.1 and Mistral perform well on complex summarization and multilingual tasks but are less suited for CPU-only setups. Mid-sized models such as Llama 3.2 and Gemma 3 (4B) handle long inputs efficiently with reasonable performance. Smaller models, including TinyLlama and StableLM Zephyr, are lightweight and fast, making them practical for basic extraction or classification tasks.

Models Benchmarking: Document Extraction and Summarization

Here we outline a simple model benchmarking plan covering two common document-processing tasks:

Information Extraction: We evaluated how well each model can extract specific fields from a policy or certificate. Specifically, we prompted each model to find the policy number, insured name, VAT ID, address and insurance period in the document text and return the structured output – clean JSON response with all the needed values.
Summarization: Each model generated a concise summary of an insurance policy, covering key points such as coverage, exclusions, and conditions.We rated the summaries on clarity, correctness, factual accuracy and readability and penalized heavily fabricating information.

We used 11 documents and ran all tests using Ollama (you can read about running model with Ollama here). The benchmarks were performed on a PC equipped with an NVIDIA GeForce RTX 2060 and 6 GB VRAM. To ensure consistent results, each model was run with temperature set to 0 for the extraction task (to produce deterministic outputs), and with a fixed temperature of 0.7 for summarization. For the extraction task, we also used structured outputs:

 

{ 
        "model": "deepseek-r1:7b", 
        "prompt": "You are an assistant that extracts insurance-related information from a given input text. You must extract and return only the following fields: - policy_number,- insurance_period,- insured (company or person name),- nip (tax identification number),- address (of the insured). Return the output as a **clean JSON object** — not as a string, not inside quotes, and without any commentary. If a field is missing, use 'Not found'. Document text: ", 

    "stream": false, 
    "format": { 
    "type": "object", 
    "properties": { 
      "policy_number": { 
        "type": "string" 
      }, 
      "insurance_period_start": { 
        "type": "string" 
      }, 
      "insurance_period_end": { 
        "type": "string" 
      }, 
      "insured": { 
        "type": "string" 
      }, 
      "insured_nip": { 
        "type": "string" 
      }, 
      "insured_address": { 
        "type": "string" 
      } 
    }, 
    "required": [ 
      "policy_number", 
      "insurance_period_start",  
      "insurance_period_end", 
      "insured", 
      "insured_nip", 
      "insured_address" 
    ] 
  } 
}

Examples of insurance certifacates.

The table below presents the benchmark results. Extraction accuracy refers to the number of documents (out of 11) where the model successfully extracted all key fields. Token/sec indicates the model’s inference speed — how quickly it generates responses.

Model	Summarization	Extraction Accuracy	Tokens/sec
Llama 3.1 (8B)	High-quality, no hallucinations	10/11	13.49
StableLM 3B	Average quality, typos/hallucinations	4/11	56.51
Llama 3.2 (3B)	Concise yet comprehensive summary, no hallucinations	8/11	49.49
Mistral 7B	Extensive summary, factually correct	8/11	29.01
Gemma 3 4B	Concise yet comprehensive summary, no hallucinations	10/11	13.37
Gemma 3 1B	Concise yet comprehensive summary, no hallucinations	4/11	73.46
DeepSeek 7B	Concise yet comprehensive summary, no hallucinations	6/11	16.39
DeepSeek 1.5B	Very poor, frequent hallucinations/errors	0/11	66.45
Phi-4 Mini 3.8B	Very concise summaries, factually correct	9/11	39.31
TinyLlama 1.1B	Poor quality, severe hallucinations	2/11	107.34

Table 2: Benchmarking results.

This scatterplot visualizes the trade-off between extraction accuracy and inference speed (measured in tokens per second)

The benchmarking results reveal significant variations among the tested models.

- Bottom-right models – Llama 3.1 (8B), Gemma 3 (4B), and Phi-4 Mini (3.8B) – excel in summarization quality and extraction accuracy, consistently providing concise and accurate outputs. Phi-4 Mini seems to offer a good trade-off between speed and accuracy.

- Mistral 7B, DeepSeek 7B, Llama 3.2 generate detailed and informative summaries, though their extraction performance is more moderate.

- On the other hand, smaller models (on the top-left side of the chart) like StableLM Zephyr (3B), Gemma 3 (1B) and TinyLlama (1.1B) show significantly weaker extraction accuracy and are prone to frequent hallucinations. However, they benefit from faster inference times. Their limited context windows (e.g., 4k tokens) may contribute to these shortcomings. Overall, they may be suitable for only very basic tasks.

Choosing the Right Model for Your Needs

When selecting a language model for document extraction or summarization, it’s all about balancing accuracy, speed, and hardware constraints. Below is a quick breakdown to help you pick the best fit—whether you need high precision, fast inference, or something lightweight for basic tasks.

- High Accuracy & Reasonable Speed: Choose Phi-4 Mini (3.8B), Gemma 3 (4B), or Llama 3.1 (8B) for robust extraction and summarization accuracy.

- Fast Inference & Moderate Accuracy: Opt for Llama 3.2 (3B) or StableLM Zephyr (3B) for simpler tasks on limited hardware.

- Balanced Performance (Accuracy-Speed Tradeoff): Mistral (7B) provides strong general-purpose capability suitable for detailed document summarization tasks.

- Low Resource Environments (Basic Tasks): Consider TinyLlama (1.1B) for quick extraction or classification on minimal hardware if accuracy isn’t critical.

Conclusion

Lightweight LLMs are increasingly viable solutions for local deployment, particularly in document-intensive industries such as insurance. Models such as Phi-4 Mini, Gemma 3 (4B), and Mistral 7B provide strong performance in summarization, extraction, and classification tasks. Carefully balancing model size, inference speed, and accuracy ensures optimal outcomes, empowering organizations with affordable, private, and responsive AI solutions directly on owned hardware.

This might interest you

Optimization of Back-Office Processes with AI Agent Implementation: A Practical Example

Read the full text

Artykuł Top Lightweight LLMs for Local Deployment pochodzi z serwisu Inero Software - Software Consulting.

How to Prepare Your Company for AI Agent Implementation

Marta Kuprasz — Tue, 08 Apr 2025 08:45:46 +0000

Implementing an AI agent in a company is not only a technological challenge but also a strategic one. As more businesses consider using artificial intelligence in their daily operations—from customer service to document analysis—successful implementation requires careful planning. This article explains what to focus on before deploying an AI agent, which areas of the business need to be well-prepared, and how to avoid common mistakes.

There are many areas where AI can be helpful. From automating routine tasks, supporting customer service and data analysis, to streamlining decision-making processes and creating intelligent assistants that support team workflows. The potential is enormous—but the key lies in properly preparing the organization for this change.

Stages of AI Assistant Implementation

The process of implementing an AI assistant in an organization can be divided into several stages, each requiring specific actions. From analyzing business needs, selecting the right language model, and preparing the infrastructure, to integrating with existing systems and testing—each step impacts the overall effectiveness of the solution.

The key stages are:

Needs analysis and readiness assessment
Data and content preparation
Solution design
Assistant development and configuration
Testing and pilot phase
Deployment and maintenance

Needs analysis and readiness assessment

To ensure the best results from implementing an AI agent, start by asking yourself: which tasks and areas have the most potential for optimization through the use of artificial intelligence?

When looking for an answer to this question, it’s worth carefully analyzing your company’s current structure, processes, and employee responsibilities. This will help identify so-called “bottlenecks” that may affect the quality of services provided. These might include, for example:

- long response times to quote requests
- teams overloaded with routine tasks
- lack of consistency in customer communication
- manual processing of documents and data
- difficulties in quickly accessing internal company knowledge

Based on this analysis, you’ll be able to identify areas for improvement as well as the people who will directly benefit from the support of AI assistants.

The second area that should be reviewed is the existing infrastructure. Implementing an AI assistant doesn’t require a large amount of hardware. If the company doesn’t want to invest in new machines, it can opt to use cloud services such as Azure, AWS, or Google Cloud.

Data is a crucial part of the preparation process. To fully leverage the potential of dedicated AI solutions, it’s important to understand that training the model behind the assistant requires datasets stored in digital form. These should be well-organized and kept in a central repository or database. The less structured the data, the higher the cost of implementing the assistant—and the greater the risk that the solution won’t meet expectations.

Data and content preparation

At this stage, it’s essential to gather all materials that contain important company knowledge—this may include PDF, Word, and Excel documents, website content, FAQ sections, emails, or data from databases.

Next, the collected information needs to be properly prepared—organized, cleaned of unnecessary content (e.g., unreadable PDFs), standardized where possible, and exported to CSV or JSON files (e.g., emails).

In some cases, such as when planning further model customization (fine-tuning), it will also be necessary to label the data or prepare a dedicated training set in the form of instructions and expected responses, for example:

{"prompt": "What documents are required to sign an OCS agreement?", "response": "The following documents are required to sign an OCS agreement: ..."}

Solution design

At this stage, decisions are made about the technical design of the solution. It’s important to define what type of assistant will best meet the company’s needs—whether it’s a simple chatbot answering questions, a more advanced assistant with access to company knowledge (so-called RAG – Retrieval-Augmented Generation), or an agent capable of independently performing specific tasks such as making bookings, generating reports, or sending emails.

The next step is selecting the appropriate technologies, including the large language model (LLM) that will power the assistant—such as GPT-4, Claude, Mistral, LLaMA, or Gemini—depending on specific needs and requirements related to privacy, cost, and integration capabilities.

Finally, it’s worth preparing a list of functions the assistant should perform and planning integration with other systems used in the company—such as the CRM, knowledge base, or email.

Assistant development and configuration

At this stage, both the technical backend and the user-facing part of the assistant (frontend) are developed. This could be, for example, a chat interface on the website, a button that launches the assistant in an application, or a widget integrated with tools like Slack. You can read more about how AI agent integration with the Slack communication platform can look here >>LINK

In parallel, the selected language model is deployed—via services such as Azure OpenAI, OpenAI API, Anthropic (Claude), Google Vertex AI (Gemini), or locally using open-source models like LLaMA, Mistral, or Mixtral.

If the assistant is meant to use internal company knowledge, a RAG (Retrieval-Augmented Generation) mechanism needs to be configured—enabling it to search and match relevant documents to user queries.

Finally, integrations with other systems—such as CRM, ticketing systems, or email—are implemented, allowing the assistant to meaningfully support the team’s day-to-day work.

Testing and pilot phase

After implementation, thorough testing of the solution is essential. The first step is functional testing—checking whether the assistant correctly understands user intent, responds in line with company documentation, and handles different types of queries appropriately.

The next phase is testing with end users (UAT – User Acceptance Testing), which helps assess how well the assistant performs in real-world scenarios and whether it meets employees’ expectations.

Based on feedback and observations, iterative improvements are made—such as adjusting responses, adding new documents to the knowledge base, or refining prompts and the agent’s logic. This phase is often repeated several times until a satisfactory level of quality is achieved.

Deployment and maintenance

After completing the testing phase, the assistant is deployed to the target infrastructure—this may be a public cloud (e.g., Azure, AWS, GCP), on-premise servers, or a hybrid solution, depending on security and availability requirements. More about this is covered later in the article.

It’s also necessary to set up monitoring, which allows you to track things like token usage, query frequency, error rates, and the quality of generated responses. This enables quick issue resolution and cost optimization.

In daily use, it’s important to keep the data up to date—adding new documents, removing outdated information, and updating the knowledge base the assistant relies on.

Over time, as business needs evolve, it may be worth considering retraining or fine-tuning the model—e.g., every few months—to better align it with the organization’s specific context.

Finally, it’s important to provide technical support and user assistance to ensure the solution is not only technically reliable but also convenient and intuitive for everyday use.

Data privacy

In the “Deployment and maintenance” section, we discussed the available options for choosing the infrastructure on which the AI agent will be deployed.

Each solution has its pros and cons. Choosing an on-premise setup gives you full control over the data, but it requires a dedicated machine with specific parameters.

Another option is using a public cloud service, such as Azure. Microsoft clearly states that data submitted to the Azure OpenAI service is not used to train or improve OpenAI or Microsoft models (source).

According to Microsoft, prompts and responses are not shared with other customers or OpenAI. Azure operates in full isolation mode: when using GPT-4 on Azure, no information from your conversations is shared with OpenAI LLC. Microsoft has confirmed this in a Data Processing Addendum (DPA).

AI decision accountability

It’s important to remember that formal and legal responsibility for the outcomes of an AI agent’s actions and the data it processes lies with the entity that implemented and oversees the solution—most often.

the organization (e.g., the company that deployed the assistant),
the system administrator,
the individual making decisions based on AI suggestions (e.g., a customer service representative, recruiter, or doctor).

How to reduce risk?

Human-in-the-loop (HITL) – A human must approve important decisions, while AI only supports the process (e.g., the assistant drafts a response, but a person approves it).
Clear disclaimers and warnings – The AI should inform users: “I am an AI assistant – please verify my responses before making a decision.”
Source verification – The AI assistant should, where possible, cite sources for its answers or indicate when it doesn’t know rather than guessing. Using RAG enables precise control over the knowledge base.

Summary

The process of implementing an AI agent must be well-planned and carefully considered. It may seem challenging at first, but with proper preparation, it can deliver long-term benefits. If you need support, feel free to contact us.

AI Agent in Your Company?

Write to us and find out how an AI Agent can support your company.

Contact

Artykuł How to Prepare Your Company for AI Agent Implementation pochodzi z serwisu Inero Software - Software Consulting.

Deploying LLMs Locally: A Guide to Ollama and LM Studio

Martyna Mul — Fri, 04 Apr 2025 08:53:42 +0000

Local deployment of Large Language Models (LLMs) is becoming increasingly popular among developers, tech enthusiasts, and professionals in industries like insurance and transport. Unlike cloud-based APIs, local LLM deployment offers greater privacy, offline accessibility, and complete control over resource optimization and inference performance.

Running models like Llama 2 or Mistral directly on your hardware means your data stays on your machine — ideal for privacy-sensitive tasks such as processing insurance documents or working with proprietary transport data. There are no recurring API costs, and the performance depends solely on your system. Whether you’re building a custom chatbot, agent, an AI-powered code assistant, or using AI to analyse documents offline, local deployment empowers you to experiment and innovate without relying on external services.

In this guide, we’ll explore two powerful tools that make this possible: Ollama and LM Studio. We’ll walk through installation, usage, and customization, helping you pick the best option for your goals.

Getting Started with Ollama (CLI Tool)

Ollama is a lightweight, open-source command-line tool for running LLMs locally. It acts as a model manager and runtime, making it easy to download and execute open-source models (like Llama 2, Mistral, CodeLlama, etc.) on your machine. Ollama is available for macOS, Linux, and Windows, and it includes a local REST API for integration into applications.

1. Install Ollama on Your System: Download the installer for your platform from the official Ollama website or use a package manager.

On Windows, download the OllamaSetup.exe from the website and run it. On Linux, you can install Ollama with one command:

curl -fsSL https://ollama.com/install.sh | sh

After installation, open a terminal/command prompt and verify it’s installed by checking the version:

ollama --version

This should display the installed Ollama version, confirming it’s ready to use, e.g.:

ollama version is 0.6.2

2. Download an LLM Model (“Pull” a Model): Ollama has a built-in model library. You can search their catalog on the website or simply pull a known model by name. For example, to download the 7B parameter Llama 2 chat model, run:

ollama pull llama2:7b-chat

This command fetches the model weights to your machine (it may take a while, as models are multiple GBs in size). You only need to pull a model once; afterward it’s stored locally. You can list all downloaded models with ollama list if needed.

3. Run the Model Locally: Once downloaded, you can execute the model with the ollama run command. This will launch an interactive session where you can enter prompts and get responses. For example:

ollama run llama2:7b-chat >>> What is the capital city of Poland?

After running the above, Ollama will load the model and you’ll see an >>> prompt. You can then type your questions or instructions. The model (here Llama 2 7B chat) will generate a response to each prompt. For instance, you might ask “What is the capital of France?” and get an answer like “Paris is the capital of France.” printed in the terminal. Internally, the first run may take a bit to initialize, but subsequent prompts are answered interactively. Tip: You can also pass a one-off prompt directly in the command, e.g. ollama run llama2:7b “What is the capital city of Poland?“ will output a single response and return to the shell.

You can also start Ollama as a background server with ollama serve. This enables the REST API on localhost:11434, which developers can use to integrate the model into apps via HTTP calls. You can ask the model by sending POST request, e.g.:

curl http://localhost:11434/api/generate -d '{ 
  "model": "llama2:7b-chat", 
  "prompt": "What is the capital city of Poland?" 
}'

The API returns newline-separated JSON objects, chunk by chunk, as the model generates the response:

{ 
    "model": "llama2:7b-chat", 
    "created_at": "2025-04-02T15:19:17.1569954Z", 
 
    "response": "The", 
    "done": false 
} 
{ 
    "model": "llama2:7b-chat", 
    "created_at": "2025-04-02T15:19:17.268992Z", 
    "response": " capital", 
    "done": false 
} 
{ 
    "model": "llama2:7b-chat", 
    "created_at": "2025-04-02T15:19:17.3796491Z", 
    "response": " city", 
    "done": false 
} 
... 
{ 
    "model": "llama2:7b-chat", 
    "created_at": "2025-04-02T15:19:21.3106413Z", 
    "response": " Warszawa", 
    "done": false 
} 
{ 
    "model": "llama2:7b-chat", 
    "created_at": "2025-04-02T15:19:21.4619772Z", 
    "response": ").", 
    "done": false 
} 
{ 
    "model": "llama2:7b-chat", 
    "created_at": "2025-04-02T15:19:21.6296267Z", 
    "response": "", 
    "done": true, 
    "done_reason": "stop", 
    "total_duration": 5337417000, 
    "load_duration": 8625100, 
    "prompt_eval_count": 28, 
    "prompt_eval_duration": 854952300, 
    "eval_count": 15, 
    "eval_duration": 4472807400 
}

If you set stream: false, the response is a single JSON object:

curl http://localhost:11434/api/generate -d '{ 
  "model": "llama2:7b-chat", 
  "prompt": "What is the capital city of Poland?", 
  "stream": false 
}

You can also set a number of model parameters such as temperature by adding field options:

curl http://localhost:11434/api/generate -d '{ 
  "model": "llama2:7b-chat", 
  "prompt": "What is the capital city of Poland?", 
  "options": { 
 "temperature": 0.2   
  } 
  "stream": false 
}'

4. Customize Models: Ollama supports a Dockerfile-like syntax called a Modelfile to create custom LLM variants. These let you:

- Start from an existing model (like llama3)

- Add custom system prompts

- Inject user-defined data (e.g., instructions, context)

- Set model parameters, like temperature

Here is the simple example how you can create your custom assistant for processing insurance documents:

FROM llama2:7b-chat 
 
PARAMETER temperature 0.7 
 
SYSTEM """ 
 
You are an assistant that extracts insurance-related information from a given input text. 
 
You must extract and return only the following fields: 
- policy_number 
- insurance_period 
- insured (company or person name) 
- nip (tax identification number) 
- address (of the insured) 
 
Return the output as a **clean JSON object** -- not as a string, not inside quotes, and without any commentary. If a field is missing, use "Not found". 
Example output format: 
{ 
  "policy_number": "...", 
  "insurance_period": "...", 
  "insured": "...", 
  "nip": "...", 
  "address": "..." 
} 
 
""" 
TEMPLATE """ 
{{ .System }} 
Input: 
{{ .Prompt }} 
Response: 
""" Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut elit tellus, luctus nec ullamcorper mattis, pulvinar dapibus leo.

To use Makefile, save it in a directory, e.g. insurance-assistant and create the custom model:

ollama create insurance-assistant -f insurance-assitant/Modelfile

Then, you can use your model by providing the proper model name in a request:

 curl http://localhost:11434/api/generate -d '{ 
  "model": "insurance-extractor", 
  "prompt": "", 
  "stream": false 
}'

Ollama is purely CLI-based, so there’s no graphical interface. However, this makes it powerful for automation – you can pipe input/output, log responses to files, or call the Ollama API from code. In summary, with just a few commands, you have a privacy-protecting LLM running on your PC, ready to answer questions or assist in coding, all without any internet connection needed.

Getting Started with LM Studio (Desktop App)

LM Studio is a user-friendly desktop application that lets you download and run local LLMs via a graphical interface. It’s cross-platform (Windows, macOS, Linux) and ideal for beginners who prefer not to use the command line. With LM Studio, you can chat with models in a nice UI, manage model downloads, and even run a local server to use the model in other apps.

1. Install and Launch LM Studio: Download the installer for your OS from the LM Studio website and install it. After installation, launch the LM Studio app. The first time you open it, you’ll be prompted to download an AI model. You can choose from a list of popular open-source models. For example, you might select a smaller model like “Mistral 7B” or an instruction-tuned Llama2 variant to start.

2. Run Your First Chat: Once the model is downloaded, LM Studio will load it into memory. You can then start a new chat session in the app. The interface typically has a text box where you can enter your prompt or question, and the model’s response will appear in the chat window. Simply type a query (for example: “What’s the capital of France?” or “Explain quantum physics simply.”) and hit Enter. The AI’s answer will be displayed as the “Assistant” reply in the chat. LM Studio conveniently shows the generation metrics:

- number of input and output tokens,

- tokens per second – you can see how fast the model is generating text,

- context occupancy,

- system resources usage (RAM and processor usage).

3. Explore the Features: The LM Studio GUI provides additional features accessible to both beginners and advanced users:

- Model Library: A “Discover Models” or catalog section where you can download new models or update existing ones. You’re not limited to one model – you can have multiple models stored and switch between them. This means you have a wide selection: from small 3B parameter models for speed, up to 70B models if your system can handle them.

- Chat Interface: The main chat screen (as shown above) is where you interact with the model. Each new prompt you enter is answered by the model in a conversational format. You can have multi-turn dialogues, just like chatting with ChatGPT. There’s no need to manage a prompt history manually – the app keeps the conversation context.

- Advanced Settings: On the side panel, LM Studio offers configuration knobs for those who want more control. You can set a system prompt (a role or instruction that guides the AI’s behavior globally), adjust generation settings like temperature (creativity vs. consistency) and top-p or top-k sampling for controlling randomness, max tokens for responses, etc. These options let you fine-tune how the model responds without writing any code. For instance, you could set a system instruction like “You are a helpful coding assistant,”. This is a friendly way to customize behavior, though it’s not as extensive as programmatic control in a CLI tool.

Advanced settings – simple example of AI assistant for processing insurance documents

- Local API Server: For developers, LM Studio includes a “Local LLM Server” mode. Just switch to Developer tab, choose the model, and toggle Start button. It enables an API endpoint on localhost that mimics the OpenAI API, allowing other programs to send requests to your local model. This is powerful if you want to integrate the local LLM into your own applications (for example, connecting a chatbot UI or using the model for AI features in an IDE) while still benefiting from privacy and not relying on external services.

Developer tab – you can enable local LLM server hosting your customized LLM.

Using LM Studio is as simple as chatGPT – type and get answers – but entirely running on your hardware. The user-friendly interface lowers the barrier to entry, since you don’t need to use the terminal or remember commands. You get immediate, interactive AI responses, with buttons and menus to manage everything.

Ollama vs. LM Studio: Tool Comparison

Both Ollama and LM Studio let you run LLMs locally, but they cater to slightly different audiences and use-cases. Here’s a comparison of key aspects to help you understand their differences:

- Interface & Ease of Use: LM Studio provides a polished graphical user interface, which makes it extremely approachable for beginners. It’s point-and-click with an integrated chat window, so no technical knowledge is required to get started. Ollama, on the other hand, is a command-line interface (CLI) tool (with an optional REST API). It offers a lot of power and flexibility but does require comfort with the terminal to use effectively. Beginners might find Ollama’s learning curve steeper, whereas LM Studio feels more plug-and-play.

- Supported Models: Both tools support a wide range of open-source LLMs. LM Studio can load any model in GGUF format (the standard for llama.cpp), meaning models like Llama 2 (7B, 13B, 70B), Mistral, Vicuna, Alpaca, CodeLlama, etc., as long as you have the hardware for them

- Use Cases Suited: Because of the above differences, LM Studio is excellent for users who want a personal ChatGPT-like assistant on their PC with minimal setup. It’s great for interactive Q&A, brainstorming, or casual use – you launch it when you need it, type queries, get answers. Ollama is ideal for developers or those who want to incorporate LLMs into projects or workflows. If you plan to experiment with prompts in scripts, fine-tune model behaviors, or build an app (like a chatbot, a coding assistant integration, etc.) that calls a local model, Ollama’s CLI and API give you that flexibility.

Conclusion and Recommendations

Deploying LLMs locally has opened up a world of possibilities for developers and enthusiasts. We’ve discussed Ollama and LM Studio – two excellent tools that make local AI accessible. To recap some guidance on choosing between them:

- Choose LM Studio if you want a plug-and-play AI chat experience with a friendly GUI. It’s perfect for beginners or those who prefer not to tinker with command lines. You get quick setup, easy model downloads, and a nice chat interface for interactions. This might be best for someone who just wants an “offline ChatGPT” for personal use, note-taking, or idea generation without fussing over configurations. It’s also a convenient way to demo LLM capabilities to non-technical users (since it feels like a normal app).

- Choose Ollama if you want more control, automation, or integration. Developers and power users will appreciate its flexibility – you can script it, run it headless on a server, integrate the local LLM into your own apps via the API, and fine-tune model behavior with Modelfiles . If you’re comfortable with a terminal and want to customize how the AI works (beyond what a GUI allows), Ollama is a better fit. It’s also lightweight if you intend to run background AI services continuously.

Finally, remember that the LLM itself (the model you choose) is as important as the tool. Spend time finding a model that suits your task – whether it’s a concise summarizer or a creative storyteller – and fits your hardware. Both Ollama and LM Studio make it easy to swap models, so you’re not locked in. The ecosystem of open-source models is growing rapidly, which means running a powerful AI on your own device is only getting easier and more common.

In summary, deploying LLMs locally with these tools gives you the best of both worlds: AI capabilities similar to cloud services, but with privacy, control, and zero ongoing cost. Whether you go with a command-line power tool like Ollama or a user-friendly app like LM Studio, you’ll be joining the cutting edge of local AI development. Happy experimenting, and enjoy your new personal AI running right on your machine!

Artykuł Deploying LLMs Locally: A Guide to Ollama and LM Studio pochodzi z serwisu Inero Software - Software Consulting.

What are AI Agents and how can they help your company

Marta Kuprasz — Fri, 28 Feb 2025 09:51:15 +0000

The term artificial intelligence has been prominently featured in numerous publications as a solution to challenges related to efficiency, organization, and creativity. Many companies are following this trend, striving to incorporate AI-driven solutions into their offerings. These efforts take various forms. In this article, we will take a closer look at AI Agents, which can provide valuable support, particularly in back-office processes.

For some time now, we have been observing a significant rise in the popularity of terms related to the use of artificial intelligence. So, let’s start from the beginning.

What is "Artificial Intelligence"?

The term “artificial intelligence” encompasses Large Language Models (LLMs), natural language processing (NLP) systems, machine learning algorithms, neural networks, and generative AI models.

LLMs, such as ChatGPT from OpenAI or Gemini from Google, are models trained on vast datasets that can analyze, process, and generate text in a way that mimics human reasoning. They are used in various applications, ranging from chatbots and voice assistants to advanced systems supporting business analysis and process automation in companies.

Artificial intelligence is not limited to text processing. Modern models can also analyze images, audio, video, and numerical data, making them highly versatile tools in business. AI enables not only the automation of repetitive tasks but also the detection of patterns in large datasets, trend forecasting, and support for strategic decision-making in companies.

Who are AI agents?

“AI agents” are intelligent systems based on machine learning algorithms, natural language processing (NLP) models, and Large Language Models (LLMs). Their purpose is to automate processes, support decision-making, and interact with users in a natural and context-aware manner.

This means that virtual assistants are based on well-known and widely used LLMs such as ChatGPT, Gemini, Claude, Mistral, or DeepSeek, which can generate coherent responses, analyze texts, and adapt to the context of a conversation.

However, AI agents differ from language models in that they are designed to perform specific tasks autonomously. In practice, this means they are equipped with additional modules that enable them to gather information, process data in real-time, and make decisions based on business rules.

Unlike traditional chatbots, AI agents not only answer questions but can also handle complex processes, integrate with enterprise systems, and learn from user interactions. As a result, they are used in various areas, from administrative support and document analysis to the automation of operational processes in enterprises.

Also read: Meet Your Personal AI Agent – A Case Study for a Freight Company

The operation of AI agents is based on several key components:

- Communication interface – allows the agent to interact with users through text, speech, or other data formats.
- Decision engine – based on AI models and business rules, it enables situation analysis and the selection of optimal actions.
- Integration with external systems – AI agents often operate in conjunction with databases, business applications (ERP, CRM), or cloud services, allowing them to access up-to-date information.
- Process automation – they can perform specific tasks, such as generating reports, processing requests, sending notifications, or initiating predefined processes in IT systems.

What are the types of AI agents?

AI agents may take various forms depending on their application and level of autonomy. Leveraging advanced artificial intelligence models, they can assist users in a wide range of activities, from customer support to data analysis and business process management.

We can distinguish several main types of AI agents:

- Conversational agents – include chatbots and voicebots that interact with users through text or speech. They can answer questions, handle customer inquiries, and support sales processes.
- Analytical agents – specialize in processing and interpreting data. They use machine learning algorithms to analyze trends, detect anomalies, and generate reports.
- Operational agents – automate business tasks by integrating with enterprise systems. They can manage documentation, process documents, or coordinate activities within corporate processes.
- Autonomous agents – operate independently, making decisions based on collected data and predefined business rules. They are used in areas such as logistics, resource management, and dynamic operational planning.
- Decision-support agents – provide recommendations based on advanced data analysis, helping managers and specialists make strategic decisions.

Each of these types can operate independently or collaborate with other systems, creating a complex AI-driven environment. In the following sections, we will explore specific applications of AI agents and their impact on the operational efficiency of businesses.

Cloud or on-premise solution – how can an AI agent be implemented in a corporate environment?

Implementing an AI agent in an organization requires selecting the appropriate deployment model that best meets business, technical, and regulatory requirements. Companies can choose between a cloud-based solution (SaaS) or an on-premise deployment, depending on their needs for flexibility, security, and integration with existing systems.

The choice of the appropriate model depends on various factors, which are presented in the table below.

Comparison: SaaS vs On-Premise

Criterion	SaaS (Cloud)	On-Premise (Local)
Deployment model	Cloud-based (AWS, Azure, Google Cloud)	Operates on the company’s own infrastructure
Infrastructure	Cloud service provider	Local servers
Initial costs	Low	High
Operational costs	Subscription-based	Fixed maintenance and energy costs
Scalability	Very high	Limited (dependent on hardware)
Data security	Limited (processed outside the company)	High (full control over data)
Regulatory compliance	May require additional agreements and certifications	Easier to meet regulatory requirements
Ease of implementation	Easy and fast	Requires hardware purchase and setup
Updates and maintenance	Automatic, provided by the vendor	Self-managed updates and maintenance
Integration with enterprise systems	Strong API support and pre-built integrations	Full control but may require additional integration

The choice of the appropriate deployment model—cloud-based or on-premise—depends on the company’s specific requirements regarding security, costs, and integration with existing systems. Regardless of the chosen strategy, AI agents can significantly enhance operational efficiency and allow employees to focus on tasks that require creativity and strategic thinking.

The development of AI technology is undoubtedly one of the strongest technological trends in recent years. Therefore, it is worth considering now how AI agents can support your company’s growth and become a key element of its digital transformation.

We will create an AI Agent for your company.

Artykuł What are AI Agents and how can they help your company pochodzi z serwisu Inero Software - Software Consulting.

Meet Your Personal AI Agent: A Case Study for a Freight Forwarding Company

Marta Kuprasz — Fri, 21 Feb 2025 11:27:19 +0000

AI-driven tools are becoming increasingly prevalent across various industries, streamlining processes from simple graphic design and translations to advanced document, email, and database analysis. In this article, we will present a practical business application of an AI assistant in action.

AI Agents have a wide range of applications, and their full potential is still being discovered. The main advantages of AI-powered assistants include:

1. Automating Routine Processes

AI agents can handle repetitive tasks such as customer inquiries, document analysis, and data management. By automating these processes, businesses can reduce operational costs and improve efficiency.

2. Personalized Customer Interactions

By analyzing data, AI agents can provide personalized recommendations and tailored offers, enhancing customer engagement and improving overall user experience.

3. Speed and Availability

AI operates 24/7, delivering instant responses and real-time support. This is particularly valuable in industries that require quick reaction times, such as e-commerce, finance, and logistics.

4. Advanced Data Analysis

AI-powered agents can process vast amounts of data in a short time, identifying patterns and correlations that support better business decision-making.

5. Optimizing Decision-Making Processes

With predictive modeling, AI assists in demand forecasting, risk management, and supply chain optimization, helping organizations make more informed decisions.

6. Seamless Integration with Existing Systems

Modern AI solutions can be easily integrated into existing ERP, CRM, and analytics platforms, enhancing their capabilities and improving overall system efficiency.

A Practical Example of AI Agent Use in the Transport Industry

AI agents can be applied across various industries, including banking, sales, and human resource management. In this text, we will focus on a freight forwarding company that handles anywhere from a few to dozens of shipments daily.

Freight forwarders deal with constant communication and the verification of numerous documents. Each of these tasks takes time—a resource that is often in short supply—making errors more likely when the workload is high.

How can time management be improved? By automating repetitive and predictable tasks. This is where an AI Agent comes in. Here’s an example of an AI assistant we developed, powered by Google’s Large Language Model, Gemini.

One possible application is the following scenario:

A freight forwarder receives an email that should include an insurance policy along with proof of payment. The AI Agent automatically, without needing to be prompted, checks whether the email contains the required attachments. If they are included, it proceeds to verify the following details:

In the Insurance Policy:

- - Policy number
  - Insurance period and whether it is currently valid
  - Insured party details, including tax identification number and address
  - Bank account number for premium payment

In the Payment Confirmation:

- - Payment reference
  - Amount
  - Bank account number
  - Payment date
  - Whether the transfer corresponds to the submitted policy (e.g., based on the reference, account number)

The AI Agent then transfers the extracted data into a designated Excel file, which is continuously updated. The data file can be formatted accordingly, for example, by highlighting entries in red where the insurance policy is invalid or the payment has not been verified.

In this simple way, instead of searching through their inbox for the right emails, the freight forwarder can check the Excel file to see if the documents have been received from a specific sender and whether they are correct. This saves a significant amount of time and ensures data accuracy.

There are many ways to further develop our AI Assistant. It can be integrated with other tools, such as Slack or other communication platforms, to send notifications about missing documents or generate automated email responses. An AI-powered agent can be tailored to the specific needs of a company, a department, or even an individual role.

Do you want to explore the possibilities of AI Agents?

Schedule a meeting. We’d be happy to discuss the possibilities.

SCHEDULE A MEETING

Artykuł Meet Your Personal AI Agent: A Case Study for a Freight Forwarding Company pochodzi z serwisu Inero Software - Software Consulting.

Assessing Retrieval-Augmented Generation (RAG) Large Language Models (LLMs) with DeepEval for Complex Tabular Data

Martyna Mul — Tue, 04 Feb 2025 10:33:15 +0000

Retrieval-Augmented Generation (RAG) models are transforming the capabilities of intelligent assistants, enabling more accurate and context-aware responses to user queries. Unlike traditional large language models (LLMs), RAG-based systems integrate two essential components: a retrieval mechanism that fetches relevant documents and a generative model that synthesizes responses based on real-time data. This post explores how DeepEval helps systematically assess the effectiveness of both retrieval and generation components, ensuring more reliable machine-generated insights.

While RAG-enhanced virtual assistants significantly improve answer relevance, evaluating their performance remains a challenge. Since these models rely on both retrieval and text generation, a weak document-fetching step can lead to misleading or incorrect responses, even if the underlying LLM is highly advanced.

We’ll demonstrate this process using our custom AI-driven assistant, designed to answer complex queries about maritime economy statistics, showcasing how LLM-powered knowledge retrieval enhances data-driven decision-making.

SeaStat - Our AI Assistant

A great example that we can use to discuss this topic is the SeaStat AI Assistant developed by us as part of the Incone60 Green Project (https://www.incone60.eu/). The goal of the project is to improve the competitiveness and sustainable development of small seaports in the South Baltic region.

During Incone60 Gren Project we have developed an AI assistant that answers questions about maritime economy data, providing instant access to structured maritime economic insights. This assistant leverages a Retrieval-Augmented Generation (RAG) approach, ensuring that responses are grounded in a structured database covering key aspects such as seaports, maritime transport, shipbuilding, passenger traffic, trade, and the fishing industry.

Our AI assistant operates within a RAG pipeline that integrates:

A structured maritime economy database, which includes global and Polish maritime statistics from 2017 to 2020. The data is sourced from publications by Gdynia Maritime University, which aggregate statistics from various government institutes, universities, and port enterprises. The database consists of 50 tables, covering key aspects of maritime transport and is planned to be further extended with additional years.

Dynamic SQL generation to extract relevant information from the database.

A generative LLM that formulates answers based on the retrieved data.

Building such an assistant requires several key decisions and parameter optimizations, including:

Selecting the most suitable LLM model and tuning parameters (e.g., temperature).

Designing an effective prompt structure.

Ensuring the assistant consistently selects the most relevant tables from the dataset.

This is where automatic testing becomes crucial. It helps assess system performance, identify weaknesses, and ensure continuous improvement.

LLM-as-a-Judge: Automating RAG Model Evaluation

Evaluating systems that generate non-deterministic, open-ended text outputs can be challenging because there is often no single “correct” answer. While human evaluation is accurate, it can be costly and time-consuming.

LLM-as-a-Judge is a method that approximates human evaluation by rating the system’s output based on custom criteria tailored to your specific application. One such testing framework is DeepEval, which provides a set of metrics designed for both retrieval and generation tasks and allows you to create your own rating criteria.

Key evaluation metrics are:

G-Eval: A versatile metric that evaluates LLM output based on custom-defined criteria.

Answer Relevancy: Measures how well the model’s response addresses the user query.

Faithfulness: Assesses how accurately the response aligns with the provided context, helping to limit hallucination in RAG systems.

ContextualRecallMetric, ContextualPrecisionMetric, ContextualRelevancyMetric: These metrics are particularly useful for RAG systems, evaluating whether retrieval components return all relevant context while avoiding irrelevant information.

Step-by-Step RAG Model Testing with DeepEval

To ensure the reliability and accuracy of our Retrieval-Augmented Generation (RAG) model, we follow a structured evaluation approach. This process involves dataset creation, response generation, and model evaluation using DeepEval, allowing us to systematically assess the effectiveness of both retrieval and generation components. Let’s break down each step in detail.

1. Dataset Creation

To evaluate performance, we create a test set consisting of:

– Realistic questions that users might ask. These can range from simple fact-based queries to more complex, multi-step inquiries that require detailed answers drawn from multiple tables.
– Expected ground truth responses derived directly from the database.

2. Generating Model Responses

For each test query, the assistant generates an answer based on the relevant data retrieved from the database.

3. Evaluation using DeepEval

We are particularly focused on factual correctness for our assistant, so we use the G-Eval metric to evaluate this aspect.

We need to define G-Eval by describing testing criteria, e.g.:

correctness_metric = GEval(     
    name="Correctness",      
    evaluation_steps=[   
        "Assess whether the actual output is accurate in terms of facts compared to the expected output.",       
        "Penalize missing information."   
    ],       
    evaluation_params=[   
       LLMTestCaseParams.INPUT,    
       LLMTestCaseParams.ACTUAL_OUTPUT,    
       LLMTestCaseParams.EXPECTED_OUTPUT   
    ],     
)

Additionally, we use several built-in metrics:

contextual_precision = ContextualPrecisionMetric() 
contextual_recall = ContextualRecallMetric() 
contextual_relevancy = ContextualRelevancyMetric() 
answer_relevancy = AnswerRelevancyMetric() 
faithfulness = FaithfulnessMetric()

We then define test cases:

test_case = LLMTestCase(   
    input=#user prompt,   
    actual_output=#model output here,   
    expected_output=#the ground truth response  

    retrieval_context=#data extracted by retriever, in our case it is data extracted from the database 
)

Here is one of test cases we used to evaluate our SeaStat Assitant:

test_case = LLMTestCase(   
    input='Compare cargo traffic in Suez Canal and Panama Canal in 2019',   
    actual_output= 'In 2019, the cargo traffic data for the Suez Canal and Panama Canal was as follows: Suez Canal - 1031 million tons; Panama Canal - 243059 thousand tons. The Suez Canal had significantly higher cargo traffic compared to the Panama Canal in 2019.'  
    expected_output=' In 2019, the Suez Canal handled 1,031 million tons of cargo, whereas the Panama Canal transported only 243 million tons. This indicates that the Suez Canal carried a substantially higher volume of cargo than the Panama Canal that year.'  

    retrieval_context=[ 

{'table': 'Suez_Canal_Cargo_Traffic', 'year': 2019, 'cargo_volume_million_tons': 1031}, 

{'table': 'Panama_Canal_Cargo_Traffic', 'year': 2019, 'direction': 'Atlantic – Pacific', 'cargo_volume_thousand_tons': 156899}, {'table': 'Panama_Canal_Cargo_Traffic', 'year': 2019, 'direction': 'Pacific – Atlantic', 'cargo_volume_thousand_tons': 86160} 

] 
)

And run evaluation:

assert_test(test_case, [correctness_metric, answer_relevancy, contextual_precision, contextual_recall, contextual_relevancy, faithfulness])

4. Testing results

DeepEval assigns each metric a score between 0 and 1, accompanied by a descriptive explanation of the rating. Below are the results from a test case evaluating SeaStat’s response to the prompt:

“Compare cargo traffic in the Suez Canal and Panama Canal in 2019.”

Metric interpretations:

Contextual Recall (1.0) – The retriever effectively retrieved the necessary information, meaning that almost all essential details from the expected output were present in the retrieval context.

Contextual Relevancy (0.95) and Contextual Precision (1.0) – The retrieved context was highly relevant to the query, showing that the retriever pulled information accurately related to the input.

Faithfulness (1.0) – The model’s response remained perfectly factual, strictly adhering to the retrieved information without introducing any hallucinations.

Answer Relevancy (1.0) – The model’s response fully addressed the user query, ensuring that the answer was on point.

Correctness, (0.78) – the correctness score was slightly lower due to numerical discrepancies caused by rounding.

By systematically analyzing test cases with DeepEval, we gain valuable insights into where our RAG model excels and where improvements are needed. Future optimizations could include refining retrieval strategies, adjusting prompt engineering, or fine-tuning LLM parameters for better factual accuracy.

Test case	Metric	Score	Status	Overall Success Rate
test_case_0	Correctness (GEval)	0.78 (threshold=0.5, evaluation model=gpt-4o, reason=The actual output closely matches the expected output in terms of cargo volumes and comparative conclusion, but the numbers are expressed in different units (thousand tons vs million tons) and slightly differ, which may indicate rounding or conversion discrepancies., error=None)	PASSED	100%
	Answer Relevancy	1.0 (threshold=0.5, evaluation model=gpt-4o, reason=The score is 1.00 because the response thoroughly addressed the comparison of cargo traffic in the Suez Canal and the Panama Canal in 2019 with no irrelevant details included. It’s precise and to the point, showcasing a deep understanding of the topic., error=None)	PASSED
	Contextual Precision	1.0 (threshold=0.5, evaluation model=gpt-4o, reason=The score is 1.00 because the relevant nodes, offering essential data for comparing cargo traffic in the Suez and Panama Canals in 2019, are perfectly ranked at the top. These nodes effectively deliver a comprehensive breakdown of cargo volumes through both canals during that year, ensuring accurate comparisons can be made efficiently., error=None)	PASSED
	Contextual Recall	1.0 (threshold=0.5, evaluation model=gpt-4o, reason=The score is 1.00 because every sentence in the expected output aligns perfectly with the data from the nodes in the retrieval context, effectively illustrating the significant difference in cargo volumes handled by both canals. Well done on maintaining precise and accurate attention to detail!, error=None)	PASSED
	Contextual Relevancy	0.95 (threshold=0.5, evaluation model=gpt-4o, reason=The score is 0.95 because although the context is rich with detailed data on Suez Canal cargo traffic, it lacks specific information on the Panama Canal’s cargo traffic, necessitating additional data for a complete comparison., error=None)	PASSED
	Faithfulness	1.0 (threshold=0.5, evaluation model=gpt-4o, reason=Awesome job! The score is 1.00 because there are no contradictions present, showcasing perfect alignment and faithfulness of the actual output to the retrieval context. Keep up the excellent work!, error=None)	PASSED

Evaluating Retrieval-Augmented Generation (RAG) models requires a structured approach to ensure both retrieval accuracy and response reliability. LLM-as-a-Judge provides an efficient alternative to human evaluation by systematically assessing outputs based on predefined criteria, enabling scalable and cost-effective validation.

Using DeepEval, we tested our AI-driven SeaStat Assistant against key evaluation metrics, including Correctness (G-Eval), Answer Relevancy, Contextual Precision, Contextual Recall, Contextual Relevancy, and Faithfulness. The results highlighted minor discrepancies in numerical representation, missing contextual details, and retrieval precision—insights crucial for refining model performance.

These findings emphasize that even high-performing RAG models require rigorous evaluation to ensure factual accuracy and prevent misleading outputs. By automating this process, we enable continuous model improvement, ensuring AI-driven assistants deliver reliable, context-aware insights at scale.

AI-powered assistants are undoubtedly a technology that will become an indispensable tool for employees at all levels—from executives and directors to specialists. Their dynamic development allows them to instantly adapt to business needs and evolving expectations.

We create reliable AI assistants

If you're looking for a company to help you implement an AI-based solution, reach out to us. We’d be happy to discuss your idea.

Artykuł Assessing Retrieval-Augmented Generation (RAG) Large Language Models (LLMs) with DeepEval for Complex Tabular Data pochodzi z serwisu Inero Software - Software Consulting.

A year under the sign of artificial intelligence development

Marta Kuprasz — Mon, 18 Dec 2023 10:32:30 +0000

The end of the year is a time for summaries. In the world of IT, many interesting things have happened, so in this article, we decided to focus on AI. The development of artificial intelligence and its media presence accelerated to an unprecedented scale. Tools based on Large Language Models (LLMs) have been popularized and made widely available to users from various industries, not just technological ones. We decided to summarize the year with Andrzej Chybicki, the CEO of Inero Software. Here is the list he identified as the key 5 events of the past year.

Fact 1: OpenAI – artificial intelligence becomes widely accessible

OpenAI played a tremendous role in popularizing the field of artificial intelligence in the context of human language understanding. In 2022, they released ChatGPT, and in the following months, they presented new, improved models. These advancements not only improved the performance of existing applications but also opened new avenues for AI in healthcare, environmental science, administration, marketing, and more.

In 2023, ChatGPT saw remarkable advancements, featuring enhanced learning algorithms for improved accuracy and nuanced conversations, personalized user interactions, expanded language support for global accessibility, and broader application integration. OpenAI emphasized ethical considerations and bias reduction, incorporated real-time learning for up-to-date content, improved multimedia interaction capabilities, and boosted the tool’s robustness and reliability. Additionally, ChatGPT was tailored for specific industries, providing specialized functionalities and knowledge, marking a significant leap in AI technology and user-centric applications.

Expert Insight

OpenAI was the first widely recognized large language model. In the coming years, we are likely to see various versions of LLMs designed for specific applications – in fact, this has been happening for a few months now. OpenAI, despite being a pioneer, at least in terms of recognizability, is not always considered the best model for everything. The direction of development is certainly popularization in a similar way as it was with computers (i.e., LLMs like PCs) and specialization, meaning specialized language models designed for specific applications or even entities or people.

Fact 2: GitHub Copilot – a leader in AI/LLM implementation

One of the key roles in the development of artificial intelligence is played by Microsoft, which collaborates with OpenAI. Over the past year, Microsoft has continued to refine its vision of Microsoft Copilot. Let’s focus on the solution for developers: GitHub Copilot. In 2023 it underwent significant changes and enhancements. Here are the key updates:

In 2023, GitHub Copilot introduced several significant enhancements to bolster its role in AI-driven software development. The GitHub Copilot Chat, now generally available and powered by OpenAI’s GPT-4, provides more accurate code suggestions and explanations, using natural language to aid developers in various languages. This feature is integrated with both the GitHub platform and its mobile app, supporting coding, pull requests, and documentation. Additionally, GitHub Copilot Enterprise was introduced to tailor the tool to specific organizational needs, helping developers quickly adapt to their organization’s codebase and streamline tasks like documentation and pull request reviews, aimed at boosting enterprise-level productivity and security. The GitHub Copilot Partner Program was launched, integrating Copilot with various third-party developer tools and services, thereby creating a broad ecosystem that enhances the capabilities of developers using AI. Finally, GitHub unveiled new AI-powered security features in its Advanced Security suite, including a real-time vulnerability prevention system and application security testing features to detect and remediate code vulnerabilities and secrets, further securing the software development process.

Expert Insight

Thanks to its collaboration with OpenAI, Microsoft became a leader in AI/LLM implementation worldwide in 2023. Microsoft’s strategy in this area is based on using the LLM model to support (but not replace) as many activities and processes using Microsoft products as possible. Particularly important was ensuring an appropriate level of SLA (aligned with other Azure services) and data security. Among the most significant changes, apart from the mentioned GitHub Copilot (which aims to support developers in coding), are Copilot plugins available in practically all of this company’s flagship products (Word, Excel, PowerPoint, Outlook).

In December 2023, Microsoft also presented the CoPilot Studio solution, which enables the creation of low-code/no-code IT systems with significant support from the OpenAI model. This effectively allows for the easy expansion of existing Azure low-code solutions such as Azure Agents with conversational bots or AI-supported database adapters. Although CoPilot Studio is not yet available in its final form, Microsoft clearly communicates development directions and the advantages that developers, engineers, and users can experience from its use. From the presentations of Microsoft representatives, it can be inferred that Microsoft’s goal is to lower the entry threshold for creating and implementing new advanced AI solutions, as using low-code platforms does not require as deep technical knowledge as traditional coding. We can expect widespread interest in these solutions not only from the largest companies using MS Azure in the coming years. Currently, among experts, the question is not “whether to use AI” but how to implement it to not fall behind the competition. Those entities that create a coherent strategy for incorporating AI-based products into their processes in the coming years will be able to significantly benefit from the revolution that is already taking place.

Fact 3: The European AI Act: A Regulatory Milestone

On 14 June 2023, the European Parliament adopted its negotiating position on the AI Act. Parliament’s priority is to make sure that AI systems used in the EU are safe, transparent, traceable, non-discriminatory and environmentally friendly. Parliament also wants to establish a technology-neutral, uniform definition for AI that could be applied to future AI systems. The AI Act sets different rules for different AI risk levels.

The new rules establish obligations for providers and users depending on the level of risk from artificial intelligence. While many AI systems pose minimal risk, they need to be assessed.

Unacceptable risk

Unacceptable risk AI systems are systems considered a threat to people and will be banned. They include:

Cognitive behavioral manipulation of people or specific vulnerable groups: for example voice-activated toys that encourage dangerous behavior in children
Social scoring: classifying people based on behavior, socioeconomic status or personal characteristics
Real-time and remote biometric identification systems, such as facial recognition

Some exceptions may be allowed: For instance, “post” remote biometric identification systems where identification occurs after a significant delay will be allowed to prosecute serious crimes but only after court approval.

High risk

AI systems that negatively affect safety or fundamental rights will be considered high-risk and will be divided into two categories:

1) AI systems that are used in products falling under the EU’s product safety legislation. This includes toys, aviation, cars, medical devices and lifts.

2) AI systems falling into eight specific areas that will have to be registered in an EU database:

Biometric identification and categorisation of natural persons
Management and operation of critical infrastructure
Education and vocational training
Employment, worker management and access to self-employment
Access to and enjoyment of essential private services and public services and benefits
Law enforcement
Migration, asylum and border control management
Assistance in legal interpretation and application of the law.

All high-risk AI systems will be assessed before being put on the market and also throughout their lifecycle. For more information, visit the European Parliament website.

*source: https://www.europarl.europa.eu

Expert Insight

Ensuring security and confidentiality of data is certainly one of the most important issues concerning the implementation of AI solutions. Many experts indicate that despite the good intentions of the European Commission, the proposed solutions may contribute to reducing the competitiveness of the domestic AI market, which in effect will increase the distance between Europe and leaders in this field (i.e., the USA and China). I personally share these concerns. Here, a good example might be the similar situation that occurred about 15 years ago when cloud computing was being implemented. At that time, the EU also created a regulation governing the rules of access and data confidentiality (GDPR), which to this day is the regulatory basis in this area. At the same time, the largest solutions that most in the EU use are those developed in the USA, where the priority was the free development of technology, and only secondarily the legal framework. Unfortunately, many indications suggest that a similar situation might occur with AI.

Fact 4: Gemini: new model from Google

Without a doubt, the launch of Gemini was the most prominent premiere in the latter part of 2023, generating significant buzz. It is a result of large-scale collaborative efforts by teams across Google. It was built from the ground up to be multimodal, which means it can generalize and seamlessly understand, operate across, and combine different types of information including text, code, audio, image, and video.

Gemini 1.0 was trained to recognize and understand text, images, audio, and more at the same time, so it better understands nuanced information and can answer questions relating to complicated topics. This makes it especially good at explaining reasoning in complex subjects like math and physics.

During the presentation on the release of the Gemini API for developers, a lot of time was dedicated to AI Studio, a browser-based, free tool for code creation. The second focus was on Vertex AI, a more advanced program that allows for “both training and deploying ML (machine learning) models and AI applications.” Google offers the option to transfer a preliminary project developed in AI Studio to Vertex AI, to add additional features available within the larger platform of Google Cloud.

Expert Insight

Google has officially joined the large language model (LLM) race. The most intriguing aspect of what they propose is that their model will operate in three versions: Ultra (the most feature-rich), Pro, and Nano, with the latter being designed for mobile phones. It’s still unclear whether Nano will run entirely on client devices (smartphones) or if it will simply be a thin client and a kind of extension of Google Assistant. It’s also worth emphasizing that Google, like Microsoft, will offer Gemini services as elements of its flagship products, such as Google Sheets, Google Docs, and others.

Fact 5: Advancements in Natural Language Processing (NLP)

2023 witnessed remarkable progress in the field of Natural Language Processing. Researchers and companies globally made significant strides in improving the accuracy and versatility of NLP models. These advancements have led to more sophisticated understanding and the generation of human language by machines, paving the way for more intuitive and natural human-computer interactions. This year saw the deployment of advanced NLP in various applications, from customer service chatbots to complex data analysis tools, revolutionizing how we interact with technology daily. This progress in NLP technology not only enhanced existing applications but also opened new possibilities for AI in fields such as education, content creation, and multilingual communication.

Expert Insight

AI technologies are increasingly breaking the barrier of understanding natural language, gradually blurring the line between structured data previously used in IT systems and human knowledge. It seems that the creation of AGI (Artificial General Intelligence), a machine matching or even surpassing the average human in many aspects, is now just a matter of time. The challenge for the world of science, business, and politics will now be to direct the development of AI in a way that serves the broadly understood humanity and does not cause threats that many (probably rightly) fear.

The last 12 months have been rich in interesting AI releases. The presentation of new large language models has opened up a range of possibilities for their implementation in everyday tasks, both in programming work and creative teams. European authorities are trying to keep up with these changes and adapt legal regulations to be in line with the current technological situation. In the coming months, we will certainly see more premieres, as leading players like Google and Microsoft compete to create solutions that utilize artificial intelligence.

Artykuł A year under the sign of artificial intelligence development pochodzi z serwisu Inero Software - Software Consulting.