ACCURACY OVERVIEW

Accuracy assesses whether Generative AI answers consumers' questions correctly. For each product or service, the LLMs are questioned about their knowledge of particular product or service features from the consumer's standpoint.

In this framework, LLM Accuracy for a travel brand transcends basic queries. It focuses on the model's ability to understand and consistently report on highly specific details that potential customers are asking. The methodology is designed to test "long-tail type questions" that go beyond general information.

Accuracy for DMO

Accuracy for Hotels

The Accuracy module is also where you can find the hundreds of prompts that Bonafide uses to determine your accuracy score.

Examples of Specific Queries:

For a hotel: Instead of asking if there is a pool, a more specific prompt would be, "Does your pool have a lifeguard, and what time does the pool open?"
For a destination: Rather than asking if a festival exists in a certain month, a detailed query would be, "Do they have this festival in February, and is it wheelchair accessible?"

The Scoring Methodology

Accuracy Score and What it Means

Interpretation: How accurately Generative AI applications describe products and services is an important metric for brands to track. Marketers, salespeople, and support staff should be aware of the information Generative AI is providing to consumers. Consensus holds that 95%+ of prompts must be answered correctly for applications to be considered ready for direct purchases in LLM chat interfaces. For perspective, this allows at most one wrong answer in every 20 questions.

Commerce Level	Feature Accuracy (%)	Description
Level 0: Not Ready	< 50% (poor)	LLM lacks sufficient knowledge.
Level 1: Simple Discovery	> 50% (fair)	Capable of simple discovery interactions.
Level 2: Travel Planning	> 70% (good)	Able to provide recommendations and shallow linking.
Level 3: Complex Travel Planning	> 85% (very good)	Supports complex travel planning with recommendations and deep linking.
Level 4: Direct Purchase Capable	> 95% (excellent)	Ready for in-application purchases directly from the chatbot.
Level 5: Agentic	97.5%+ (superior)	Capable of AI agent-to-agent transactions.
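
As a rough illustration, the table above can be read as a lookup from an accuracy percentage to a commerce level. The sketch below is not Bonafide's implementation; the function name commerce_level is hypothetical, and the handling of scores that land exactly on a threshold (for example, exactly 50%) is an assumption, since the table lists "< 50%" and "> 50%" without saying where 50% itself belongs.

```python
# Thresholds are copied from the table above. Boundary handling is an
# assumption: rows written with ">" are treated as strict, and the
# "97.5%+" row is treated as inclusive.
LEVELS = [
    # (threshold, inclusive, label)
    (97.5, True, "Level 5: Agentic"),
    (95.0, False, "Level 4: Direct Purchase Capable"),
    (85.0, False, "Level 3: Complex Travel Planning"),
    (70.0, False, "Level 2: Travel Planning"),
    (50.0, False, "Level 1: Simple Discovery"),
]


def commerce_level(accuracy_pct: float) -> str:
    """Map a feature-accuracy percentage (0-100) to a commerce level label."""
    if not 0.0 <= accuracy_pct <= 100.0:
        raise ValueError("accuracy_pct must be between 0 and 100")
    # Check the highest level first and return the first row that is met.
    for threshold, inclusive, label in LEVELS:
        reached = accuracy_pct >= threshold if inclusive else accuracy_pct > threshold
        if reached:
            return label
    return "Level 0: Not Ready"


if __name__ == "__main__":
    for pct in (16.7, 80.0, 96.0, 97.5):
        print(f"{pct}% -> {commerce_level(pct)}")
```

Under these assumptions, a score of 80% falls in Level 2 (Travel Planning), and a score of exactly 50% stays in Level 0 because the Level 1 row is read as strictly greater than 50%.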

The accuracy score is determined through a systematic, multi-step process designed to compare LLM-generated answers against a brand's verified information.

Establishing the Source of Truth: The Bonafide Agent Crawler, which simulates an LLM crawler, is used to crawl and ingest all publicly available content from a brand's website and other official sources. This compiled data becomes the definitive System of Record or "source of truth." [More about the System of Record in the Curator module]
Posing Targeted Questions: Bonafide asks the LLMs for the top 500 or so questions about the travel brand. These specific, detailed questions are then posed to a panel of LLMs (typically five or more).
Comparing Answers to the System of Record: The answers provided by each LLM are directly compared against the information contained within the established system of record.
Calculating the Accuracy Score: The score is the percentage of LLMs whose answers match the System of Record (aka source of truth).

◦ If 2 out of 5 LLMs match the source of truth, the score is 40%.

◦ If 4 out of 5 LLMs match the source of truth, the score is 80%.
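
The calculation above can be sketched in a few lines. This is a minimal illustration rather than Bonafide's actual code: the names accuracy_score and answers_match are hypothetical, and the matching rule used here (case-insensitive string equality) is an assumption standing in for whatever comparison the platform really performs.

```python
from __future__ import annotations


def answers_match(llm_answer: str, source_of_truth: str) -> bool:
    """Hypothetical matching rule: case-insensitive equality after trimming.

    The real comparison is not described in this article and is likely
    more sophisticated.
    """
    return llm_answer.strip().casefold() == source_of_truth.strip().casefold()


def accuracy_score(llm_answers: dict[str, str], source_of_truth: str) -> float:
    """Return the percentage of LLMs whose answer matches the source of truth."""
    if not llm_answers:
        raise ValueError("at least one LLM answer is required")
    matches = sum(
        1 for answer in llm_answers.values()
        if answers_match(answer, source_of_truth)
    )
    return 100.0 * matches / len(llm_answers)


if __name__ == "__main__":
    # Five models, four of which agree with the System of Record.
    answers = {
        "Model A": "Yes",
        "Model B": "Yes",
        "Model C": "Yes",
        "Model D": "Don't know",
        "Model E": "Yes",
    }
    print(accuracy_score(answers, "Yes"))  # 80.0
```

With six models and a single match, the same function gives 100 × 1/6, which rounds to 16.7%.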

The Role of LLM Consensus

The primary mechanism for scoring accuracy is the direct comparison of LLM answers to the system of record, not a consensus among the LLMs themselves. However, LLM consensus remains an important secondary metric, specifically to evaluate whether the models are aligned with each other when they are matching the System of Record (aka the source of truth).

Interpreting Blank Scores and Identifying Risks

The Anatomy of a Blank Score

A blank score, represented by two dashes (--), appears when two specific conditions are met simultaneously:

Absence in the System of Record: The crawler could not find an answer to the specific question anywhere on the brand's website or in its official content.
Lack of LLM Consensus: When the LLMs searched for an answer outside the system of record (on the broader web), there was no majority agreement among them on a single answer.
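
The two conditions above can be expressed as a simple check. The sketch below is illustrative only: is_blank_score and has_majority_consensus are hypothetical names, "majority" is assumed to mean more than half of the LLMs giving the same answer, and answers are compared by case-insensitive equality, which is a simplification.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional


def has_majority_consensus(llm_answers: list[str]) -> bool:
    """True if one answer is given by more than half of the LLMs (assumed rule)."""
    if not llm_answers:
        return False
    normalized = [answer.strip().casefold() for answer in llm_answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count > len(normalized) / 2


def is_blank_score(source_of_truth: Optional[str], llm_answers: list[str]) -> bool:
    """A blank score (--) requires both conditions at the same time:

    1. the System of Record has no answer (represented here as None), and
    2. the LLMs show no majority agreement on a single answer.
    """
    return source_of_truth is None and not has_majority_consensus(llm_answers)


if __name__ == "__main__":
    scattered = ["9 am", "10 am", "8 am", "9 am", "Don't know"]
    print(is_blank_score(None, scattered))    # True: no record, no majority
    print(is_blank_score("9 am", scattered))  # False: the record has an answer
```

If the System of Record is empty but most LLMs agree on one answer, only the first condition holds, so this check does not report a blank score.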

The 'Red Flag': Implications for Brands

A blank score is considered a "red flag" and a "threat" for a brand because it exposes a critical information gap that is being filled by unreliable sources.

Unauthoritative Information: Real people are asking these questions, and since the brand does not provide the answer, LLMs pull information from external sources like Reddit or Tripadvisor.
Conflicting Answers: Different LLMs provide varying answers to the same user query, leading to customer confusion and a lack of trust. This is described as a "complete misalignment" and can result in "hallucination" where the information provided is entirely incorrect.
Brand Damage: This scenario represents a significant problem where user inquiries are being met with variable and unverified answers from sources outside the brand's control.

How to Navigate to View Details of the Prompts and Score:

Navigate: Accuracy → Property → Click on the Score

Examples in Accuracy Scoring

The following examples illustrate how the accuracy scoring is applied to specific prompts for both hotel and destination brands.

Example 1: High Accuracy (Hotel Accessibility)

Category	Detail
Brand/Location	Four Seasons Hotel New York, New York
Feature Type	Accessibility and Guest Room Security
Specific Feature	Roll-in shower
Exact Prompt	"Does the Four Seasons Hotel New York... have a roll-in shower available?"
System of Record	Yes (found on the Four Seasons website)
Accuracy Score	80%

LLM Performance Breakdown (5 Models Tested):

Large Language Model	Answer Provided	Match?
Meta Llama	Yes	Yes
Anthropic Claude	Yes	Yes
Perplexity	Yes	Yes
OpenAI	Don't know	No
Google Gemini	Yes	Yes

Conclusion: Four out of the five models tested successfully matched the verified answer from the system of record, resulting in a high accuracy score.

Example 2: Low Accuracy (Destination: Local Markets)

Category	Detail
Brand/Location	Rochester, Minnesota
Feature Type	Food, Dining, and Cuisine
Specific Feature	Local markets
Exact Prompt	"What local markets are available in Rochester, Minnesota?"
System of Record	A specific list of four local markets (from Rochester's website)
Accuracy Score	16.7%

LLM Performance Breakdown (6 Models Tested):

Large Language Model	Answer Provided	Match?
Google Gemini	Provided the correct, matching list	Yes
Perplexity	Did not find all the markets	No
Meta Llama	Did not have the same list	No
Anthropic Claude	Did not have the list at all	No
OpenAI	Incorrect	No
OpenAI 4.0	Incorrect	No

Conclusion: Only one of the six models tested was able to replicate the correct list of markets found in the system of record, leading to a very low accuracy score.