
LLMs and GenAI Application Pipelines: Evaluations, Metrics, and Risks

Alert AI: Alerts in LLM Metric Evaluation and Risks

 

Introduction

 

LLMs encounter many issues at runtime, but detecting those issues is not easy. To address this, Alert AI uses Detections. An LLM Alert is a detailed alert that describes an error and provides a recommendation to users and developers. Without alerts, it is much harder to detect errors and vulnerabilities in a model; with them, issues in an LLM are far easier to surface.

 

Alert AI is a Generative AI security platform: it identifies risks, detects threats, and generates Alerts across Generative AI applications, services, environments, and deployments. Alert AI's security analytics pipelines extract features, sessionize, aggregate, and classify events, and continuously generate Alerts, Recommendations, AI Detection & Response, AI Forensics, Compliance & Governance, and a feedback loop into training, tuning, evaluation, and inference pipelines.

 

Many different kinds of LLM Alerts exist to detect issues within an LLM. Each alert identifies an issue with the LLM and carries additional details, including the cause of the alert and a recommendation. An alert fires when a model's metric value does not meet a given criterion or when a security risk occurs. The alerts are listed below.
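
To make these alert fields concrete, here is a minimal sketch in Python of how a threshold-based metric alert could be represented and evaluated. The MetricAlert class, its fields, and the evaluate helper are illustrative assumptions, not Alert AI's actual schema or API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricAlert:
    """Hypothetical alert definition mirroring the fields used below."""
    name: str
    metric: str
    threshold: float
    direction: str          # "below" fires when value < threshold, "above" when value > threshold
    severity: str
    category: str
    recommendation: str

    def evaluate(self, value: float) -> Optional[dict]:
        """Return an alert record if the observed metric value violates the threshold."""
        fired = value < self.threshold if self.direction == "below" else value > self.threshold
        if not fired:
            return None
        return {
            "alert": self.name,
            "metric": self.metric,
            "observed": value,
            "threshold": self.threshold,
            "severity": self.severity,
            "category": self.category,
            "recommendation": self.recommendation,
        }

# Example: the Truthfulness alert fires when the metric drops below 85%.
truthfulness_alert = MetricAlert(
    name="Truthfulness percentage of model is too low",
    metric="Truthfulness",
    threshold=0.85,
    direction="below",
    severity="Major",
    category="Data and Information",
    recommendation="Provide dataset with more truthful information",
)
print(truthfulness_alert.evaluate(0.72))  # -> alert record, since 0.72 < 0.85
```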

 

Truthfulness Alert

  • Name: Truthfulness percentage of model is too low
    • Description: Model’s output contains too much false information
    • Metric/Metrics: Truthfulness
    • Range, Type: Percentage, greater than 85%
    • Recommendation: Provide dataset with more truthful information
    • Severity: Major
    • Class: TruthfulQA
    • Category: Data and Information
    • Cause: Model trained with dataset containing incorrect information
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low truthfulness metric in models. This alert is used for models that specialize in providing factual information.

 

Informative Accuracy Alert

  • Name: Informative Accuracy percentage of model is too low
    • Description: Model’s output has little to no informative accuracy
    • Metric/Metrics: Informative Accuracy
    • Range, Type: Percentage, greater than 80%
    • Recommendation: Tweak model to gather more informative data
    • Severity: Major
    • Class: TruthfulQA
    • Category: Data and Information
    • Cause: Model did not retain enough information from dataset
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low informative accuracy metric in models. This alert is used for models that specialize in providing accurate information.

 

Correctness Alert

  • Name: Correctness percentage of model is too low
    • Description: Model’s output has too much incorrect information
    • Metric/Metrics: Correctness
    • Range, Type: Percentage, greater than 85%
    • Recommendation: Provide dataset with more correct information
    • Severity: Major
    • Class: TruthfulQA, MBPP, HumanEval
    • Category: Data and Information
    • Cause: Model gathered incorrect information from dataset
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low correctness metric in models. This alert is used for models that specialize in providing factual information.

 

Consistency Alert

  • Name: Consistency percentage of model is too low
    • Description: Model’s output has little to no consistency in regards to question
    • Metric/Metrics: Consistency
    • Range, Type: Percentage, greater than 90%
    • Recommendation: Tweak model to be more consistent to prompt
    • Severity: Major
    • Class: TruthfulQA, BIG
    • Category: Question and answer
    • Cause: Model does not understand question given
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low consistency metric in models. This alert is used for models that specialize in answering questions along with conversations between the model and user.

 

Coverage Alert

  • Name: Coverage percentage of model is too low
    • Description: Model does not cover enough topics
    • Metric/Metrics: Coverage
    • Range, Type: Percentage, greater than 90%
    • Recommendation: Provide more training to model with various topics
    • Severity: Major
    • Class: TruthfulQA, BIG
    • Category: Coverage
    • Cause: Model did not train with enough topics
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low coverage metric in models. This alert is used for models that specialize in multiple fields of topics.

 

Calibration Alert

  • Name: Calibration value of model is too high
    • Description: Model keeps changing its output when doubted by the user
    • Metric/Metrics: Calibration
    • Range, Type: Decimal, less than 0.01
    • Recommendation: Tweak model to maintain same output
    • Severity: Major
    • Class: TruthfulQA
    • Category: Calibration
    • Cause: Model has little to no confidence in its responses; calibration value is too high compared to its accuracy
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a poorly calibrated model, i.e., a calibration value that is too high relative to the model's accuracy. Poor calibration shows up as the model constantly changing its answers when challenged. This alert applies to the majority of models, since models require confidence in their answers.
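
The calibration metric compares a model's confidence to its actual accuracy. One common way to quantify that gap is expected calibration error (ECE); the sketch below is a generic illustration, not necessarily the exact formulation behind this alert.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |confidence - accuracy| over confidence bins, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / len(confidences)) * gap
    return ece

conf = [0.9, 0.8, 0.95, 0.6, 0.7]   # model's stated confidence per answer
hit = [1, 0, 1, 1, 0]               # whether each answer was actually correct
print(expected_calibration_error(conf, hit))  # alert if this exceeds the configured threshold
```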

 

Accuracy Alert

  • Name: Accuracy of model is too low
    • Description: Model is not providing accurate responses to given question
    • Metric/Metrics: Accuracy
    • Range, Type: Percentage, greater than 85%
    • Recommendation: Train model with more accurate data
    • Severity:  Major
    • Class: HellaSwag, BIG, MMLU, MLFlow LLM Evaluate, RAGAs, Arize Phoenix, DeepEval
    • Category: Model accuracy
    • Cause: Model did not learn well from training dataset
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low accuracy metric in models. This alert is used for models that specialize in factual information and answering questions.

 

Perplexity Alert

  • Name: Perplexity of model is too high
    • Description: Model cannot determine closest response that best matches question
    • Metric/Metrics: Perplexity
    • Range, Type: Integer, less than 10
    • Recommendation: Train model to make connections
    • Severity: Major
    • Class: HellaSwag, DeepEval
    • Category: Probability and Logistics
    • Cause: Model did not train well with understanding connections
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a high perplexity metric in models. This alert is used for models that specialize in factual information and answering questions.
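
Perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens, so it rises as the model becomes more "surprised" by the data. A minimal sketch of the calculation:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Natural-log probabilities assigned by the model to each observed token.
log_probs = [-0.9, -1.2, -0.4, -2.1, -0.7]
print(perplexity(log_probs))  # higher values mean the model is more "surprised"
```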

 

Log-Likelihood Alert

  • Name: Log-Likelihood of model is too low
    • Description: Model cannot choose correct response to question
    • Metric/Metrics: Log-Likelihood
    • Range, Type: Decimal, Higher log-likelihood
    • Recommendation: Train model to generate increased probability for dataset distribution
    • Severity: Major
    • Class: HellaSwag
    • Category: Probability and Logistics
    • Cause: Model cannot predict trends in dataset distribution well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low log-Likelihood metric in models. This alert is used for models that specialize in dataset trends and predictions.

 

Log-Probability Alert

  • Name: Log-Probability of model is too low
    • Description: Model assigns probabilities to its responses that are too low
    • Metric/Metrics: Log-Probability
    • Range, Type: Decimal, higher log-probability
    • Recommendation: Train model to generate increased probability to detect better responses
    • Severity: Major
    • Class: HellaSwag
    • Category: Probability and Logistics
    • Cause: Model cannot predict trends in dataset distribution well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low log-Probability metric in models. This alert is used for models that specialize in dataset trends and predictions.

 

F1 Score Alert

  • Name: F1 Score of model is too low
    • Description: Model is producing too many false positives and false negatives
    • Metric/Metrics: F1 Score
    • Range, Type: Percentage, greater than 85%
    • Recommendation: Train model to reduce false positives and false negatives
    • Severity: Major
    • Class: HellaSwag, TriviaQA, MLFlow LLM Evaluate, RAGAs, Arize Phoenix, DeepEval
    • Category: False detections
    • Cause: Model cannot tell the difference between true and false detections
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low F1 Score metric in models. This alert is used for models that check information or responses.
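
The F1 score is the harmonic mean of precision and recall, so it drops whenever false positives or false negatives pile up. A minimal sketch from raw counts:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 80 correct detections, 15 false positives, 25 misses.
print(precision_recall_f1(80, 15, 25))  # F1 ≈ 0.80, below the 85% threshold, so the alert fires
```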

 

Information Gain Alert

  • Name: Information Gain value of model is too low
    • Description: Model is not gaining enough information from the dataset
    • Metric/Metrics: Information Gain
    • Range, Type: Percentage, greater than 80%
    • Recommendation: Train model to gather more information from dataset
    • Severity: Major
    • Class: BIG
    • Category: Data and Information
    • Cause: Model did not learn enough information from training dataset
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low information gain metric in models. This alert is used for models that specialize in providing factual information.

 

Response Quality Alert

  • Name: Quality of model’s responses is too low
    • Description: Model is not generating good quality responses to questions
    • Metric/Metrics: Response Quality
    • Range, Type: Percentage, greater than 80%
    • Recommendation: Train Model to use correct terminology and concepts
    • Severity: Major
    • Class: BIG
    • Category: Questions and Answers
    • Cause: Model is not using the right terminology or concepts to respond to the question
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low response quality metric in models. This alert is used for models that specialize in providing factual information and answering questions.

 

Execution Accuracy Alert

  • Name: Model’s execution accuracy is too low
    • Description: Model’s code cannot be executed well
    • Metric/Metrics: Execution Accuracy
    • Range, Type: Percentage, greater than 70%
    • Recommendation: Train model to provide code responses that can run without errors
    • Severity: Major
    • Class: MBPP
    • Category: Model Accuracy
    • Cause: Model’s code samples contain too many coding errors
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low execution accuracy metric in models. This alert is used for models that specialize in providing code for given prompts.
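
Execution accuracy can be estimated by running each generated snippet and counting the error-free runs. The sketch below does this with a subprocess call; it is illustrative only, and in practice model-generated code should be executed in an isolated sandbox, not directly on the host.

```python
import subprocess
import sys

def execution_accuracy(code_snippets, timeout_s=5):
    """Fraction of generated snippets that run to completion without errors."""
    ok = 0
    for snippet in code_snippets:
        try:
            result = subprocess.run(
                [sys.executable, "-c", snippet],
                capture_output=True,
                timeout=timeout_s,
            )
            if result.returncode == 0:
                ok += 1
        except subprocess.TimeoutExpired:
            pass  # treat hangs as failures
    return ok / len(code_snippets) if code_snippets else 0.0

samples = ["print(sum(range(10)))", "print(1/0)"]  # the second snippet raises an error
print(execution_accuracy(samples))  # 0.5 -> below the 70% threshold, so the alert fires
```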

 

Pass @k Alert

  • Name: Model’s k pass value is too low
    • Description: Not enough code responses from the model are passing the test cases
    • Metric/Metrics: Pass @k
    • Range, Type: Percentage based on k value, higher percentages for higher k value
    • Recommendation: Train model to provide code responses that can pass all test cases
    • Severity: Major
    • Class: MBPP
    • Category: Code Answers
    • Cause: Code samples provided to the model do not pass enough test cases
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low pass @k metric in models. This alert is used for models that specialize in providing code for given prompts.
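
Pass@k is usually estimated with the unbiased estimator popularized with HumanEval: generate n samples per task, count the c samples that pass all tests, and compute 1 - C(n-c, k)/C(n, k). A minimal sketch of that standard estimator (not necessarily Alert AI's exact computation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of which pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per task, 30 of which pass all test cases.
print(round(pass_at_k(200, 30, 1), 3))   # pass@1 = 0.15
print(round(pass_at_k(200, 30, 10), 3))  # pass@10 ≈ 0.81, since any of 10 tries may pass
```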

 

Code Quality Alert

  • Name: Model’s code quality value is too low
    • Description: Code quality of the model's responses is poor
    • Metric/Metrics: Code Quality
    • Range, Type: Percentage, greater than 70%
    • Recommendation: Train model to provide better quality code responses
    • Severity: Major
    • Class: MBPP
    • Category: Code Answers
    • Cause: Code samples provided to the model are of poor quality
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low code quality metric in models. This alert is used for models that specialize in providing code responses for given prompts.

 

Sample Efficiency Alert

  • Name: Model’s sample efficiency is too low
    • Description: Model is not generating good code based on samples
    • Metric/Metrics: Sample Efficiency
    • Range, Type: Higher sample efficiency
    • Recommendation: Provide more samples for the model to train with
    • Severity: Major
    • Class: MBPP
    • Category: Model Efficiency
    • Cause: Model does not have enough samples in its dataset
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low sample efficiency metric in models. This alert is used for models that specialize in providing efficient code responses for given prompts.

 

Weighted Accuracy Alert

  • Name: Model’s weighted accuracy is too low
    • Description: The model's weighted number of correct answers is significantly lower than the total number of questions for the task
    • Metric/Metrics: Weighted Accuracy
    • Range, Type: Percentage, greater than 70%
    • Recommendation: Train model to learn information from subjects better
    • Severity: Major
    • Class: MMLU
    • Category: Model Accuracy
    • Cause: Model did not learn the sample dataset well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low weighted accuracy metric in models. This alert is used for models that specialize in answering questions and providing factual information.

 

Subject-wise Accuracy Alert

  • Name: Model’s Subject-wise Accuracy is too low
    • Description: Model is not accurate when answering questions from one or more subjects
    • Metric/Metrics: Subject-wise Accuracy
    • Range, Type: Percentage, greater than 65%
    • Recommendation: Train model to learn information from specific subjects better
    • Severity: Major
    • Class: MMLU
    • Category: Model Accuracy
    • Cause: Model did not learn specific subjects from the sample dataset well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low subject-wise accuracy metric in models. This alert is used for models that specialize in answering questions, showing knowledge in multiple fields, and providing factual information.

 

Macro-average Accuracy Alert

  • Name: Model’s Macro-average Accuracy is too low
    • Description: Model is not accurate for each task disregarding number of questions
    • Metric/Metrics: Macro-average Accuracy
    • Range, Type: Percentage, greater than 70%
    • Recommendation: Train model to learn information from subjects better
    • Severity: Major
    • Class: MMLU
    • Category: Model Accuracy
    • Cause: Model did not learn the sample dataset well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low macro-average accuracy metric in models. This alert is used for models that specialize in answering questions, showing knowledge in multiple fields, and providing factual information.

 

Micro-average Accuracy Alert

  • Name: Model’s Micro-average Accuracy is too low
    • Description: Model is not accurate for each task disregarding content of tasks
    • Metric/Metrics: Micro-average Accuracy
    • Range, Type: Percentage, greater than 70%
    • Recommendation: Train model to learn information from subjects better
    • Severity: Major
    • Class: MMLU
    • Category: Model Accuracy
    • Cause: Model did not learn the sample dataset well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low micro-average accuracy metric in models. This alert is used for models that specialize in answering questions, showing knowledge in multiple fields, and providing factual information.
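
The difference between macro- and micro-average accuracy comes down to whether each task or each question gets equal weight. A minimal sketch with hypothetical task names:

```python
def macro_micro_accuracy(per_task_results):
    """per_task_results: {task: (num_correct, num_questions)}."""
    # Macro-average: mean of per-task accuracies, each task weighted equally.
    macro = sum(c / n for c, n in per_task_results.values()) / len(per_task_results)
    # Micro-average: pooled accuracy over all questions, regardless of task.
    total_correct = sum(c for c, _ in per_task_results.values())
    total_questions = sum(n for _, n in per_task_results.values())
    micro = total_correct / total_questions
    return macro, micro

results = {"algebra": (60, 100), "law": (45, 50), "chemistry": (20, 40)}
print(macro_micro_accuracy(results))  # macro ≈ 0.667, micro ≈ 0.658
```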

 

Exact Match Alert

  • Name: Not enough answers from model match correct answers for questions
    • Description: Model is not making enough correct responses
    • Metric/Metrics: Exact Match
    • Range, Type: Percentage, greater than 75%
    • Recommendation: Train model with better dataset
    • Severity: Major
    • Class: TriviaQA, RAGAs
    • Category: Response matching
    • Cause: Model did not learn from the training dataset well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low number of exact matches in a model. This alert is used for models that specialize in answering questions and providing correct information.
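
Exact match is typically computed after light normalization (lowercasing, stripping punctuation and extra whitespace) so that superficial formatting differences do not count as misses. A minimal sketch; some exact-match variants also strip articles:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace (a common EM normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match_rate(predictions, references):
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["The Eiffel Tower", "1969", "london"]
refs = ["Eiffel Tower?", "1969", "London"]
print(exact_match_rate(preds, refs))  # 2/3, below the 75% threshold, so the alert fires
```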

 

Precision Alert

  • Name: Model’s precision value is too low
    • Description: Model’s number of positive predictions out of predicted instances is too low
    • Metric/Metrics: Precision
    • Range, Type: Percentage, greater than 80%
    • Recommendation: Train model to better understand question and to better gather data from training dataset
    • Severity: Major
    • Class: TriviaQA, MLFlow LLM Evaluate, RAGAs, Arize Phoenix, DeepEval
    • Category: Model Accuracy
    • Cause: Model did not gather information or understand questions well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low precision metric in models. This alert is used for models that specialize in answering questions and providing predictions.

 

Recall Alert

  • Name: Model’s recall value is too low
    • Description: Model’s number of positive predictions out of actual instances is too low
    • Metric/Metrics: Recall
    • Range, Type: Percentage, greater than 80%
    • Recommendation: Train model to better understand question and to better gather data from training dataset
    • Severity: Major
    • Class: TriviaQA, MLFlow LLM Evaluate, RAGAs, Arize Phoenix, DeepEval
    • Category: Model Accuracy
    • Cause: Model did not gather information or understand questions well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low recall metric in models. This alert is used for models that specialize in answering questions and providing predictions.

 

Answer Length Alert

  • Name: Model’s answer length is too short
    • Description: Model’s responses to given question is too short
    • Metric/Metrics: Answer length
    • Range, Type: around average length of answers
    • Recommendation: Train model to provide longer answers for responses
    • Severity: Major
    • Class: TriviaQA
    • Category: Question and Answer
    • Cause: Sample dataset given to model provides too many short answers
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a short answer length metric in models. This alert is used for models that specialize in providing long answers.

 

BLEU Score Alert 

  • Name: Model’s BLEU score is too low
    • Description: Model’s response in comparison to reference texts is not good quality
    • Metric/Metrics: BLEU Score
    • Range, Type: Decimal, between 0 and 1, greater than 0.75
    • Recommendation: Train model to understand reference texts better
    • Severity: Major
    • Class: MLFlow LLM Evaluate, RAGAs, DeepEval
    • Category: Reference texts
    • Cause: Model did not learn from training reference texts well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low BLEU metric in models. This alert is used for models that specialize in referencing texts.
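
As one illustration, BLEU can be computed at the sentence level with NLTK; smoothing is applied because short outputs often have zero higher-order n-gram matches. This sketch assumes the nltk package is installed and is not tied to any particular evaluation pipeline described here.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # tokenized reference text(s)
candidate = ["the", "cat", "is", "on", "the", "mat"]      # tokenized model output

score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(score)  # a score below 0.75 would trigger the BLEU alert
```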

 

ROUGE Score Alert

  • Name: Model's ROUGE score is too low
    • Description: Model’s response in comparison to reference texts is not similar
    • Metric/Metrics: ROUGE Score
    • Range, Type: Decimal, between 0 and 1, greater than 0.75
    • Recommendation: Train model to understand reference texts better
    • Severity: Major
    • Class: MLFlow LLM Evaluate, RAGAs, DeepEval
    • Category: Reference texts
    • Cause: Model did not learn from training reference texts well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low ROUGE metric in models. This alert is used for models that specialize in referencing texts.
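
Similarly, ROUGE can be computed with the rouge-score package; this sketch assumes that package is installed and is only an illustration of the metric.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the quick brown fox jumps over the lazy dog",   # reference text
    "a quick brown fox leaps over a lazy dog",       # model output
)
print(scores["rougeL"].fmeasure)  # below 0.75 would trigger the ROUGE alert
```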

 

Mean Square Error Alert

  • Name: Model’s Mean Square Error is too high
    • Description: Model’s predicted value compared to actual value has a large difference
    • Metric/Metrics: Mean Square Error
    • Range, Type: Decimal, less than 0.05
    • Recommendation: Train model to understand reference texts better
    • Severity: Major
    • Class: MLFlow LLM Evaluate, Arize Phoenix, DeepEval
    • Category: Reference texts
    • Cause: Model did not learn from training reference texts well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a high Mean Square Error metric in models. This alert is used for models that specialize in referencing texts.
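
Mean square error is the average of the squared differences between actual and predicted values, so a handful of large misses can dominate it. A minimal sketch:

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [0.9, 0.4, 0.7, 1.0]
predicted = [0.8, 0.5, 0.9, 0.7]
print(mean_squared_error(actual, predicted))  # 0.0375, below the 0.05 threshold, so no alert here
```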

 

Retrieval Accuracy Alert

  • Name: Model’s retrieval accuracy is too low
    • Description: Model is not providing useful information based on retrieval texts
    • Metric/Metrics: Retrieval Accuracy
    • Range, Type: Percentage, greater than 90%
    • Recommendation: Train model to understand how to retrieve texts better
    • Severity: Major
    • Class: RAGAs
    • Category: Reference texts, Model accuracy
    • Cause: Model is not retrieving texts well
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low retrieval accuracy metric in models. This alert is used for models that specialize in referencing texts.

 

Area Under the Curve Alert

  • Name: Model’s Area Under the Curve is too low
    • Description: Model’s cannot distinguish classes well
    • Metric/Metrics: Area Under the Curve
    • Range, Type: Decimal, between 0 and 1, greater than 0.9
    • Recommendation: Train model to better detect classes
    • Severity: Major
    • Class: Arize Phoenix, DeepEval
    • Category: Model Evaluation
    • Cause: Model cannot detect unique features for each class
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low Area Under the Curve (AUC) metric in models. This alert is used for models that perform classification and need to distinguish between classes.
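
ROC AUC summarizes how well the model's scores rank positive examples above negative ones: 0.5 is random and 1.0 is perfect separation. A minimal sketch using scikit-learn, assuming that package is installed:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1]                        # actual class labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.65]       # model's predicted probabilities

auc = roc_auc_score(y_true, y_score)
print(auc)  # ≈ 0.92 here; a value below 0.90 would trigger the Area Under the Curve alert
```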

 

Latency Alert

  • Name: Model’s Latency value is too high
    • Description: Model is taking too long to make predictions
    • Metric/Metrics: Latency
    • Range, Type: Integer, less than 200 ms
    • Recommendation: Improve model’s algorithm to improve latency time
    • Severity: Major
    • Class: Arize Phoenix, DeepEval
    • Category: Model runtime
    • Cause: Model algorithm is too inefficient
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a high latency metric in models. This alert is used in the majority of models to check if the model is running quickly and for monitoring model performance.
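
Latency is typically measured as the wall-clock time of the inference call, averaged over several runs. In the sketch below, generate is a placeholder for whatever inference function the deployment exposes:

```python
import time

def measure_latency_ms(generate, prompt, runs=10):
    """Average wall-clock latency of a generate(prompt) call, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)

# Stand-in for a real model call; replace with the deployment's inference function.
fake_generate = lambda prompt: prompt.upper()
avg_ms = measure_latency_ms(fake_generate, "hello")
print(avg_ms > 200)  # False here; True would mean the latency alert fires
```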

 

Uptime Alert

  • Name: Model’s uptime value is too low
    • Description: Model is not running all the time
    • Metric/Metrics: Uptime
    • Range, Type: Percentage, greater than 99.9%
    • Recommendation: Fix failures in model to improve model’s performance
    • Severity: Major
    • Class: Arize Phoenix
    • Category: Model runtime
    • Cause: Model is not operating due to failures in model
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a low uptime metric in models. This alert is used in the majority of models to check whether the model is available and to monitor model reliability.

 

Data Drift Alert

  • Name: Model's data drift value is too high
    • Description: Input data changes too much over time
    • Metric/Metrics: Data Drift
    • Range, Type: Minimal drift
    • Recommendation: Provide data to the model that is consistent with previously provided data
    • Severity: Major
    • Class: Arize Phoenix
    • Category: Data input
    • Cause: Data provided to model is too different from previous data
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a high data drift metric in models. This alert is used in models that specialize in data collection and for providing predictions.
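
Data drift is often quantified by comparing the distribution of current inputs against a training-time baseline; the population stability index (PSI) is one common choice. The sketch below, including the rule-of-thumb thresholds in the docstring, is a generic illustration rather than Alert AI's drift calculation.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline sample and a current sample of a numeric feature.

    Rule of thumb (an assumption, not an Alert AI threshold): < 0.1 minimal drift,
    0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) and division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

baseline = np.random.normal(0.0, 1.0, 5_000)   # feature values seen at training time
current = np.random.normal(0.4, 1.2, 5_000)    # shifted production distribution
print(population_stability_index(baseline, current))
```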

 

Model Drift Alert 

  • Name: Model’s model drift value too high
    • Description: Model’s predictions are not consistent
    • Metric/Metrics: Model Drift
    • Range, Type: Minimal drift
    • Recommendation: Train model to better detect trends in dataset
    • Severity: Major
    • Class: Arize Phoenix
    • Category: Model predictions
    • Cause: Model is not able to understand trends in dataset
    • Provider{Library}: Metric Evaluation

This alert is used for detecting a high model drift metric in models. This alert is used in models that specialize in data collection and for providing predictions.

Evaluations and Benchmarks

The following summarizes each evaluation: its description, general insights, risks/security/vulnerabilities, its metrics and their types, and the recommended value ranges.

TruthfulQA

  • Description: Benchmark that measures how truthful an LLM is when generating answers to questions; 38 categories with 817 questions; questions are crafted in such a way that humans would answer them incorrectly due to misconceptions or false beliefs
  • General Insights: Determines how truthful a model is; used in the medical, legal, and educational fields to check for factual correctness; balancing informativeness and truthfulness is a major challenge, since isolating incorrect information is tricky
  • Risks/Security/Vulnerabilities: Larger models are less truthful but more informative; the correct answer is hard to determine because questions are worded in a tricky way; the model can provide inappropriate or offensive answers; the model can make up facts based on misinterpretations
  • Metrics:
    • Truthfulness (Percentage): number of accurate answers the model made out of the total accurate answers
    • Informative Accuracy (Percentage): number of informative answers the model made out of the total accurate answers
    • Correctness (Percentage): number of correct contextual answers the model made out of the total accurate answers
    • Consistency (Percentage): number of consistent answers the model made with respect to the given question out of the total accurate answers
    • Coverage (Percentage): number of topics the model can answer well out of the total topics
    • Calibration: how confident the model is relative to its actual accuracy
  • Recommended Values: Truthfulness >= 85%; Informative Accuracy >= 80%; Correctness >= 85%; Consistency >= 90%; Coverage >= 90%; Calibration <= 0.05

HellaSwag

  • Description: Challenge dataset for evaluating commonsense; used to test NLP models; contains a list of questions with multiple-choice answers
  • General Insights: Determines how much commonsense a model has; used to enhance a model's ability to understand humans and logical thinking; used to generate text that meets human expectations; it is difficult for a model to improve on and understand the context of a given scenario; the model may have difficulty accurately predicting the next steps for a given situation
  • Risks/Security/Vulnerabilities: The model can misinterpret the context of the questions; the model can provide inappropriate or offensive answers; the model may not be able to differentiate different types of commonsense
  • Metrics:
    • Accuracy (Percentage): number of correct answers chosen by the model out of the total number of questions
    • Perplexity (Integer): the model's ability to predict the probability distribution of data compared to the data's actual distribution
    • Log-Likelihood (Decimal): log-probability of the model choosing the correct answer
    • F1 Score (Decimal between 0 and 1): accuracy of the model accounting for both false positives and false negatives
  • Recommended Values: Accuracy >= 85%; Perplexity <= 10; Log-Likelihood: higher is better; Log-Probability: higher is better; F1 Score >= 85%

BIG

  • Description: Intended to probe LLMs and extend the application of their future capabilities; includes more than 200 tasks; provides a view of the model's performance across various tasks
  • General Insights: Determines how much information is gained from the model's responses; used for programs and applications, including AI assistants and educational tools, that require the ability to learn new information; the model may have difficulty giving responses that are coherent and relevant while remaining informative and high quality
  • Risks/Security/Vulnerabilities: Excessive number of tasks (more than 200); excessive variance of performance across the various tasks; insufficient probing of LLM future capabilities
  • Metrics:
    • Information Gain (Bits): the amount of new information the model has learned, based on the model's response
    • Response Quality (Integer between 1 and 5): relevance, informativeness, and clarity of the model's response
    • Accuracy (Percentage): number of correct answers chosen by the model out of the total; how factually correct the model's answers are
    • Coverage (Percentage): number of topics the model can answer well out of the total topics
    • Consistency (Percentage): number of consistent answers the model made with respect to the given question out of the total accurate answers
  • Recommended Values: Information Gain >= 80%; Response Quality >= 80%; Accuracy >= 80%; Coverage >= 80%; Consistency >= 85%

MBPP

  • Description: 1,000 crowd-sourced Python programming problems; solvable by entry-level programmers; problems consist of a task description, a code solution, and 3 automated test cases
  • General Insights: Determines how well a model can generate correct Python code; used for coding assistants, educational tools, and automated code generation; the model may not generate code that is syntactically correct, logically sound, and meets the specific requirements
  • Risks/Security/Vulnerabilities: Not enough test cases to assess whether the model's answer is correct; the model may generate malicious code; the model can generate code that is logically correct but semantically incorrect; the code the model generates can contain vulnerabilities
  • Metrics:
    • Correctness (Percentage): number of correct, functional Python code answers the model made out of the total number of tasks
    • Execution Accuracy (Percentage): number of Python code answers that run without errors out of the total number of code snippets the model made
    • Pass@k (Percentage): probability that at least one of the k Python code answers the model made passes all the test cases for a given task
    • Code Quality (Integer between 1 and 5): score that assesses the readability, maintainability, and quality of the code the model made; a higher score indicates better quality
    • Sample Efficiency: number of samples the model needs to generate a correct solution for the given task; a lower value means the model is more efficient
  • Recommended Values: Correctness >= 75%; Execution Accuracy >= 70%; Pass@k: percentage increases as the k value grows; Code Quality >= 70%; Sample Efficiency: high (few samples needed)

MMLU

  • Description: Measures knowledge acquired during pretraining by evaluating models specifically in zero-shot and few-shot settings; more challenging, and similar to how humans are evaluated; covers 57 subjects across STEM and other fields; ranges in difficulty from elementary level to advanced professional level; the granularity and breadth of subjects are ideal for identifying a model's blind spots
  • General Insights: Determines the performance of a model by testing its knowledge across various subjects; used to develop models that can handle questions and queries on various topics; it is difficult for a model to achieve good performance across all subjects
  • Risks/Security/Vulnerabilities: The model can make up facts based on misinterpretations; the model may leak private data if that data is not properly secured; the model may struggle to generalize across tasks; the model is vulnerable to adversarial attacks
  • Metrics:
    • Accuracy (Percentage): number of correct answers chosen by the model out of the total number of questions
    • Weighted Accuracy (Percentage): number of correct answers chosen by the model, weighted by the total number of questions for each task
    • Subject-wise Accuracy (Percentage): how accurate the model is when answering questions from a specific subject
    • Macro-average Accuracy (Percentage): how accurate the model is for each task, disregarding the number of questions per task; useful for understanding performance consistency across tasks
    • Micro-average Accuracy (Percentage): how accurate the model is over all questions, disregarding which task a question belongs to; useful for measuring the model's overall performance across tasks
  • Recommended Values: Accuracy >= 70%; Weighted Accuracy >= 70%; Subject-wise Accuracy >= 65% per subject; Macro-average Accuracy >= 70%; Micro-average Accuracy >= 70%

TriviaQA

  • Description: Reading comprehension dataset; contains over 650,000 question-answer-evidence triples; contains 95,000 question-answer pairs; contains human-verified and machine-generated QA subsets
  • General Insights: Determines how well a model can answer trivia questions; used to create applications such as quiz games, trivia applications, and AI assistants; the model may have difficulty providing precise and knowledgeable answers while aiming for high accuracy
  • Risks/Security/Vulnerabilities: The model can provide inappropriate or offensive answers; the model can make up facts based on misinterpretations; the model may reveal private data if private data is included in the training set
  • Metrics:
    • Exact Match (Percentage): number of answers from the model that exactly match the correct answers out of the total number of questions; helps measure how precise the model's correct answers are
    • F1 Score (Decimal between 0 and 1): accuracy of the model accounting for both false positives and false negatives
    • Precision (Percentage): number of correct positive predictions the model makes out of the total predicted positive instances
    • Recall (Percentage): number of correct positive predictions the model makes out of the total actual positive instances
    • Answer Length (Number of Words or Characters): average length of the model's answers; helps determine whether the model gives concise or long answers
  • Recommended Values: Exact Match >= 75%; F1 Score >= 80%; Precision >= 80%; Recall >= 80%; Answer Length: around the average length of reference answers

HumanEval

  • Description: Used to measure functional correctness when synthesizing programs from docstrings; contains 164 problems; covers programming, language comprehension, algorithms, simple mathematics, and software interview questions
  • General Insights: Determines how well a model can generate correct and functional code for a programming question; used for coding assistants, educational tools, and automated code generation; the model may have difficulty providing code that is reliable and efficient for programming questions
  • Risks/Security/Vulnerabilities: The model may generate malicious code; the model can generate code that is logically correct but semantically incorrect; the code the model generates can contain vulnerabilities
  • Metrics:
    • Pass@k (Percentage): probability that at least one of the k code answers the model made passes all the test cases for a given task
    • Correctness (Percentage): number of correct, functional code answers the model made out of the total number of tasks
    • Execution Accuracy (Percentage): number of code answers that run without errors out of the total number of code snippets the model made
    • Code Quality (Integer between 1 and 5): score that assesses the readability, maintainability, and quality of the code the model made; a higher score indicates better quality
    • Sample Efficiency: number of samples the model needs to generate a correct solution for the given task; a lower value means the model is more efficient
  • Recommended Values: Pass@k: percentage increases as the k value grows; Correctness >= 70%; Execution Accuracy >= 65%; Code Quality >= 70%; Sample Efficiency: high (few samples needed)

MLFlow LLM Evaluate

  • Description: Open-source library; its evaluation functionality is comprised of three main components: the model to evaluate, the metrics, and the evaluation data
  • General Insights: Determines how well a model performs across various metrics; useful for monitoring and improving a model's performance; the model may have difficulty balancing multiple metrics to achieve good performance
  • Risks/Security/Vulnerabilities: The model may reveal private data if private data is included in the training set; models that are evaluated may be vulnerable to adversarial attacks; models with known vulnerabilities may be insecure
  • Metrics:
    • Accuracy (Percentage): number of correct answers chosen by the model out of the total number of questions
    • Precision (Percentage): number of correct positive predictions the model makes out of the total predicted positive instances
    • Recall (Percentage): number of correct positive predictions the model makes out of the total actual positive instances
    • F1 Score (Decimal between 0 and 1): accuracy of the model accounting for both false positives and false negatives
    • Perplexity (Integer): the model's ability to predict the probability distribution of data compared to the data's actual distribution
    • BLEU Score (Decimal between 0 and 1): measures the quality of text the model generates by comparing the model's text with one or more reference texts
    • ROUGE Score (Decimal between 0 and 1): measures how similar the generated text and reference text are, used to evaluate summarization and translation; higher scores mean the model summarizes and translates better
    • Mean Square Error (Decimal): average of the squared differences between the model's predicted and actual values
  • Recommended Values: Accuracy >= 85%; Precision >= 85%; Recall >= 85%; F1 Score >= 85%; BLEU Score >= 0.75; ROUGE Score >= 0.75; Mean Square Error <= 0.05
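
As one illustration of how these metrics are produced in practice, MLflow 2.x exposes an mlflow.evaluate API that scores a model against an evaluation dataset. The sketch below is an assumption-laden example: the model URI, column names, and data are hypothetical, and the exact metrics returned depend on the MLflow version and model type.

```python
import mlflow
import pandas as pd

# Hypothetical evaluation set; column names are illustrative only.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "What is a vector database?"],
        "ground_truth": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "A vector database stores embeddings for similarity search.",
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my-qa-model/1",      # hypothetical registered model URI
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(results.metrics)  # e.g. exact-match and token-level scores, depending on version
```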

RAGAs

  • Description: Open-source library; framework that helps evaluate pipelines involving RAG, a class of LLM applications that use external data to augment the LLM's context
  • General Insights: Determines how well a model can retrieve documents and provide accurate answers; used for applications involving customer service, question-answer retrieval, and information gathering; the model may have difficulty doing both information retrieval and coherent answer generation for responses that require accuracy
  • Risks/Security/Vulnerabilities: The model may misinterpret information, leading to misleading output; if the model has access to confidential data, it may leak that information; the model can be manipulated to provide harmful information
  • Metrics:
    • Accuracy (Percentage): number of correct answers chosen by the model out of the total number of questions
    • Exact Match (Percentage): number of answers from the model that exactly match the correct answers out of the total number of questions; measures the correctness of the model's answers
    • F1 Score (Decimal between 0 and 1): accuracy of the model accounting for both false positives and false negatives
    • Precision (Percentage): number of correct positive predictions the model makes out of the total predicted positive instances
    • Recall (Percentage): number of correct positive predictions the model makes out of the total actual positive instances
    • BLEU Score (Decimal between 0 and 1): measures the quality of text the model generates by comparing the model's text with one or more reference texts
    • ROUGE Score (Decimal between 0 and 1): measures how similar the generated text and reference text are, used to evaluate summarization and translation; higher scores mean the model summarizes and translates better
    • Retrieval Accuracy (Percentage): how effective the model is at providing useful information using the documents it retrieves out of the total documents
  • Recommended Values: Accuracy >= 85%; Exact Match >= 80%; F1 Score >= 85%; Precision >= 90%; Recall >= 90%; BLEU Score >= 0.75; ROUGE Score >= 0.75; Retrieval Accuracy >= 90%

Arize Phoenix

  • Description: Open-source library; designed for experimentation, evaluation, and troubleshooting; allows users to easily visualize data, evaluate model performance, track issues, and export data for improvement
  • General Insights: Monitors and evaluates deployed models by evaluating metrics including accuracy, precision, and data drift; useful for monitoring and enhancing a model's performance; the model may have difficulty detecting and minimizing certain metric issues, including data drift and model drift
  • Risks/Security/Vulnerabilities: Monitoring may be over-sensitive in detecting issues in a model, leading to false alarms; private data can be leaked from the model if it is not secured properly; changes in the input distribution can lead to the model performing poorly
  • Metrics:
    • Accuracy (Percentage): number of correct predictions chosen by the model out of the total number of predictions
    • Precision (Percentage): number of correct positive predictions the model makes out of the total predicted positive instances
    • Recall (Percentage): number of correct positive predictions the model makes out of the total actual positive instances
    • F1 Score (Decimal between 0 and 1): accuracy of the model accounting for both false positives and false negatives
    • Mean Square Error (Decimal): average of the squared differences between the model's predicted and actual values
    • Area Under the Curve (Decimal between 0 and 1): measures the model's ability to distinguish classes; a higher AUC means the model is performing better
    • Latency (Integer, in seconds or milliseconds): how long the model takes to make a prediction; a lower value indicates the model is responding faster
    • Uptime (Percentage): proportion of time the model is operating and available out of the total time; a higher value means the model is more reliable and available
    • Data Drift: measures how much the input data changes over time compared to the data the model used for training; a lower score means the input data is very similar to the training data
    • Model Drift: measures how much the model's predictions change over time; a lower score means the model's predictions are consistent
  • Recommended Values: Accuracy >= 85%; Precision >= 85%; Recall >= 85%; F1 Score >= 85%; Mean Square Error <= 0.05; Area Under the Curve >= 0.90; Latency <= 200 ms; Uptime >= 99.9%; Data Drift: minimal drift; Model Drift: minimal drift

DeepEval

  • Description: Open-source library; makes it easy to build and iterate on LLM applications; built on the following principles: unit-testing LLM outputs; plug-and-use support for 14 or more LLM evaluation metrics; synthetic dataset generation with evolution techniques; simple, customizable metrics covering all use cases; real-time evaluations in production
  • General Insights: Evaluates language models using multiple performance metrics; useful for comprehensive model assessments; the model may have difficulty balancing different metrics to achieve a model that is reliable and robust
  • Risks/Security/Vulnerabilities: Private data can be leaked from the model if it is not secured properly; data can be leaked if the test dataset is not isolated from the training dataset; models that are evaluated may be vulnerable to adversarial attacks
  • Metrics:
    • Accuracy (Percentage): number of correct predictions chosen by the model out of the total number of predictions
    • Precision (Percentage): number of correct positive predictions the model makes out of the total predicted positive instances
    • Recall (Percentage): number of correct positive predictions the model makes out of the total actual positive instances
    • F1 Score (Decimal between 0 and 1): accuracy of the model accounting for both false positives and false negatives
    • Perplexity (Integer): the model's ability to predict the probability distribution of data compared to the data's actual distribution
    • BLEU Score (Decimal between 0 and 1): measures the quality of text the model generates by comparing the model's text with one or more reference texts
    • ROUGE Score (Decimal between 0 and 1): measures how similar the generated text and reference text are, used to evaluate summarization and translation; higher scores mean the model summarizes and translates better
    • Mean Square Error (Decimal): average of the squared differences between the model's predicted and actual values
    • Area Under the Curve (Decimal between 0 and 1): measures the model's ability to distinguish classes; a higher AUC means the model is performing better
    • Latency (Integer, in seconds or milliseconds): how long the model takes to make a prediction; a lower value indicates the model is responding faster
  • Recommended Values: Accuracy >= 85%; Precision >= 85%; Recall >= 85%; F1 Score >= 85%; Perplexity <= 20; BLEU Score >= 0.75; ROUGE Score >= 0.75; Mean Square Error <= 0.05; Area Under the Curve >= 0.90; Latency <= 200 ms

Conclusion

 

Alert AI 

What is at stake for AI & Gen AI in business? We are addressing exactly that: a Generative AI security solution for Healthcare, Pharma, Insurance, Life Sciences, Retail, Banking, Finance, and Manufacturing.

Alert AI is an end-to-end, interoperable Generative AI security platform that helps enhance the security of Generative AI applications and workflows: against potential adversaries, model vulnerabilities, privacy, copyright and legal exposures, sensitive information leaks, intelligence and data exfiltration, infiltration at training and inference, and integrity attacks in AI applications; with anomaly detection and enhanced visibility in AI pipelines; and with forensics, audit, and AI governance across the AI footprint.

Despite the Security challenges, the promise of large language models is enormous.
We are committed to enabling industries and enterprises to reap the benefits of large language models.



 

 

 


 

Alert AI  360 view and Detections

  • Alerts and Threat detection in AI footprint
  • LLM & Model Vulnerabilities Alerts
  • Adversarial ML  Alerts
  • Prompt, response security and Usage Alerts
  • Sensitive content detection Alerts
  • Privacy, Copyright and Legal Alerts
  • AI application Integrity Threats Detection
  • Training, Evaluation, Inference Alerts
  • AI visibility, Tracking & Lineage Analysis Alerts
  • Pipeline analytics Alerts
  • Feedback loop
  • AI Forensics
  • Compliance Reports

 

End-to-End GenAI Security

  • Data alerts
  • Model alerts
  • Pipeline alerts
  • Evaluation alerts
  • Training alerts
  • Inference alerts
  • Model Vulnerabilities
  • LLM vulnerabilities
  • Privacy
  • Threats
  • Resources
  • Environments
  • Governance and compliance

 

Enhance, Optimize, Manage Generative AI security of Business applications

  • Manage LLM, Model, Pipeline, Prompt Vulnerabilities
  • Enhance Privacy
  • Ensure integrity
  • Optimize domain-specific security guardrails
  • Discover Rogue pipelines, models, Rogue prompts
  • Block Hallucination and Misinformation attacks
  • Block harmful content generation from prompts
  • Block Prompt Injection
  • Detect robustness risks,  perturbation attacks
  • Detect output re-formatting attacks
  • Stop information disclosure attacks
  • Track to source of origin training Data
  • Detect Anomalous behaviors
  • Zero-trust LLMs
  • Data protect GenAI applications
  • Secure access to tokenizers
  • Prompt Intelligence Loss prevention
  • Enable domain-specific policies, guardrails
  • Get Recommendations
  • Review issues
  • Forward  AI incidents to SIEM
  • Audit reports — AI Forensics
  • Findings, Sources, Posture Management.
  • Detect and Block Data leakage breaches
  • Secure access with Managed identities

 

Security Culture of 360 | Embracing Change.

In the shifting paradigm of business heralded by the rise of Generative AI,

360 is a culture that emphasizes security in a time of great transformation.

Our commitment to our customers is represented by our culture of 360.

Organizations need to responsibly assess and enhance the security of their AI environments (development, staging, production) for Generative AI applications and workflows in business.

Despite the Security challenges, the promise of Generative AI is enormous.

We are committed to enhancing the security of Generative AI applications and workflows so that industries and enterprises can reap the benefits.
