📊 LLM Evaluation Framework for Professional Content Rewriting

Evaluate the quality of LLM-generated content using multiple metrics, each normalized so they can be combined into a single Hybrid Score.

📥 Input Options

  • Input Mode

⚙️ Configuration

  • Select Model
  • Select Prompt Template

📄 Text Comparison

📈 Evaluation Metrics

📌 Overall Assessment

Hybrid Score Interpretation

The Hybrid Score combines the individual evaluation metrics into a single normalized score; a sketch of the computation follows the scale below:

  • 0.85 or above: Outstanding performance (A) - ready for professional use
  • 0.70-0.85: Strong performance (B) - good quality, minor improvements needed
  • 0.50-0.70: Adequate performance (C) - usable but needs refinement
  • 0.30-0.50: Weak performance (D) - requires significant revision
  • Below 0.30: Poor performance (F) - likely needs a complete rewrite
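
As an illustration, here is a minimal Python sketch of one way the weighted combination and grading could work. The metric names, weights, and the `hybrid_score` function are assumptions made for this example, not the framework's actual implementation; only the grade thresholds come from the scale above.

```python
# Illustrative sketch only: the metric names and weights below are assumptions,
# not the framework's actual configuration.
from typing import Dict, Tuple

# Hypothetical weights for combining per-metric scores (each already 0-1).
WEIGHTS = {
    "answer_relevancy": 0.20,
    "faithfulness":     0.20,
    "geval":            0.20,
    "bertscore":        0.15,
    "rouge":            0.10,
    "bleu":             0.05,
    "meteor":           0.10,
}

def hybrid_score(scores: Dict[str, float]) -> Tuple[float, str]:
    """Weighted average of normalized metric scores, mapped to the letter
    grades from the scale above."""
    total_weight = sum(WEIGHTS[name] for name in scores)
    combined = sum(WEIGHTS[name] * s for name, s in scores.items()) / total_weight

    if combined >= 0.85:
        grade = "A"
    elif combined >= 0.70:
        grade = "B"
    elif combined >= 0.50:
        grade = "C"
    elif combined >= 0.30:
        grade = "D"
    else:
        grade = "F"
    return combined, grade

# Example: yields approximately (0.76, "B") for these per-metric scores.
print(hybrid_score({"answer_relevancy": 0.90, "faithfulness": 0.80, "geval": 0.75,
                    "bertscore": 0.70, "rouge": 0.60, "bleu": 0.50, "meteor": 0.80}))
```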

Key Metrics Explained

| Metric | What It Measures | Why It Matters |
| --- | --- | --- |
| AnswerRelevancy | Is the output on-topic with the input? | Does the output stay focused despite messy input? |
| Faithfulness | Are ALL facts preserved correctly? | Does it maintain accuracy when the input has encoding errors? |
| GEval | Overall quality assessment by another AI | How professional does the output appear? |
| BERTScore | Semantic similarity to the reference | How well does it capture the meaning of the cleaned text? |
| ROUGE | Content overlap with the reference | How much key information is preserved? |
| BLEU | Phrasing precision | How closely does the wording match a human-quality standard? |
| METEOR | Linguistic quality with synonyms | How natural does the cleaned output read? |
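
Of the metrics above, BERTScore, ROUGE, BLEU, and METEOR are reference-based and can be computed locally, while AnswerRelevancy, Faithfulness, and GEval are LLM-judged metrics (the names match DeepEval's, for example) and need a judge model rather than a reference text. Here is a minimal sketch of the reference-based metrics, assuming the bert-score, rouge-score, and nltk packages are installed; the example sentences and whitespace tokenization are illustrative only.

```python
# Sketch of the reference-based metrics, assuming bert-score, rouge-score,
# and nltk are installed (METEOR also needs nltk's "wordnet" data and a
# recent nltk version that accepts pre-tokenized input).
from bert_score import score as bert_score
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

# Illustrative strings; in practice these come from the model output and
# the cleaned reference text.
candidate = "The quarterly report was revised and sent to all stakeholders."
reference = "The quarterly report was corrected and distributed to every stakeholder."

# BERTScore: semantic similarity from contextual embeddings (P, R, F1 tensors).
_, _, f1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", f1.mean().item())

# ROUGE-L: longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L F1:", scorer.score(reference, candidate)["rougeL"].fmeasure)

# BLEU: n-gram precision against the reference, smoothed for short texts.
ref_tokens, cand_tokens = reference.split(), candidate.split()
print("BLEU:", sentence_bleu([ref_tokens], cand_tokens,
                             smoothing_function=SmoothingFunction().method1))

# METEOR: unigram matching with stemming and WordNet synonyms.
print("METEOR:", meteor_score([ref_tokens], cand_tokens))
```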