# LLM Evaluation Framework for Professional Content Rewriting
Evaluate the quality of LLM-generated content using multiple metrics, each normalized to a common scale.
## Input Options
## Configuration
## Text Comparison
## Evaluation Metrics
## Overall Assessment
### Hybrid Score Interpretation
The Hybrid Score combines the individual evaluation metrics into a single score, with each metric first normalized to the same scale (a sketch of the combination follows this list):
- 0.85 or above: Outstanding performance (A); ready for professional use
- 0.70 to 0.85: Strong performance (B); good quality, minor improvements needed
- 0.50 to 0.70: Adequate performance (C); usable but needs refinement
- 0.30 to 0.50: Weak performance (D); requires significant revision
- Below 0.30: Poor performance (F); likely needs a complete rewrite
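The grading itself is straightforward once every metric is normalized to the [0, 1] range. Below is a minimal sketch of one way to implement the combination; the metric names, weights, and example scores are hypothetical, not the framework's actual configuration.

```python
# Hypothetical sketch: weighted combination of normalized metric scores
# and mapping onto the A-F bands listed above.

def hybrid_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metric scores, each assumed normalized to [0, 1]."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

def letter_grade(score: float) -> str:
    """Map a hybrid score onto the letter-grade bands described above."""
    if score >= 0.85:
        return "A"
    if score >= 0.70:
        return "B"
    if score >= 0.50:
        return "C"
    if score >= 0.30:
        return "D"
    return "F"

# Example values (hypothetical): judge-style metrics weighted highest
scores = {"answer_relevancy": 0.91, "faithfulness": 0.88, "bertscore": 0.82,
          "rouge": 0.67, "bleu": 0.41, "meteor": 0.73}
weights = {"answer_relevancy": 0.25, "faithfulness": 0.25, "bertscore": 0.20,
           "rouge": 0.10, "bleu": 0.10, "meteor": 0.10}

overall = hybrid_score(scores, weights)
print(f"Hybrid score: {overall:.2f} -> grade {letter_grade(overall)}")
```

Weighting relevancy and faithfulness above the n-gram metrics reflects that BLEU in particular can be low even for a good rewrite, since rewriting legitimately changes the wording.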
### Key Metrics Explained
| Metric | What It Measures | Why It Matters |
|---|---|---|
| AnswerRelevancy | Is the output on-topic with the input? | Does the output stay focused despite messy input? |
| Faithfulness | Are all facts preserved correctly? | Does the output maintain accuracy when the input has encoding errors? |
| GEval | Overall quality as assessed by an LLM judge | How professional does the output appear? |
| BERTScore | Semantic similarity to the reference | How well does the output capture the meaning of the cleaned text? |
| ROUGE | Content overlap with the reference | How much key information is preserved? |
| BLEU | Phrasing precision | How closely does the wording match a human-quality standard? |
| METEOR | Linguistic quality, including synonym matching | How natural does the cleaned output read? |
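BERTScore, ROUGE, BLEU, and METEOR are reference-based and can be computed locally. Below is a minimal sketch, assuming the open-source bert-score, rouge-score, and nltk packages; the sample sentences, the ROUGE-L variant, and the smoothing choice are illustrative, not the framework's exact configuration.

```python
# Hypothetical sketch: computing the reference-based metrics for one
# candidate/reference pair. The first bert_score call downloads a model.
import nltk
from bert_score import score as bert_score
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

reference = "The committee approved the budget after a brief discussion."   # cleaned text
candidate = "After a short discussion, the committee approved the budget."  # LLM output

# BERTScore: semantic similarity from contextual embeddings (returns P, R, F1)
_, _, f1 = bert_score([candidate], [reference], lang="en")
bert_f1 = f1.item()

# ROUGE-L: longest-common-subsequence overlap with the reference
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BLEU: n-gram precision; smoothing avoids zero scores on short texts
ref_tokens, cand_tokens = reference.split(), candidate.split()
bleu = sentence_bleu([ref_tokens], cand_tokens,
                     smoothing_function=SmoothingFunction().method1)

# METEOR: unigram matching with stemming and synonyms (expects tokenized input)
meteor = meteor_score([ref_tokens], cand_tokens)

print(f"BERTScore F1 {bert_f1:.3f} | ROUGE-L {rouge_l:.3f} | "
      f"BLEU {bleu:.3f} | METEOR {meteor:.3f}")
```

AnswerRelevancy, Faithfulness, and GEval, by contrast, are judged by another LLM (typically through an evaluation library such as DeepEval), so they require access to a judge model rather than a reference text alone.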