LLM Evaluation Framework for Professional Content Rewriting
Evaluate the quality of LLM-generated content using multiple metrics with proper normalization.
Input Options
Configuration
Text Comparison
Evaluation Metrics
Overall Assessment
Hybrid Score Interpretation
The Hybrid Score normalizes each evaluation metric to a common 0–1 scale and combines them into a single score (a computation sketch follows the list below):
- 0.85+: Outstanding performance (A) - ready for professional use
- 0.70-0.85: Strong performance (B) - good quality with minor improvements
- 0.50-0.70: Adequate performance (C) - usable but needs refinement
- 0.30-0.50: Weak performance (D) - requires significant revision
- <0.30: Poor performance (F) - likely needs complete rewriting
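A minimal sketch of how a hybrid score and letter grade could be computed, assuming every metric has already been normalized to the 0–1 range. The weights and metric names below are illustrative placeholders, not the framework's actual defaults.

```python
from typing import Dict

# Illustrative weights -- placeholders, not the framework's actual defaults.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "answer_relevancy": 0.20,
    "faithfulness": 0.25,
    "geval": 0.20,
    "bertscore_f1": 0.15,
    "rouge_l": 0.10,
    "bleu": 0.05,
    "meteor": 0.05,
}


def hybrid_score(metrics: Dict[str, float],
                 weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of metric scores, each assumed to already be in [0, 1]."""
    total_weight = sum(weights[name] for name in metrics if name in weights)
    if total_weight == 0:
        raise ValueError("No recognized metrics supplied")
    weighted = sum(weights[name] * value
                   for name, value in metrics.items() if name in weights)
    return weighted / total_weight


def letter_grade(score: float) -> str:
    """Map a hybrid score onto the A-F bands described above."""
    if score >= 0.85:
        return "A"
    if score >= 0.70:
        return "B"
    if score >= 0.50:
        return "C"
    if score >= 0.30:
        return "D"
    return "F"


scores = {"answer_relevancy": 0.91, "faithfulness": 0.88, "bertscore_f1": 0.83}
print(hybrid_score(scores), letter_grade(hybrid_score(scores)))
```

Because the combination is a weighted average of 0–1 inputs, the hybrid score stays in the same range, so the grade bands above can be applied directly.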
Key Metrics Explained
| Metric | What It Measures | Why It Matters |
|---|---|---|
| AnswerRelevancy | Is the output on-topic with the input? | Does the output stay focused despite messy input? |
| Faithfulness | Are ALL facts preserved correctly? | Does it maintain accuracy when the input has encoding errors? |
| GEval | Overall quality as judged by another LLM | How professional does the output appear? |
| BERTScore | Semantic similarity to reference | How well does it capture the meaning of the cleaned text? |
| ROUGE | Content overlap with reference | How much key information is preserved? |
| BLEU | Phrasing precision | How closely does the wording match a human-quality standard? |
| METEOR | Linguistic quality with synonyms | How natural does the cleaned output read? |
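The last four rows are reference-based metrics that can be computed locally with common open-source libraries. A minimal sketch, assuming the bert-score, rouge-score, and nltk packages are installed; the sample texts and the choices of ROUGE-L and BLEU smoothing are illustrative, not the framework's configuration:

```python
# pip install bert-score rouge-score nltk
import nltk
from bert_score import score as bert_score
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

# METEOR's synonym matching relies on WordNet.
nltk.download("wordnet", quiet=True)

candidate = "The committee approved the budget after a brief discussion."
reference = "After a short discussion, the committee approved the budget."

# BERTScore: semantic similarity computed from contextual embeddings.
_, _, f1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(f1.item(), 3))

# ROUGE-L: longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L F1:", round(scorer.score(reference, candidate)["rougeL"].fmeasure, 3))

# BLEU: n-gram precision against the reference wording (smoothed for short texts).
ref_tokens, cand_tokens = reference.split(), candidate.split()
print("BLEU:", round(sentence_bleu([ref_tokens], cand_tokens,
                                   smoothing_function=SmoothingFunction().method1), 3))

# METEOR: unigram matching with stemming and WordNet synonyms.
print("METEOR:", round(meteor_score([ref_tokens], cand_tokens), 3))
```

The first three metrics in the table (AnswerRelevancy, Faithfulness, GEval) are typically computed with a judge LLM rather than against a reference text, so they are not shown in this sketch.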