What is a BLEU score and what are its limitations for evaluating natural language generation?

Explanation:

BLEU score measures how much a generated piece of text overlaps with reference texts by checking shared n-grams. It typically looks at 1- to 4-grams, combines those precisions (often via a geometric mean), and applies a brevity penalty to discourage overly short outputs. The idea is that higher overlap with good references suggests higher quality, but it’s a proxy rather than a perfect judge.
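
The mechanics described above (clipped n-gram precision, a geometric mean, and a brevity penalty) can be sketched in plain Python. This is a simplified illustration of the scoring recipe, not the smoothed variants used by toolkits such as sacreBLEU or NLTK:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: modified (clipped) n-gram
    precisions for n = 1..max_n, combined by geometric mean, times
    a brevity penalty. Returns 0.0 if any precision is zero."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:          # candidate too short for this n
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, c in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
        if clipped == 0:             # geometric mean collapses to zero
            return 0.0
        precisions.append(clipped / sum(cand_counts.values()))
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An exact match scores 1.0, while a perfectly valid paraphrase with no shared higher-order n-grams scores 0.0 under this unsmoothed recipe, which previews the surface-form limitation discussed below.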

The main limitations are that BLEU doesn’t evaluate meaning, fluency, or overall readability directly. It cares about surface form, so it can miss correct paraphrases or fluent rewrites that don’t match the exact reference wording. It’s also highly dependent on the reference set: the variety, quality, and number of references shape the scores a system gets, which means it can misrepresent true quality if the references are narrow or biased. Because of these factors, BLEU often does not align perfectly with human judgments, especially across domains or languages with rich variation.

So, while BLEU is a useful automatic tool for quick comparisons, it should be interpreted alongside human evaluations or other metrics. The claim that it requires no reference texts is incorrect, since BLEU relies on reference texts to compute n-gram overlap.
