Six-Dimension Art Evaluation Rubric

Source paper: Learning-based Artificial Intelligence Artwork: Methodology Taxonomy and Quality Evaluation, ACM Computing Surveys (2024).

Origin: The rubric was built from art vocabulary and traditional principles of painting analysis, then validated through a user study to confirm the weightings felt reasonable across different artwork types. The goal was a consistent, repeatable way to evaluate AI-generated artworks across different styles.

The Six Dimensions:

Beauty (50%): The dominant criterion. Encompasses overall compositional harmony: balance, proportion, the arrangement of visual elements, and the pleasing relationship between subjects. An image can score well on every other dimension and still feel wrong if the composition is off. This is where Gestalt principles are most directly applied: does the whole hang together? ...

May 14, 2026 · 3 min

Rubric: Meaning and Origin

A rubric is a scoring guide or evaluation framework that breaks down quality into specific, defined criteria. It provides a structured way to assess something by listing what to look for and, often, how much weight each criterion carries — rather than relying on a vague overall impression. In everyday use, rubrics appear most commonly in education (e.g., grading rubrics for essays) and in evaluation contexts where consistent, transparent judgment is needed. ...
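To make the idea of weighted criteria concrete, here is a minimal Python sketch of how a rubric might combine per-criterion scores into one number. The criterion names, weights, and scores are hypothetical and chosen only for illustration; they do not come from any particular rubric.

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores into one weighted average.

    `scores` and `weights` are dicts keyed by criterion name.
    The names used in the example below are hypothetical.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total weight lets callers pass weights that
    # do not already sum to 1.
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Hypothetical essay rubric: each criterion is scored on a 0-10 scale.
essay_weights = {"thesis": 0.5, "evidence": 0.3, "style": 0.2}
essay_scores = {"thesis": 8, "evidence": 6, "style": 10}
overall = weighted_score(essay_scores, essay_weights)
```

With these made-up numbers the overall score comes out at roughly 7.8, and the heavily weighted criterion accounts for about half of it.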

May 14, 2026 · 1 min

LLM as Judge

Using a language model to evaluate the outputs of another model (or its own outputs) instead of relying on humans or rigid automated metrics like BLEU, ROUGE, or exact match. Give the judge model a response (or a pair of responses) plus a rubric or question, and it returns a score, a label, or a winner.

Why it exists: For open-ended generation (chat answers, code explanations, summaries, agent traces), string-overlap metrics don’t capture quality, and human eval is slow and expensive. Once frontier LLMs got good enough, they became decent proxies for human raters on a lot of tasks, so they’re now the default evaluator in MT-Bench, Chatbot Arena, G-Eval, and most internal eval pipelines. ...
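The judge loop described in this entry (response plus rubric in, score out) might look roughly like the Python sketch below. Everything here is an assumption made for illustration: the prompt template, the 1 to 5 scale, the `judge_response` helper, and the `call_model` parameter, which stands in for whatever model client a real pipeline would use. A stub judge replaces the model so the example runs offline.

```python
import re

# Hypothetical prompt template; real pipelines word this differently.
JUDGE_TEMPLATE = (
    "You are grading a response.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Rubric: {rubric}\n"
    'Reply with a line of the form "Score: N", '
    "where N is an integer from 1 to 5."
)


def judge_response(question, response, rubric, call_model):
    """Ask a judge model for a 1-5 score; return None if none is found.

    `call_model` is any callable that takes a prompt string and
    returns the judge's reply as a string (an assumption, not a real API).
    """
    prompt = JUDGE_TEMPLATE.format(
        question=question, response=response, rubric=rubric
    )
    reply = call_model(prompt)
    # Judges often add free-text reasoning, so search for the score line
    # instead of expecting the whole reply to be a number.
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def stub_judge(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return "The answer is correct and concise.\nScore: 4"


score = judge_response(
    "What is the capital of France?",
    "Paris",
    "Correct and concise.",
    stub_judge,
)
```

Returning `None` for an unparseable reply keeps malformed judge output visible to the caller, which seems safer than silently defaulting to a score.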

May 14, 2026 · 2 min

Deterministic Graders (for LLM / AI Evaluation)

Definition: A deterministic grader is an evaluation function that produces the same result every time for the same input: no randomness, no LLM-in-the-loop judgment. You check the model’s output against a fixed, code-based rule.

Concrete Examples:

Exact string match: “Does the output equal Paris?”
Regex match: “Does the output contain a valid ISO date?”
Structured-output validation: “Does this parse as JSON and pass the schema?”
Code execution / unit tests: “Run the generated function against these test cases. Did they pass?”
Numeric tolerance: “Is the answer within 0.01 of the expected value?”
Set membership: “Is the classification label one of {positive, negative, neutral}?”

Contrast: Model-Graded / LLM-as-Judge. The opposite approach is a model-graded (or “LLM-as-judge”) evaluator, where you ask another model something like “Is this answer helpful and correct?” ...
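Several of the example checks from the deterministic-grader entry can be sketched as small Python functions. The function names, the whitespace and case handling, and the key-presence check (a simplified stand-in for full schema validation) are assumptions made for illustration. The date pattern only checks the YYYY-MM-DD shape, so it would still accept an impossible date such as February 30.

```python
import json
import re

# Shape check only (YYYY-MM-DD); does not verify the calendar date exists.
ISO_DATE = re.compile(r"\b\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")


def exact_match(output, expected):
    # Stripping surrounding whitespace is a design choice, not a requirement.
    return output.strip() == expected


def contains_iso_date(output):
    return ISO_DATE.search(output) is not None


def valid_json_with_keys(output, required_keys):
    # Simplified stand-in for schema validation: parse, then check keys.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data)


def within_tolerance(output, expected, tol=0.01):
    try:
        return abs(float(output) - expected) <= tol
    except ValueError:
        return False  # output was not a number at all


def in_label_set(output, labels=frozenset({"positive", "negative", "neutral"})):
    # Lowercasing before comparison is an assumption about the label format.
    return output.strip().lower() in labels
```

Each function returns a plain boolean and depends only on its arguments, so the same output always receives the same grade.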

April 24, 2026 · 2 min