Source: arXiv:2605.03042 — ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration (Shanghai Jiao Tong University, May 2026)
Core Premise
Any long-horizon task carried out by a single agent is unreliable.
The failure mode ARIS targets is not “the agent crashes.” It is plausible unsupported success: an agent produces a confident, coherent output that is subtly wrong, hallucinated, or unsupported by actual evidence.
Three Architectural Layers
1. Execution Layer: 65+ modular “skills” encoded as plain Markdown files (SKILL.md). Each skill specifies inputs, outputs, step-by-step procedures, quality gates, and failure-handling instructions. Skills exchange state through versioned plain-text artifact files rather than in-memory objects or a database, which enables checkpoint-based recovery across sessions.
2. Orchestration Layer: chains skills into end-to-end workflows. Workflows are not monolithic scripts; they can be resumed from any intermediate artifact.
3. Assurance Layer: makes outputs trustworthy rather than merely plausible. It is structurally independent of the execution layer.
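The artifact handoff and resume behavior described above can be sketched as follows. This is a minimal illustration, not the ARIS implementation: the file naming scheme (`name.vN.md`) and the functions `write_artifact`, `latest_artifact`, and `run_workflow` are assumptions made for this example.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple


def write_artifact(root: Path, name: str, text: str) -> Path:
    """Save text as the next version of an artifact; older versions are kept."""
    root.mkdir(parents=True, exist_ok=True)
    version = len(list(root.glob(f"{name}.v*.md"))) + 1
    path = root / f"{name}.v{version}.md"  # hypothetical naming scheme
    path.write_text(text, encoding="utf-8")
    return path


def latest_artifact(root: Path, name: str) -> Optional[Path]:
    """Return the highest-numbered version of an artifact, or None if absent."""
    versions = sorted(
        root.glob(f"{name}.v*.md"),
        key=lambda p: int(p.stem.rsplit(".v", 1)[1]),
    )
    return versions[-1] if versions else None


def run_workflow(
    root: Path, steps: List[Tuple[str, Callable[[str], str]]]
) -> List[str]:
    """Run steps in order, skipping any step whose artifact is already on disk."""
    executed = []
    previous = ""
    for name, step in steps:
        existing = latest_artifact(root, name)
        if existing is not None:
            # Checkpoint hit: reuse the stored artifact instead of recomputing.
            previous = existing.read_text(encoding="utf-8")
            continue
        previous = step(previous)
        write_artifact(root, name, previous)
        executed.append(name)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        steps = [
            ("idea", lambda prev: "idea text"),
            ("plan", lambda prev: prev + " -> plan"),
        ]
        print(run_workflow(Path(tmp), steps[:1]))  # first session runs "idea"
        print(run_workflow(Path(tmp), steps))      # later session resumes at "plan"
```

Because each step reads only the previous artifact from disk, a new session needs no in-memory state from the one that came before it.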
Key Pattern: Cross-Model Adversarial Collaboration
A critique-to-action loop between two agents from different model families:
- Executor (e.g., Claude) drives forward progress
- Reviewer (e.g., GPT-5.4) critiques intermediate artifacts independently
Why different families? In a same-model self-refinement loop, the generator and validator share inductive biases, so they make correlated errors that neither catches. Cross-family critique is meant to produce genuinely divergent feedback.
Critical protocol: The reviewer reads artifacts directly, not via an executor-provided summary. If the executor summarizes first, the reviewer critiques the executor’s framing, not the underlying work.
The loop terminates when the reviewer's score exceeds 6/10 and all critical items are resolved, or when the maximum of 4 rounds is reached.
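A minimal sketch of this critique-to-action loop, assuming the executor and reviewer are exposed as plain callables. The names `Review`, `review_loop`, `stub_review`, and `stub_revise` are hypothetical, and the stubs stand in for real model calls. Note that the reviewer callable receives the artifact itself, never an executor-written summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Review:
    score: float  # reviewer score on a 0-10 scale
    critical_items: List[str] = field(default_factory=list)  # unresolved blockers


def review_loop(
    artifact: str,
    review: Callable[[str], Review],
    revise: Callable[[str, Review], str],
    min_score: float = 6.0,
    max_rounds: int = 4,
) -> Tuple[str, int, bool]:
    """Return (final artifact, rounds used, accepted)."""
    for round_no in range(1, max_rounds + 1):
        verdict = review(artifact)  # reviewer reads the raw artifact
        if verdict.score > min_score and not verdict.critical_items:
            return artifact, round_no, True
        if round_no < max_rounds:
            # Only revise when another review round remains to check the result.
            artifact = revise(artifact, verdict)
    return artifact, max_rounds, False


# Stubs standing in for the reviewer and executor models (assumptions).
def stub_review(text: str) -> Review:
    fixes = text.count("[fixed]")
    blockers = [] if fixes >= 2 else ["missing baseline"]
    return Review(score=5 + fixes, critical_items=blockers)


def stub_revise(text: str, verdict: Review) -> str:
    return text + " [fixed]"
```

With these stubs, `review_loop("draft", stub_review, stub_revise)` is accepted in round 3. A reviewer that never approves stops the loop after round 4 with `accepted` set to `False`.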
The Assurance Stack
Evidence-to-Claim Audit Cascade (3 stages)
Stage 1: Experiment Integrity Audit. A cross-model reviewer checks for five failure modes:
- Model-derived reference labels (outputs used as ground truth)
- Self-normalized scores (metrics that inflate by design)
- Phantom results (claimed numbers not in actual output files)
- Dead-code inflation (metrics defined but never executed)
- Scope inflation (claims that generalize beyond what was tested)
Stage 2: Result-to-Claim Mapping. Each claim gets a verdict: supported, partially supported, or invalidated. Claims with Stage 1 failures cannot be marked fully supported.
Stage 3: Paper-Claim Audit. A fresh, zero-context reviewer (new thread, no history) cross-checks quantitative claims against raw evidence. The fresh-thread design prevents accumulated context from biasing the audit.
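The Stage 2 rule, that a claim with any Stage 1 failure cannot be marked fully supported, can be expressed as a small gate. This is a sketch under assumptions: the identifiers for the five failure modes and the choice to downgrade to "partially supported" (rather than some other handling) are illustrative, not taken from the paper.

```python
from enum import Enum
from typing import Set


class Verdict(Enum):
    SUPPORTED = "supported"
    PARTIALLY_SUPPORTED = "partially supported"
    INVALIDATED = "invalidated"


# Hypothetical identifiers for the five Stage 1 failure modes.
INTEGRITY_FAILURES = {
    "model_derived_labels",
    "self_normalized_scores",
    "phantom_results",
    "dead_code_inflation",
    "scope_inflation",
}


def final_verdict(proposed: Verdict, stage1_failures: Set[str]) -> Verdict:
    """Cap the verdict for a claim that has Stage 1 integrity failures."""
    unknown = stage1_failures - INTEGRITY_FAILURES
    if unknown:
        raise ValueError(f"unknown failure modes: {sorted(unknown)}")
    if stage1_failures and proposed is Verdict.SUPPORTED:
        # Assumption: downgrade rather than reject outright.
        return Verdict.PARTIALLY_SUPPORTED
    return proposed
```

Encoding the rule as code rather than as a reviewer instruction means no later step can restore a "supported" verdict without clearing the Stage 1 finding first.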
Manuscript Assurance (additional checks)
- Five-pass scientific editing pipeline (clutter, active voice, structure, terminology consistency, numerical consistency)
- Proof verification with 20-category issue taxonomy
- Visual PDF review (catches layout issues source-only review misses)
- Citation audit: existence, metadata correctness, and context appropriateness
Effort Presets
| Level | Multiplier | Use |
|---|---|---|
| lite | ~0.4x | Quick exploration |
| balanced | 1x | Default |
| max | ~2.5x | Thorough review |
| beast | ~5–8x | Maximum depth |
Reviewer reasoning effort (GPT-5.4 xhigh) stays constant regardless of preset — effort scaling changes coverage/iteration counts, not reviewer quality.
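One way to read the table is as a configuration in which the preset scales only the work budget while the reviewer setting stays fixed. The sketch below makes several assumptions: it treats the approximate multipliers as exact values, represents "beast" as a range, and uses a hypothetical `baseline_items` count as the quantity being scaled.

```python
from typing import Dict, Tuple

# (low, high) multipliers; values are approximate in the table above.
PRESETS: Dict[str, Tuple[float, float]] = {
    "lite": (0.4, 0.4),
    "balanced": (1.0, 1.0),
    "max": (2.5, 2.5),
    "beast": (5.0, 8.0),
}

REVIEWER_EFFORT = "xhigh"  # constant regardless of preset


def plan_budget(preset: str, baseline_items: int) -> Dict[str, object]:
    """Scale a baseline coverage/iteration count by the chosen preset."""
    low, high = PRESETS[preset]
    return {
        "items_min": max(1, round(baseline_items * low)),
        "items_max": max(1, round(baseline_items * high)),
        "reviewer_effort": REVIEWER_EFFORT,
    }
```

For example, `plan_budget("lite", 10)` yields a budget of 4 items and `plan_budget("beast", 10)` yields 50 to 80, with `reviewer_effort` equal to `"xhigh"` in both cases.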
Persistent Memory: Research Wiki
Four entity types stored as structured Markdown: papers, ideas, experiments, claims. Eight typed relationships: extends, contradicts, addresses_gap, inspired_by, tested_by, supports, invalidates, supersedes.
Key design: rejected ideas are retained as a banlist. Without persistent memory, ideation pipelines re-propose the same dead-end directions across sessions.
All state lives in versioned text files on disk — not in LLM context windows — enabling cross-session continuity and checkpoint recovery.
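The wiki's data model can be sketched as a small typed graph. The entity and relationship names come from the description above; the `ResearchWiki` class, the `status` field used to mark rejected ideas, and the title normalization in `is_banned` are assumptions for illustration. Persistence to Markdown files is omitted to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ENTITY_TYPES = {"paper", "idea", "experiment", "claim"}
RELATION_TYPES = {
    "extends", "contradicts", "addresses_gap", "inspired_by",
    "tested_by", "supports", "invalidates", "supersedes",
}


@dataclass
class ResearchWiki:
    # entity id -> (type, title, status); "status" is a hypothetical field
    entities: Dict[str, Tuple[str, str, str]] = field(default_factory=dict)
    # (source id, relation, target id)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, entity_id: str, kind: str, title: str,
                   status: str = "active") -> None:
        if kind not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {kind}")
        self.entities[entity_id] = (kind, title, status)

    def link(self, source: str, relation: str, target: str) -> None:
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation: {relation}")
        if source not in self.entities or target not in self.entities:
            raise KeyError("both endpoints must exist before linking")
        self.relations.append((source, relation, target))

    def banlist(self) -> List[str]:
        """Titles of rejected ideas, retained so they are not proposed again."""
        return [title for kind, title, status in self.entities.values()
                if kind == "idea" and status == "rejected"]

    def is_banned(self, title: str) -> bool:
        normalized = title.strip().lower()
        return normalized in {t.strip().lower() for t in self.banlist()}
```

An ideation step would call `is_banned` on each candidate before spending any compute on it. Exact title matching is a simplification; a real system would likely need fuzzier comparison.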
Actionable Patterns for Production AI Apps
- Use heterogeneous reviewers. Pair a GPT executor with a Claude reviewer, or a Claude executor with a GPT reviewer. Same-family pairs tend to share blind spots.
- Reviewer reads artifacts directly. Don’t pass the executor’s summary — give the reviewer the raw output.
- Fresh context for audits. Use a new thread/context for critical verification. Prior conversation history creates confirmation bias.
- Track claims explicitly. Maintain a claim ledger mapping every factual/quantitative output to its supporting evidence. Don’t trust the final prose.
- Persistent memory over ephemeral context. State in versioned files enables checkpoint recovery and cross-session continuity.
- Cap review rounds. Over-iterating with the same reviewer can push the executor to overfit to the reviewer's preferences instead of improving actual quality.
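The claim-ledger pattern from the list above can be illustrated with a simple check that every number in a claim also appears in its raw evidence. This is a rough sketch, not the paper's audit: `audit_ledger` and the ledger layout are hypothetical, the sample numbers are invented for the example, and matching is by exact string, so `78.90` and `78.9` would not match.

```python
import re
from typing import Dict, List

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def audit_ledger(ledger: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """For each ledger entry, list numbers in the claim absent from the evidence."""
    findings = []
    for entry in ledger:
        evidence_numbers = set(NUMBER.findall(entry["evidence"]))
        missing = [n for n in NUMBER.findall(entry["claim"])
                   if n not in evidence_numbers]
        findings.append({"claim": entry["claim"], "missing": missing})
    return findings


# Illustrative data only; not results from the paper.
ledger = [
    {
        "claim": "Accuracy improves from 71.2 to 78.9 on 3 datasets.",
        "evidence": "acc_baseline=71.2\nacc_ours=78.4\nn_datasets=3",
    }
]
```

Here `audit_ledger(ledger)` flags `78.9` as missing, since the evidence records `78.4`. That is the kind of phantom result a reviewer reading only the final prose would not catch.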
Limitations
- Assurance stack is advisory, not formal verification
- Repository-level review sends code to external APIs (confidentiality risk)
- Cross-family vs. same-family review hasn’t been rigorously benchmarked yet (listed as future work)
- Human responsibility remains: ARIS automates execution and review loops; humans provide research direction and make final submission decisions