{"results":[{"id":"automated-overnight-construction","text":"The derive-review-research cycle is mechanical enough to run unattended. Three of six substrate validations were run by an autonomous Claude session with no human intervention, producing consistent results. The target: kick off expert-build derive-review-repair in the evening, wake up to a reviewed knowledge base. Learn while you sleep, build while you're awake.","truth_value":"IN","justification_count":2,"dependent_count":0,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"invalid","source_type":""},{"id":"evidence-beliefs-ablation","text":"Beliefs alone outperform beliefs + expert prompt: Opus 100% vs 94.2% (+5.8pp), Sonnet 94.2% vs 91.8% (+2.4pp). Adding expert prompt hurts — agent trusts its 'expertise' instead of consulting the knowledge base","truth_value":"IN","justification_count":0,"dependent_count":2,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-exists-but-not-linked","text":"The eval harnesses, question sets, JSON result files, Langfuse traces, and methodology writeups all exist in project repos (beliefs-pi, expert-service, claude_code_langgraph). They are not public or linked from llmeem.ai. The credibility gap is a presentation problem, not a substance problem.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-expert-vs-baseline","text":"Expert-service with EEM scores 88% A-grade vs an agent pipeline 33% on same 50 questions, 15x faster","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-retraction-rate","text":"13-37% of derived beliefs are retracted per review round across multiple expert KBs. Self-correction works — the system finds and removes its own errors","truth_value":"IN","justification_count":0,"dependent_count":2,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"expert-agent-builder-repo","text":"expert-agent-builder automates the knowledge pipeline: fetch docs → generate entries → extract beliefs → derive → review. Install: pip install expert-agent-builder or uv tool install expert-agent-builder. Source and issues: https://github.com/benthomasson/expert-agent-builder","truth_value":"IN","justification_count":0,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"expert-pipeline","text":"Expert pipeline: chunk source material → propose beliefs → human accepts → derive connections → review derivations → export. Value accrues at each stage, with derive producing new knowledge (connections the source doesn't make explicit)","truth_value":"IN","justification_count":2,"dependent_count":1,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"unnecessary","source_type":""},{"id":"expert-pipeline-design","text":"The expert-build pipeline implements: fetch-docs (source → markdown), summarize (entries from sources), propose-beliefs (extract candidates), accept-beliefs (human gate), derive (find connections), review-beliefs (audit derivations), export (beliefs.md/JSON). Value accrues at each stage — derive produces new knowledge that no single source document states.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"expert-prompt-paradox","text":"Telling an agent it is an expert reduces belief utilization. The humble generic prompt produces better results because the agent consults the knowledge base instead of trusting its 'expertise'","truth_value":"IN","justification_count":1,"dependent_count":1,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"pass","source_type":""},{"id":"fabricated-specificity-rate-8-percent","text":"8% of premises contain fabricated details the source never mentioned (propose-beliefs Phase 2 experiment, 100 randomly sampled premises from handbook-expert). Dominant error type is embellishment, not contradiction — the proposer adds plausible but unsupported specifics (e.g., Redis when source says nothing about storage backend).","truth_value":"IN","justification_count":0,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"how-agents-use-eem","text":"LLM agents use EEM by: querying beliefs via search/show/explain before answering, citing node IDs for auditability, running derive to generate new beliefs from existing ones, running review-beliefs to self-audit, recording nogoods when contradictions appear. The agent does not need to be told it is an expert — the knowledge base speaks for itself","truth_value":"IN","justification_count":2,"dependent_count":0,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"unnecessary","source_type":""},{"id":"http-endpoint-access","text":"EEM is accessible via a single HTTP GET at https://expert.ftl2.com/public/eem-expert/beliefs — no Python library, no CLI installation, no database copy, no setup. Three formats available: HTML (human-browsable), Markdown (agent-readable), JSON (machine-readable). Any agent that can fetch a URL can consume justified beliefs immediately.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"model-stacking-evidence","text":"Multi-pass agent pattern observed: Model A generates candidates, TMS records with provenance, review critiques (machine + human), Model B receives validated beliefs, Model B derives new beliefs. Demonstrated in expert-build pipeline where Sonnet summarizes sources, then Sonnet derives, then Sonnet reviews — each pass gets fresh context with the TMS as the persistent layer between passes.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"no-code-adoption","text":"Three levels of EEM integration: (1) HTTP GET — just a URL, read beliefs as context, no installation. (2) CLI (ftl-reasons) — search, show, explain, pip install. (3) Full pipeline (expert-build) — build, derive, review, maintain. The HTTP level is the on-ramp: try EEM in 30 seconds, see if it helps, no commitment.","truth_value":"IN","justification_count":1,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"scale-data","text":"40+ expert knowledge bases built across domains. Smallest: aap-expert (237 beliefs). Largest: redhat-expert (12,511 nodes across 6 departments/expert agents, 11,897 IN after repair). Domains include enterprise products, codebases, research papers, certification curricula, cloud infrastructure (AWS, GCP, Azure, OpenShift, bare-metal, Hetzner).","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"scale-evidence","text":"EEM scales from small domains (237 beliefs, aap-expert) to large enterprises (12,731 beliefs, redhat-expert). 40+ expert knowledge bases built across code, product, project, and domain-specific experts","truth_value":"IN","justification_count":2,"dependent_count":0,"challenges":[],"last_reviewed":"2026-05-29T17:30:21","review_result":"invalid","source_type":""},{"id":"self-referential-evidence","text":"The EEM evidence base is currently self-referential: all experimental results (98.5% dual-path, 88% vs 33% expert-vs-baseline, confidence r-values) are sourced from the same project's internal entries. The belief registry demonstrating itself on its own claims is circular — structure and provenance tracking work, but the underlying numbers lack independent verification","truth_value":"IN","justification_count":0,"dependent_count":3,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"site-lacks-methodology-links","text":"llmeem.ai presents headline numbers (98.5% A/B across 3,853 questions, 88% vs 33% expert-vs-baseline, confidence r-values) without links to methodology, question sets, scoring rubrics, or raw result data. The numbers appear as unverified marketing claims.","truth_value":"IN","justification_count":0,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"structure-not-truth-applies-to-site","text":"The expert.ftl2.com belief explorer demonstrates that TMS mechanics work (justification chains, IN/OUT propagation, retraction cascades) but does not prove the underlying claims are correct. A belief can be IN and fully justified within the system while being wrong, because all antecedents trace back to the same author's observations. Structure proves the tooling; external evidence proves the claims.","truth_value":"IN","justification_count":1,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"training-finetuning-cost-comparison","text":"Model fine-tuning costs $10K-$100K+ for a single domain adaptation, requires ML expertise, and produces a model locked to one provider. Training from scratch costs millions. EEM construction costs ~$300 (Sonnet) to ~$1,500 (Opus) for enterprise scale (13,511 beliefs, 6 departments), requires no ML expertise, and produces a portable knowledge artifact usable by any model. EEM is 10-100x cheaper than fine-tuning and works across providers.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""}],"count":20,"limit":20,"offset":0}