{"results":[{"id":"automation-evidence","text":"Three of six substrate derive-review-research cycles (bare-metal, AWS, OpenShift) were run by an autonomous Claude session with no human intervention. Results were consistent with manually-run cycles: same four failure categories, same proportional breakdown, same repair strategies. The process is mechanical enough for a single LLM session to execute the full loop.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"credibility-is-presentation-problem","text":"The credibility gap on llmeem.ai is a presentation problem, not a substance problem. The evidence (eval harnesses, question sets, raw results, methodology writeups) exists but is not linked or publicly accessible. Fixing credibility requires linking to existing evidence, not generating new evidence.","truth_value":"IN","justification_count":1,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"dual-path-design-evidence","text":"Dual-path retrieval (TMS path for pre-computed beliefs + FTS path for source chunk search, merged by a third pass) achieves 98.5% A/B across 3,853 questions. Opus drops from 95.5% to 86% when mixing beliefs and document chunks in a single prompt; three focused passes achieve 100%.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-beliefs-ablation","text":"Beliefs alone outperform beliefs + expert prompt: Opus 100% vs 94.2% (+5.8pp), Sonnet 94.2% vs 91.8% (+2.4pp). Adding expert prompt hurts — agent trusts its 'expertise' instead of consulting the knowledge base","truth_value":"IN","justification_count":0,"dependent_count":2,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-depth-ceiling","text":"Beliefs beyond depth 8 do not survive review. Retraction rate: 0% at depth 0, rising to 100% at depth 9+. The universal TMS is wide rather than deep","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-dual-path","text":"Opus + dual-path architecture achieves 98.5% A/B across 3,853 questions. Zero D/F grades — eliminated the failure tail entirely","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-exists-but-not-linked","text":"The eval harnesses, question sets, JSON result files, Langfuse traces, and methodology writeups all exist in project repos (beliefs-pi, expert-service, claude_code_langgraph). They are not public or linked from llmeem.ai. The credibility gap is a presentation problem, not a substance problem.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-expert-vs-baseline","text":"Expert-service with EEM scores 88% A-grade vs an agent pipeline 33% on same 50 questions, 15x faster","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-model-compensation","text":"EEM compensates for model size: Sonnet+beliefs approximates Opus without beliefs. Haiku with dual-path achieves 94% A+B, matching Opus at 98%","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"evidence-retraction-rate","text":"13-37% of derived beliefs are retracted per review round across multiple expert KBs. Self-correction works — the system finds and removes its own errors","truth_value":"IN","justification_count":0,"dependent_count":2,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"how-to-start","text":"To start using EEM: (1) reasons init — creates reasons.db, (2) add premises from observations with reasons add, (3) add justified conclusions with --sl to link dependencies, (4) use reasons derive to find connections, (5) use reasons review-beliefs to audit, (6) retract when evidence changes and let cascades propagate","truth_value":"IN","justification_count":2,"dependent_count":0,"challenges":[],"last_reviewed":"2026-05-29T17:30:21","review_result":"invalid","source_type":""},{"id":"model-stacking-evidence","text":"Multi-pass agent pattern observed: Model A generates candidates, TMS records with provenance, review critiques (machine + human), Model B receives validated beliefs, Model B derives new beliefs. Demonstrated in expert-build pipeline where Sonnet summarizes sources, then Sonnet derives, then Sonnet reviews — each pass gets fresh context with the TMS as the persistent layer between passes.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"scale-evidence","text":"EEM scales from small domains (237 beliefs, aap-expert) to large enterprises (12,731 beliefs, redhat-expert). 40+ expert knowledge bases built across code, product, project, and domain-specific experts","truth_value":"IN","justification_count":2,"dependent_count":0,"challenges":[],"last_reviewed":"2026-05-29T17:30:21","review_result":"invalid","source_type":""},{"id":"self-referential-evidence","text":"The EEM evidence base is currently self-referential: all experimental results (98.5% dual-path, 88% vs 33% expert-vs-baseline, confidence r-values) are sourced from the same project's internal entries. The belief registry demonstrating itself on its own claims is circular — structure and provenance tracking work, but the underlying numbers lack independent verification","truth_value":"IN","justification_count":0,"dependent_count":3,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"source-quality-determines-derivation","text":"Source document quality is the primary determinant of derivation quality. Technical/operational documentation (procedures, configurations, specs) produces 13-31% invalid rates with mostly softenable errors. Marketing/strategy documentation (assertions without evidence) produces up to 70% invalid rates with destructive retractions. The derive-review pipeline functions as a document quality assay.","truth_value":"IN","justification_count":0,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"structure-not-truth-applies-to-site","text":"The expert.ftl2.com belief explorer demonstrates that TMS mechanics work (justification chains, IN/OUT propagation, retraction cascades) but does not prove the underlying claims are correct. A belief can be IN and fully justified within the system while being wrong, because all antecedents trace back to the same author's observations. Structure proves the tooling; external evidence proves the claims.","truth_value":"IN","justification_count":1,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""}],"count":16,"limit":20,"offset":0}