{"results":[{"id":"atms-de-kleer-1986","text":"de Kleer (1986) ATMS uses assumption-based environments and nogoods. TMS beats ATMS for EEM because revision matters more than multiple environments when the problem solver (LLM) produces 13-37% errors","truth_value":"IN","justification_count":0,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"automation-evidence","text":"Three of six substrate derive-review-research cycles (bare-metal, AWS, OpenShift) were run by an autonomous Claude session with no human intervention. Results were consistent with manually-run cycles: same four failure categories, same proportional breakdown, same repair strategies. The process is mechanical enough for a single LLM session to execute the full loop.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"belief-registry-externalizes-critique","text":"The belief registry externalizes and persists the critic's judgments. Instead of relying on the same LLM to both generate and evaluate (which fails — self-critique damages accuracy -3pp to -41.5pp), the registry stores review outcomes as truth values (IN/OUT), retraction records, and nogoods. The critic's work survives across sessions and is available to any model.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"beliefs-cli-vs-reasons-cli","text":"Two CLIs at different levels: beliefs CLI is a structured markdown KB with provenance and manual maintenance (simple, flat). reasons CLI (ftl-reasons) is a full TMS with automatic propagation, cascades, backtracking, and LLM-driven operations (powerful, dependency-aware). Use beliefs for independent facts, reasons for justified conclusions with dependency chains","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"confidence-unreliable","text":"LLM self-assessed confidence does not reliably track accuracy. Confirmed across 4 models (corrected results): Opus r=0.280, Sonnet r=0.223, Flash r=0.267, Pro r=0.137. Confidence explains only 2-8% of variance. Revision based on self-assessment damages accuracy in all 4 models (-3pp to -41.5pp). Same structural flaw as human overconfidence (Kahneman) — answer and confidence come from the same process.","truth_value":"IN","justification_count":0,"dependent_count":2,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"context-opacity","text":"The human cannot track what the LLM currently has in context. Context windows are opaque — the human sees their messages and tool results but not the model's internal state, compaction decisions, or what was silently dropped.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"continuity-human-problem","text":"The human cannot track what the LLM currently has in context. Context windows are opaque and compaction destroys justification networks. EEM solves this via visibility and persistence — the human can always inspect the current belief state regardless of what the model has in context.","truth_value":"IN","justification_count":2,"dependent_count":1,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"invalid","source_type":""},{"id":"credibility-is-presentation-problem","text":"The credibility gap on llmeem.ai is a presentation problem, not a substance problem. The evidence (eval harnesses, question sets, raw results, methodology writeups) exists but is not linked or publicly accessible. Fixing credibility requires linking to existing evidence, not generating new evidence.","truth_value":"IN","justification_count":1,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"cross-model-portability","text":"EEM works across model providers and sizes. The same belief network can be queried by Claude, Gemini, local models, or any LLM that can read text. Model upgrades, provider swaps, and cost optimization (Opus→Haiku) preserve all knowledge. The beliefs are plain text with structure — no model-specific format.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"eem-works","text":"EEM measurably and dramatically improves LLM performance on domain tasks. The core research question is answered: yes","truth_value":"IN","justification_count":1,"dependent_count":3,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"unnecessary","source_type":""},{"id":"evidence-exists-but-not-linked","text":"The eval harnesses, question sets, JSON result files, Langfuse traces, and methodology writeups all exist in project repos (beliefs-pi, expert-service, claude_code_langgraph). They are not public or linked from llmeem.ai. The credibility gap is a presentation problem, not a substance problem.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"four-failure-categories","text":"LLM derivation produces exactly four categories of error, validated across 6 infrastructure domains (939 derivations, 219 invalid): smuggled premises (41-54% of invalids — facts from parametric knowledge not cited in antecedents), false causal links (20-26% — independent capabilities asserted as integrated), unsupported superlatives (8-24% — strength claims overstating sources), domain conflation (13-15% — properties of one system applied to another). No new category has appeared. The taxonomy is complete.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"ftl-reasons-implementation","text":"ftl-reasons implements: SL justifications with antecedents and outlists, BFS propagation cascades with restoration, entrenchment-scored dependency-directed backtracking, challenge/defend dialectical argumentation (challenge→OUT, defend neutralizes, multi-level chains), LLM-driven derive, review-beliefs, and contradiction detection. SQLite-backed, Python CLI.","truth_value":"IN","justification_count":0,"dependent_count":2,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"ftl-reasons-is-tms","text":"ftl-reasons implements actual Doyle-style TMS architecture: SL justifications with antecedents and outlists, BFS propagation cascades with restoration, entrenchment-scored dependency-directed backtracking. LLMs fill the problem-solver role Doyle left open","truth_value":"IN","justification_count":2,"dependent_count":6,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"pass","source_type":""},{"id":"generate-and-critique","text":"LLMs are extraordinary generators but unreliable critics. The belief registry externalizes and persists the critic's judgments, replacing internal self-assessment with external structured tracking","truth_value":"IN","justification_count":3,"dependent_count":0,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"invalid","source_type":""},{"id":"how-agents-use-eem","text":"LLM agents use EEM by: querying beliefs via search/show/explain before answering, citing node IDs for auditability, running derive to generate new beliefs from existing ones, running review-beliefs to self-audit, recording nogoods when contradictions appear. The agent does not need to be told it is an expert — the knowledge base speaks for itself","truth_value":"IN","justification_count":2,"dependent_count":0,"challenges":[],"last_reviewed":"2026-05-30T07:02:40","review_result":"unnecessary","source_type":""},{"id":"hybrid-tms","text":"ftl-reasons is a hybrid TMS: symbolic TMS handles structure (justifications, propagation, cascades, backtracking, challenge/defend) while LLMs handle semantic operations (derive generates beliefs, review-beliefs critiques them, contradiction detection finds nogoods)","truth_value":"IN","justification_count":1,"dependent_count":5,"challenges":[],"last_reviewed":"2026-05-29T17:30:21","review_result":"pass","source_type":""},{"id":"independent-validation","text":"Karpathy's LLM Wiki (2026) independently arrived at the same diagnosis (RAG is stateless waste) and same general solution (persistent structured knowledge). EEM goes further with justification chains, retraction cascades, and controlled eval data. When independent researchers arrive at the same architecture from different starting points, the architecture is probably right.","truth_value":"IN","justification_count":1,"dependent_count":0,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"karpathy-llm-wiki","text":"Andrej Karpathy (2026) independently proposed an 'LLM Wiki' — a persistent structured knowledge base that an LLM incrementally builds and maintains instead of re-discovering knowledge from scratch. Same diagnosis (RAG is stateless waste), same general solution (persistent knowledge artifact). EEM goes further: justification chains, retraction cascades, measured results across 4 model families. Independent convergence from a credible source validates the core insight.","truth_value":"IN","justification_count":0,"dependent_count":1,"challenges":[],"last_reviewed":null,"review_result":null,"source_type":""},{"id":"llm-as-problem-solver","text":"Putting an LLM in the TMS problem-solver slot (generator via derive, critic via review-beliefs and contradiction detection) is what Doyle's architecture prescribes. The open question is whether an LLM is a good problem solver, not whether using one is faithful to the design","truth_value":"IN","justification_count":1,"dependent_count":1,"challenges":[],"last_reviewed":"2026-05-29T17:30:21","review_result":"pass","source_type":""}],"count":27,"limit":20,"offset":0}