When Do Tools and Planning Help Large Language Models Think?

A Cost- and Latency-Aware Benchmark

Published in IEEE, March 2026 (arXiv)

Modern large language models (LLMs) increasingly rely on inference-time planning and external tools to improve reasoning. We benchmark this behavior in two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV). Using LangChain and LangGraph, we compare a one-shot baseline against a plan–execute–replan agent equipped with task-specific tools: DBpedia tools for SPARQL (SPARQL Protocol and RDF Query Language) queries, entity lookup, and schema exploration; Wikipedia-focused retrieval; and topical web search.
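To make the two compared conditions concrete, the following is a minimal sketch of a one-shot baseline and a plan–execute–replan loop. It is not the benchmark's implementation: it deliberately avoids LangChain and LangGraph APIs, and every name in it (`AgentState`, `one_shot`, `plan_execute_replan`, the stub planner, replanner, and tools, and the `max_steps` cap) is a hypothetical stand-in chosen for illustration. The number of tool calls is returned as a rough proxy for cost.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: a tool maps an input string to an observation string,
# and a plan is a list of (tool_name, tool_input) steps.
Tool = Callable[[str], str]
Plan = List[Tuple[str, str]]


@dataclass
class AgentState:
    question: str
    plan: Plan
    observations: List[str] = field(default_factory=list)
    tool_calls: int = 0  # rough proxy for cost


def one_shot(question: str, llm: Callable[[str], str]) -> str:
    """Baseline condition: a single model call with no tools."""
    return llm(question)


def plan_execute_replan(
    question: str,
    planner: Callable[[str], Plan],
    replanner: Callable[[AgentState], Tuple[Plan, Optional[str]]],
    tools: Dict[str, Tool],
    max_steps: int = 5,  # assumed budget, not taken from the paper
) -> Tuple[str, int]:
    """Agent condition: plan, run one step, then revise the plan or answer."""
    state = AgentState(question=question, plan=planner(question))
    while state.plan and state.tool_calls < max_steps:
        tool_name, tool_input = state.plan.pop(0)
        state.observations.append(tools[tool_name](tool_input))
        state.tool_calls += 1
        # The replanner either returns a final answer or an updated plan.
        state.plan, answer = replanner(state)
        if answer is not None:
            return answer, state.tool_calls
    # Budget exhausted or plan empty: fall back to the raw observations.
    return " ".join(state.observations), state.tool_calls


# Stub components standing in for an LLM planner and real tools.
def stub_planner(question: str) -> Plan:
    return [("sparql", question), ("wikipedia", question)]


def stub_replanner(state: AgentState) -> Tuple[Plan, Optional[str]]:
    if len(state.observations) >= 2:
        return [], f"answer from {len(state.observations)} observations"
    return state.plan, None


stub_tools: Dict[str, Tool] = {
    "sparql": lambda q: "sparql:" + q,
    "wikipedia": lambda q: "wiki:" + q,
}
```

Calling `plan_execute_replan("Q", stub_planner, stub_replanner, stub_tools)` runs both planned steps before the stub replanner produces an answer, whereas `one_shot` makes exactly one model call. Comparing the two on answer quality, tool-call count, and wall-clock time is the kind of trade-off the benchmark is concerned with.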