When Do Tools and Planning Help Large Language Models Think?

A Cost- and Latency-Aware Benchmark

Published IEEE, March 2026 Arxiv

Modern large language models (LLMs) increasingly rely on inference-time planning and external tools to improve reasoning. We benchmark this behavior on two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV). Using LangChain and LangGraph, we compare a one-shot baseline against a plan–execute–replan agent equipped with task-specific tools (DBpedia SPARQL Protocol and RDF Query Language (SPARQL)/lookup/schema exploration, Wikipedia-focused retrieval, and topical web search).

References

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017. [Online]. Available: https://arxiv.org/abs/1706.03762

  2. Z. Wang, Z. Chu, T. V. Doan, S. Ni, M. Yang, and W. Zhang, “History, development, and principles of large language models—an introductory survey,” arXiv preprint arXiv:2402.06853, 2024. [Online]. Available: https://arxiv.org/abs/2402.06853

  3. J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” arXiv preprint arXiv:2201.11903, 2022. [Online]. Available: https://arxiv.org/abs/2201.11903

  4. X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” arXiv preprint arXiv:2203.11171, 2022. [Online]. Available: https://arxiv.org/abs/2203.11171

  5. C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling LLM test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024. [Online]. Available: https://arxiv.org/abs/2408.03314

  6. N. Muennighoff, Z. Yang, W. Shi, X. L. Li, F. Li, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candes, and T. Hashimoto, “s1: Simple test-time scaling,” arXiv preprint arXiv:2501.19393, 2025. [Online]. Available: https://arxiv.org/abs/2501.19393

  7. S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” arXiv preprint arXiv:2305.10601, 2023.[Online]. Available: https://arxiv.org/abs/2305.10601

  8. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. K¨ uttler, M. Lewis, W. Yih, T. Rockt¨ aschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 9459–9474. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html

  9. S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” arXiv preprint arXiv:2210.03629, 2022. [Online]. Available: https://arxiv.org/abs/2210.03629

  10. T. Schick, J. Dwivedi-Yu, R. Dess` ı, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” arXiv preprint arXiv:2302.04761, 2023. [Online]. Available: https://arxiv.org/abs/2302.04761

  11. L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “PAL: Program-aided language models,” arXiv preprint arXiv:2211.10435, 2022. [Online]. Available: https: //arxiv.org/abs/2211.10435

  12. W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,” arXiv preprint arXiv:2211.12588, 2022. [Online]. Available: https://arxiv.org/abs/2211.12588

  13. T. S. Costa, S. Gottschalk, and E. Demidova, “Event-qa: A dataset for event-centric question answering over knowledge graphs,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM), 2020, pp. 3157–3164. [Online]. Available: https://arxiv.org/abs/2004.11861

  14. C. Tan, V. Niculae, C. Danescu-Niculescu-Mizil, and L. Lee, “Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions,” in Proceedings of the 25th International Conference on World Wide Web (WWW), 2016, pp. 613–624. [Online]. Available: https://arxiv.org/abs/1602.01103

  15. J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer, “DBpedia—a large-scale, multilingual knowledge base extracted from wikipedia,” Semantic Web, vol. 6, no. 2, pp. 167–195, 2015.

  16. S. Raschka, “The state of LLM reasoning model inference: Inference-time compute scaling methods to improve reasoning models,” https://magazine.sebastianraschka.com/p/state-of-llm-reasoning-and-inference-scaling, Mar. 2025, ahead of AI.

  17. DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song et al., “Deepseek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025. [Online]. Available: https://arxiv.org/abs/2501.12948

  18. Y. Li, X. Hu, X. Qu, L. Li, and Y. Cheng, “Test-time preference optimization: On-the-fly alignment via iterative textual feedback,” arXiv preprint arXiv:2501.12895, 2025. [Online]. Available: https://arxiv.org/abs/2501.12895

  19. Y. Wang, Q. Liu, J. Xu, T. Liang, X. Chen, Z. He, L. Song, D. Yu, J. Li, Z. Zhang, R. Wang, Z. Tu, H. Mi, and D. Yu, “Thoughts are all over the place: On the underthinking of o1-like LLMs,” arXiv preprint arXiv:2501.18585, 2025. [Online]. Available: https://arxiv.org/abs/2501.18585

  20. J. Pan, S. Deng, and S. Huang, “Coat: Chain-of-associated-thoughts framework for enhancing large language models reasoning,” arXiv preprint arXiv:2502.02390, 2025. [Online]. Available: https://arxiv.org/abs/2502.02390

  21. LangChain, “Langchain overview,” https://docs.langchain.com/oss/python/langchain/overview, accessed: 2025-12-29.

  22. “Langgraph overview,” https://docs.langchain.com/oss/python/langgraph/overview, accessed: 2025-12-29.

  23. “Graph api overview (langgraph),” https://docs.langchain.com/oss/python/langgraph/graph-api, accessed: 2025-12-29.

  24. DBpedia Association, “Sparql over online databases (dbpedia public endpoint documentation),” https://www.dbpedia.org/resources/sparql/, accessed: 2025-12-29.

  25. W3C SPARQL Working Group, “SPARQL 1.1 query language,” https://www.w3.org/TR/sparql11-query/, 2013, w3C Recommendation. Accessed: 2025-12-29.

  26. Tavily, “Tavily search api: Search endpoint,” https://docs.tavily.com/documentation/api-reference/endpoint/search, accessed: 2025-12-29.

  27. OpenAI, “Gpt-4o model (openai api documentation),” https://platform.openai.com/docs/models/gpt-4o, accessed: 2025-12-29.

  28. “Gpt-4o mini model (openai api documentation),” https://platform.openai.com/docs/models/gpt-4o-mini, accessed: 2025-12-29.

  29. “Pricing — openai api,” https://platform.openai.com/docs/pricing, accessed: 2025-12-29.

  30. “Web-artifact from claude.ai,” https://claude.ai/public/artifacts/0ecdfb83-807b-4481-8456-8605d48a356c, accessed: 2025-12-06.

  31. J. Howarth, “Number of parameters in gpt-4 (latest data),” https://explodingtopics.com/blog/gpt-parameters, 2025, last updated June 17, 2025; Accessed: 2025-12-06.