Abstract Summary
Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically retrieving and utilizing relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit their effectiveness in multi-turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically retrieve and manage tools or MCP server contexts across multi-turn conversations, outperforming previous state-of-the-art tool retrieval approaches that lack multi-turn support and memory management of available tools or MCPs. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. We evaluate all modes across 13+ LLMs on the ScaleMCP benchmark, conducting experiments over 100 consecutive user interactions and reporting tool removal ratios (short-term memory efficiency), task completion accuracy, and a comprehensive cost analysis across modes. MemTool significantly outperforms existing state-of-the-art tool retrieval methods, which cannot handle multi-turn tool retrieval and management. In Autonomous Agent Mode, reasoning LLMs achieve high tool-removal efficiency (90–94\% over a 3-window average), while medium-sized models exhibit significantly lower efficiency (0–60\%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade-offs, cost analysis, and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.