MemGPT: Towards LLMs as Operating Systems - 大模型长记忆解决方案
这是一篇来自加州大学伯克利分校的论文,主要针对大模型受限的上下文窗口提出了一套解决方案。这里对其进行了翻译,方便从事应用开发、Agent开发等场景的研发人员参考其原理。
Abstract
Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems which provide the illusion of an extended virtual memory via paging between physical memory and disk. Using this technique, we introduce MemGPT (MemoryGPT), a system that intelligently manages different storage tiers in order to effectively provide extended context within the LLM’s limited context window. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM’s context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://research.memgpt.ai.
Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis.
大型语言模型(LLMs)已经彻底改变了人工智能,但受限于有限的上下文窗口,这阻碍了它们在如扩展对话和文档分析等任务中的实用性。
To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems which provide the illusion of an extended virtual memory via paging between physical memory and disk.
为了能够使用超出有限上下文窗口的上下文,我们提出了虚拟上下文管理,这是一种从传统操作系统中的分层内存系统汲取灵感的技术,这些系统通过物理内存和磁盘之间的分页提供扩展虚拟内存的幻觉。
Using this technique, we introduce MemGPT (MemoryGPT), a system that intelligently manages different storage tiers in order to effectively provide extended context within the LLM’s limited context window.
利用这种技术,我们介绍了MemGPT(MemoryGPT),这是一个能够智能地管理不同存储层次的系统,以便在LLM有限的上下文窗口内有效地提供扩展上下文。
We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM’s context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users.
我们在两个领域评估了我们受操作系统启发的设计,其中现代LLMs的有限上下文窗口严重限制了它们的表现:文档分析,其中MemGPT能够分析远远超过底层LLM上下文窗口的大型文档;多会话聊天,其中MemGPT可以创建能够记住、反思并通过与用户的长期交互动态发展的对话代理。
We release MemGPT code and data for our experiments at https://research.memgpt.ai.
我们在 https://research.memgpt.ai 上发布了MemGPT代码和我们实验的数据。
1. Introduction
In recent years, large language models (LLMs) and their underlying transformer architecture (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022) have become the cornerstone of conversational AI and have led to a wide array of consumer and enterprise applications. Despite these advances, the limited fixed-length context windows used by LLMs significantly hinder their applicability to long conversations or reasoning about long documents. For example, the most widely used open-source LLMs can only support a few dozen back-and-forth messages or reason about a short document before exceeding their maximum input length (Touvron et al., 2023).
In recent years, large language models (LLMs) and their underlying transformer architecture (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022) have become the cornerstone of conversational AI and have led to a wide array of consumer and enterprise applications.
近年来,大型语言模型(LLMs)及其底层的变换器架构(Vaswani等人,2017年;Devlin等人,2018年;Brown等人,2020年;Ouyang等人,2022年)已成为对话式人工智能的基石,并导致了大量消费者和企业应用程序的出现。
Despite these advances, the limited fixed-length context windows used by LLMs significantly hinder their applicability to long conversations or reasoning about long documents.
尽管取得了这些进步,LLMs使用的有限固定长度上下文窗口显著限制了它们在长对话或对长文档进行推理的应用能力。
For example, the most widely used open-source LLMs can only support a few dozen back-and-forth messages or reason about a short document before exceeding their maximum input length (Touvron et al., 2023).
例如,最广泛使用的开源LLMs只能支持几十轮来回消息,或在超出其最大输入长度之前对一篇短文档进行推理(Touvron等人,2023年)。
Directly extending the context length of transformers incurs a quadratic increase in computational time and memory cost due to the transformer architecture’s self-attention mechanism, making the design of new long-context architectures a pressing research challenge (Dai et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020). While developing longer models is an active area of research (Dong et al., 2023), even if we could overcome the computational challenges of context scaling, recent research shows that long-context models struggle to utilize additional context effectively (Liu et al., 2023a). As a consequence, given the considerable resources needed to train state-of-the-art LLMs and diminishing returns of context scaling, there is a critical need for alternative techniques to support long context.
Directly extending the context length of transformers incurs a quadratic increase in computational time and memory cost due to the transformer architecture’s self-attention mechanism, making the design of new long-context architectures a pressing research challenge (Dai et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020).
直接扩展变换器的上下文长度会导致计算时间和内存成本呈二次方增加,这是由于变换器架构的自注意力机制造成的,使得设计新的长上下文架构成为一个紧迫的研究挑战。
While developing longer models is an active area of research (Dong et al., 2023), even if we could overcome the computational challenges of context scaling, recent research shows that long-context models struggle to utilize additional context effectively (Liu et al., 2023a).
尽管开发更长的模型是一个活跃的研究领域,即使我们能够克服上下文扩展的计算挑战,最近的研究表明,长上下文模型在有效利用额外上下文方面存在困难。
As a consequence, given the considerable resources needed to train state-of-the-art LLMs and diminishing returns of context scaling, there is a critical need for alternative techniques to support long context.
因此,考虑到训练最先进LLMs所需的大量资源和上下文扩展的递减收益,迫切需要替代技术来支持长上下文。
In this paper, we study how to provide the illusion of an infinite context while continuing to use fixed-context models. Our approach borrows from the idea of virtual memory paging that was developed to enable applications to work on datasets that far exceed the available memory by paging data between main memory and disk. We leverage the recent progress in function calling abilities of LLM agents (Schick et al., 2023; Liu et al., 2023b) to design MemGPT, an OS-inspired LLM system for virtual context management. Using function calls, LLM agents can read and write to external data sources, modify their own context, and choose when to return responses to the user.
In this paper, we study how to provide the illusion of an infinite context while continuing to use fixed-context models.
在本文中,我们研究了如何在继续使用固定上下文模型的同时提供无限上下文的错觉。
Our approach borrows from the idea of virtual memory paging that was developed to enable applications to work on datasets that far exceed the available memory by paging data between main memory and disk.
我们的方法借鉴了虚拟内存分页的思想,该技术通过在主存和磁盘之间分页数据,使应用程序能够处理远超可用内存的数据集。
We leverage the recent progress in function calling abilities of LLM agents (Schick et al., 2023; Liu et al., 2023b) to design MemGPT, an OS-inspired LLM system for virtual context management.
我们利用LLM代理的功能调用能力的最新进展来设计MemGPT,这是一个受操作系统启发的LLM系统,用于虚拟上下文管理。
Using function calls, LLM agents can read and write to external data sources, modify their own context, and choose when to return responses to the user.
使用函数调用,LLM代理可以读写外部数据源,修改自己的上下文,并选择何时向用户返回响应。
These capabilities allow LLMs to effectively “page” in and out information between context windows (analogous to “main memory” in operating systems) and external storage, similar to hierarchical memory in traditional OSes. In addition, function calls can be leveraged to manage control flow between context management, response generation, and user interactions. This allows for an agent to choose to iteratively modify what is in its context for a single task, thereby more effectively utilizing its limited context.
These capabilities allow LLMs to effectively "page" in and out information between context windows (analogous to "main memory" in operating systems) and external storage, similar to hierarchical memory in traditional OSes.
这些能力允许LLMs在上下文窗口(类似于操作系统中的“主存”)和外部存储之间有效地“分页”进出信息,类似于传统操作系统中的分层内存。
In addition, function calls can be leveraged to manage control flow between context management, response generation, and user interactions.
此外,函数调用可以用来管理上下文管理、响应生成和用户交互之间的控制流程。
This allows for an agent to choose to iteratively modify what is in its context for a single task, thereby more effectively utilizing its limited context.
这使得代理可以针对单个任务迭代地修改其上下文中的内容,从而更有效地利用其有限的上下文。
In MemGPT, we treat context windows as a constrained memory resource, and design a memory hierarchy for LLMs analogous to memory tiers used in traditional OSes (Patterson et al., 1988). Applications in traditional OSes interact with virtual memory, which provides an illusion of there being more memory resources than are actually available in physical (i.e., main) memory by the OS paging overflow data to disk and retrieving data (via a page fault) back into memory when accessed by applications. To provide a similar illusion of longer context length (analogous to virtual memory), we allow the LLM to manage what is placed in its own context (analogous to physical memory) via an ‘LLM OS’, which we call MemGPT. MemGPT enables the LLM to retrieve relevant historical data missing from what is placed in-context, and also evict less relevant data from context and into external storage systems. Figure 3 illustrates the components of MemGPT. The combined use of a memory hierarchy, OS functions, and event-based control flow allows MemGPT to handle unbounded context using LLMs that have finite context windows. To demonstrate the utility of our new OS-inspired LLM system, we evaluate MemGPT on two domains where the performance of existing LLMs is severely limited by finite context: document analysis, where the length of standard text files can quickly exceed the input capacity of modern LLMs, and conversational agents, where LLMs bound by limited conversation windows lack context awareness, persona consistency, and long-term memory during extended conversations. In both settings, MemGPT is able to overcome the limitations of finite context to outperform existing LLM-based approaches.
In MemGPT, we treat context windows as a constrained memory resource, and design a memory hierarchy for LLMs analogous to memory tiers used in traditional OSes.
在MemGPT中,我们将上下文窗口视为受限的内存资源,并为LLMs设计了类似于传统操作系统中使用的内存层次结构。
Applications in traditional OSes interact with virtual memory, which provides an illusion of there being more memory resources than are actually available in physical (i.e., main) memory by the OS paging overflow data to disk and retrieving data (via a page fault) back into memory when accessed by applications.
应用程序在传统操作系统中与虚拟内存交互,虚拟内存通过将溢出数据分页到磁盘,并在应用程序访问时通过页面错误将数据重新检索回内存,从而提供比物理(即主)内存中实际可用的更多的内存资源的错觉。
To provide a similar illusion of longer context length (analogous to virtual memory), we allow the LLM to manage what is placed in its own context (analogous to physical memory) via an ‘LLM OS’, which we call MemGPT.
为了提供类似的更长上下文长度的错觉(类似于虚拟内存),我们允许LLM通过我们称之为MemGPT的“LLM OS”来管理其自己的上下文中放置的内容(类似于物理内存)。
MemGPT enables the LLM to retrieve relevant historical data missing from what is placed in-context, and also evict less relevant data from context and into external storage systems.
MemGPT使LLM能够检索缺失于上下文中的相关历史数据,并且将不太相关的数据从上下文逐出到外部存储系统。
The combined use of a memory hierarchy, OS functions, and event-based control flow allows MemGPT to handle unbounded context using LLMs that have finite context windows.
内存层次结构、操作系统功能和基于事件的控制流程的结合使用,使MemGPT能够使用具有有限上下文窗口的LLMs处理无界上下文。
To demonstrate the utility of our new OS-inspired LLM system, we evaluate MemGPT on two domains where the performance of existing LLMs is severely limited by finite context.
为了展示我们新的受操作系统启发的LLM系统的实用性,我们在现有LLMs性能受有限上下文严重限制的两个领域对MemGPT进行了评估。
In both settings, MemGPT is able to overcome the limitations of finite context to outperform existing LLM-based approaches.
在这两种情况下,MemGPT都能够克服有限上下文的限制,超越现有的基于LLMs的方法。
2. MemGPT (MemoryGPT)
MemGPT’s OS-inspired multi-level memory architecture delineates between two primary memory types: main context (analogous to main memory/physical memory/RAM) and external context (analogous to disk memory/disk storage). Main context consists of the LLM prompt tokens— anything in main context is considered in-context and can be accessed by the LLM processor during inference. External context refers to any information that is held outside of the LLM’s fixed context window. This out-of-context data must always be explicitly moved into main context in order for it to be passed to the LLM processor during inference. MemGPT provides function calls that allow the LLM processor to manage its own memory without any user intervention.
MemGPT’s OS-inspired multi-level memory architecture delineates between two primary memory types: main context (analogous to main memory/physical memory/RAM) and external context (analogous to disk memory/disk storage).
MemGPT受操作系统启发的多级内存架构区分了两种主要的内存类型:主上下文(类似于主存/物理内存/RAM)和外部上下文(类似于磁盘内存/磁盘存储)。
Main context consists of the LLM prompt tokens— anything in main context is considered in-context and can be accessed by the LLM processor during inference.
主上下文由LLM提示令牌组成——任何在主上下文中的内容都被视为上下文内,并可以在推理期间被LLM处理器访问。
External context refers to any information that is held outside of the LLM’s fixed context window.
外部上下文指的是存储在LLMs固定上下文窗口之外的任何信息。
This out-of-context data must always be explicitly moved into main context in order for it to be passed to the LLM processor during inference.
这些上下文外的数据必须明确移动到主上下文,以便在推理期间传递给LLM处理器。
MemGPT provides function calls that allow the LLM processor to manage its own memory without any user intervention.
MemGPT提供了LLM处理器用来管理其自身内存的函数调用,无需任何用户干预。
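下面用一段极简的Python草图示意“主上下文/外部上下文”两级存储与显式换入换出的思路。注意这是本文为便于理解而假设的示意性实现,并非MemGPT官方代码;类名、方法名均为假设,token计数也仅以空格分词粗略估算:

```python
# 示意草图:两级存储(假设性实现,非 MemGPT 官方代码)
class VirtualContext:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens   # 模拟上下文窗口上限
        self.main_context = []         # 上下文内:推理时 LLM 可见
        self.external_context = {}     # 上下文外:必须显式换入才可见

    def tokens_used(self) -> int:
        # 粗略估算:以空格分词代替真实 tokenizer
        return sum(len(m.split()) for m in self.main_context)

    def page_out(self, key: str, text: str) -> None:
        # 将一段信息从主上下文逐出到外部存储
        self.external_context[key] = text
        if text in self.main_context:
            self.main_context.remove(text)

    def page_in(self, key: str) -> None:
        # 将外部存储中的信息显式移回主上下文,才能参与推理
        self.main_context.append(self.external_context[key])
```

与操作系统分页类似,上下文外的数据只有经过 `page_in` 这样的显式移动才会出现在提示令牌中。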
2.1. Main context (prompt tokens)
The prompt tokens in MemGPT are split into three contiguous sections: the system instructions, working context, and FIFO queue. The system instructions are read-only (static) and contain information on the MemGPT control flow, the intended usage of the different memory levels, and instructions on how to use the MemGPT functions (e.g. how to retrieve out-of-context data). Working context is a fixed-size read/write block of unstructured text, writeable only via MemGPT function calls. In conversational settings, working context is intended to be used to store key facts, preferences, and other important information about the user and the persona the agent is adopting, allowing the agent to converse fluently with the user. The FIFO queue stores a rolling history of messages, including messages between the agent and user, as well as system messages (e.g. memory warnings) and function call inputs and outputs. The first index in the FIFO queue stores a system message containing a recursive summary of messages that have been evicted from the queue.
The prompt tokens in MemGPT are split into three contiguous sections: the system instructions, working context, and FIFO Queue.
MemGPT中的提示令牌被分成三个连续的部分:系统指令、工作上下文和FIFO队列。
The system instructions are read-only (static) and contain information on the MemGPT control flow, the intended usage of the different memory levels, and instructions on how to use the MemGPT functions (e.g. how to retrieve out-of-context data).
系统指令是只读的(静态的),包含有关MemGPT控制流程、不同内存层次的预期用途,以及如何使用MemGPT函数的说明(例如如何检索上下文外的数据)。
Working context is a fixed-size read/write block of unstructured text, writeable only via MemGPT function calls.
工作上下文是一个固定大小、可读写的非结构化文本块,只能通过MemGPT函数调用写入。
In conversational settings, working context is intended to be used to store key facts, preferences, and other important information about the user and the persona the agent is adopting, allowing the agent to converse fluently with the user.
在对话设置中,工作上下文旨在用于存储有关用户和代理采纳的角色的关键事实、偏好和其他重要信息,允许代理与用户流利地对话。
The FIFO queue stores a rolling history of messages, including messages between the agent and user, as well as system messages (e.g. memory warnings) and function call inputs and outputs.
FIFO队列存储消息的滚动历史记录,包括代理和用户之间的消息,以及系统消息(例如内存警告)和函数调用的输入输出。
The first index in the FIFO queue stores a system message containing a recursive summary of messages that have been evicted from the queue.
FIFO队列的第一个索引存储包含已从队列中逐出的消息的递归摘要的系统消息。
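按上述三段式结构,主上下文的拼接过程可以示意如下(函数名与字段命名为本文假设,仅用于说明拼接顺序,即:系统指令、工作上下文、以递归摘要开头的FIFO队列):

```python
def build_main_context(system_instructions, working_context, fifo_queue, recursive_summary):
    # 三段拼接:系统指令(只读)+ 工作上下文(可读写)+ FIFO 队列
    # 按论文约定,队列首位放置被逐出消息的递归摘要
    queue = [f"[summary] {recursive_summary}"] + list(fifo_queue)
    return "\n".join([system_instructions, working_context] + queue)
```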
2.2. Queue Manager
The queue manager manages messages in recall storage and the FIFO queue. When a new message is received by the system, the queue manager appends the incoming messages to the FIFO queue, concatenates the prompt tokens and triggers the LLM inference to generate LLM output (the completion tokens). The queue manager writes both the incoming message and the generated LLM output to recall storage (the MemGPT message database). When messages in recall storage are retrieved via a MemGPT function call, the queue manager appends them to the back of the queue to reinsert them into the LLM’s context window.
The queue manager manages messages in recall storage and the FIFO queue.
队列管理器负责管理回忆存储中的消息和FIFO队列。
When a new message is received by the system, the queue manager appends the incoming messages to the FIFO queue, concatenates the prompt tokens and triggers the LLM inference to generate LLM output (the completion tokens).
当系统收到新消息时,队列管理器会将传入的消息追加到FIFO队列,连接提示令牌,并触发LLM推理以生成LLM输出(完成令牌)。
The queue manager writes both the incoming message and the generated LLM output to recall storage (the MemGPT message database).
队列管理器将传入的消息和生成的LLM输出都写入回忆存储(MemGPT消息数据库)。
When messages in recall storage are retrieved via a MemGPT function call, the queue manager appends them to the back of the queue to reinsert them into the LLM’s context window.
当通过MemGPT函数调用检索回忆存储中的消息时,队列管理器会将它们追加到队列的末尾,以将它们重新插入到LLM的上下文窗口中。
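队列管理器“入队 + 写入回忆存储 + 按需召回”的流程可以用如下草图示意(假设性实现:回忆存储用列表代替真实数据库,检索用简单子串匹配代替实际的检索机制,`llm` 为任意可调用对象):

```python
class QueueManager:
    def __init__(self):
        self.fifo_queue = []      # 主上下文中的消息队列
        self.recall_storage = []  # 回忆存储:完整消息历史(此处用列表代替数据库)

    def receive(self, message, llm):
        # 新消息入队并写入回忆存储,随后触发一次 LLM 推理
        self.fifo_queue.append(message)
        self.recall_storage.append(message)
        output = llm(self.fifo_queue)       # 示意:拼接提示令牌并生成完成令牌
        self.recall_storage.append(output)  # 生成的输出同样写入回忆存储
        return output

    def recall(self, query):
        # 检索回忆存储,并把命中的消息追加到队列末尾,使其重新进入上下文窗口
        hits = [m for m in self.recall_storage if query in m]
        self.fifo_queue.extend(hits)
        return hits
```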
The queue manager is also responsible for controlling context overflow via a queue eviction policy. When the prompt tokens exceed the ‘warning token count’ of the underlying LLM’s context window (e.g. 70% of the context window), the queue manager inserts a system message into the queue warning the LLM of an impending queue eviction (a ‘memory pressure’ warning) to allow the LLM to use MemGPT functions to store important information contained in the FIFO queue to working context or archival storage (a read/write database storing arbitrary length text objects). When the prompt tokens exceed the ‘flush token count’ (e.g. 100% of the context window), the queue manager flushes the queue to free up space in the context window: the queue manager evicts a specific count of messages (e.g. 50% of the context window) and generates a new recursive summary using the existing recursive summary and the evicted messages. Once the queue is flushed, the evicted messages are no longer in-context and thus no longer immediately viewable to the LLM; however, they are stored indefinitely in recall storage and readable via MemGPT function calls.
The queue manager is also responsible for controlling context overflow via a queue eviction policy.
队列管理器还负责通过队列逐出策略控制上下文溢出。
When the prompt tokens exceed the ‘warning token count’ of the underlying LLM’s context window (e.g. 70% of the context window), the queue manager inserts a system message into the queue warning the LLM of an impending queue eviction (a ‘memory pressure’ warning).
当提示令牌超过底层LLM的上下文窗口的“警告令牌计数”(例如,上下文窗口的70%),队列管理器会向队列中插入一个系统消息,警告LLM即将进行队列逐出(一个“内存压力”警告)。
This allows the LLM to use MemGPT functions to store important information contained in the FIFO queue to working context or archival storage (a read/write database storing arbitrary length text objects).
这使得LLM可以使用MemGPT函数,将FIFO队列中包含的重要信息存储到工作上下文或归档存储中(一个存储任意长度文本对象的可读写数据库)。
When the prompt tokens exceed the ‘flush token count’ (e.g. 100% of the context window), the queue manager flushes the queue to free up space in the context window: the queue manager evicts a specific count of messages (e.g. 50% of the context window),
当提示令牌超过“刷新令牌计数”(例如,上下文窗口的100%),队列管理器会刷新队列以释放上下文窗口中的空间:队列管理器逐出特定数量的消息(例如,上下文窗口的50%),
and generates a new recursive summary using the existing recursive summary and the evicted messages.
并使用现有的递归摘要和逐出的消息生成一个新的递归摘要。
Once the queue is flushed, the evicted messages are no longer in-context and thus no longer immediately viewable to the LLM; however, they are stored indefinitely in recall storage and readable via MemGPT function calls.
一旦队列被刷新,被逐出的消息就不再处于上下文中,且立即对LLM不可见,但它们被无限期地存储在回忆存储中,并且可以通过MemGPT函数调用来读取。
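上述双阈值逐出策略可以示意如下。其中70%/100%的阈值和“逐出约一半消息”是论文给出的示例值;函数签名、`queue[0]` 存放现有递归摘要的约定,以及 `summarize` 回调均为本文假设:

```python
def manage_overflow(queue, token_count, warning_tokens, flush_tokens, summarize):
    # 示意性的队列逐出策略(非官方实现);约定 queue[0] 存放现有递归摘要
    if token_count >= flush_tokens:
        # 刷新:逐出约一半消息,并用旧摘要 + 被逐出消息生成新的递归摘要
        old_summary, rest = queue[0], queue[1:]
        half = len(rest) // 2
        evicted, kept = rest[:half], rest[half:]
        return [summarize(old_summary, evicted)] + kept, evicted
    if token_count >= warning_tokens:
        # 内存压力警告:提示模型先把重要信息写入工作上下文或归档存储
        return queue + ["[system] memory pressure: queue eviction imminent"], []
    return queue, []
```

被逐出的消息不再出现在返回的队列中,但在真实系统里它们仍保留在回忆存储中,可通过函数调用读回。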
2.3. Function executor (handling of completion tokens)
MemGPT orchestrates data movement between main context and external context via function calls that are generated by the LLM processor. Memory edits and retrieval are entirely self-directed: MemGPT autonomously updates and searches through its own memory based on the current context. For instance, it can decide when to move items between contexts (e.g. when the conversation history is becoming too long, as shown in Figure 1) and modify its main context to better reflect its evolving understanding of its current objectives and responsibilities (as shown in Figure 3). We implement self-directed editing and retrieval by providing explicit instructions within the system instructions that guide the LLM on how to interact with the MemGPT memory systems. These instructions comprise two main components: (1) a detailed description of the memory hierarchy and their respective utilities, and (2) a function schema (complete with their natural language descriptions) that the system can call to access or modify its memory.
MemGPT orchestrates data movement between main context and external context via function calls that are generated by the LLM processor.
MemGPT通过由LLM处理器生成的函数调用来协调主上下文和外部上下文之间的数据移动。
Memory edits and retrieval are entirely self-directed: MemGPT autonomously updates and searches through its own memory based on the current context.
内存编辑和检索完全自主导向:MemGPT根据当前上下文自主更新和搜索自己的内存。
For instance, it can decide when to move items between contexts (e.g. when the conversation history is becoming too long, as shown in Figure 1) and modify its main context to better reflect its evolving understanding of its current objectives and responsibilities (as shown in Figure 3).
例如,它可以决定何时在上下文之间移动项目(如图1所示,当对话历史变得太长时),并修改其主上下文以更好地反映其对当前目标和责任的不断演变的理解(如图3所示)。
We implement self-directed editing and retrieval by providing explicit instructions within the system instructions that guide the LLM on how to interact with the MemGPT memory systems.
我们通过在系统指令中提供明确的指令来实现自主编辑和检索,这些指令指导LLM如何与MemGPT内存系统交互。
These instructions comprise two main components: (1) a detailed description of the memory hierarchy and their respective utilities, and (2) a function schema (complete with their natural language descriptions) that the system can call to access or modify its memory.
这些指令包括两个主要部分:(1)内存层次结构及其各自用途的详细描述,以及(2)系统可以调用的函数模式(包括它们的自然语言描述),以访问或修改其内存。
During each inference cycle, the LLM processor takes main context (concatenated into a single string) as input, and generates an output string. This output string is parsed by MemGPT to ensure correctness, and if the parser validates the function arguments the function is executed. The results, including any runtime errors that occur (e.g. trying to add to main context when it is already at maximum capacity), are then fed back to the processor by MemGPT. This feedback loop enables the system to learn from its actions and adjust its behavior accordingly. Awareness of context limits is a key aspect in making the self-editing mechanism work effectively; to this end, MemGPT prompts the processor with warnings regarding token limitations to guide its memory management decisions. Additionally, our memory retrieval mechanisms are designed to be cognizant of these token constraints and implement pagination to prevent retrieval calls from overflowing the context window.
During each inference cycle, the LLM processor takes main context (concatenated into a single string) as input, and generates an output string.
在每个推理周期中,LLM处理器将主上下文(串联成一个单一字符串)作为输入,并生成一个输出字符串。
This output string is parsed by MemGPT to ensure correctness, and if the parser validates the function arguments the function is executed.
这个输出字符串由MemGPT解析以确保正确性,如果解析器验证了函数参数,则执行该函数。
The results, including any runtime errors that occur (e.g. trying to add to main context when it is already at maximum capacity), are then fed back to the processor by MemGPT.
然后,结果(包括发生的任何运行时错误,例如尝试在主上下文已达到最大容量时添加内容)由MemGPT反馈给处理器。
This feedback loop enables the system to learn from its actions and adjust its behavior accordingly.
这个反馈循环使系统能够从其行动中学习并相应调整其行为。
Awareness of context limits is a key aspect in making the self-editing mechanism work effectively; to this end, MemGPT prompts the processor with warnings regarding token limitations to guide its memory management decisions.
对上下文限制的认识是使自编辑机制有效工作的关键方面,为此,MemGPT通过有关令牌限制的警告提示处理器,以指导其内存管理决策。
Additionally, our memory retrieval mechanisms are designed to be cognizant of these token constraints and implement pagination to prevent retrieval calls from overflowing the context window.
此外,我们的内存检索机制被设计为意识到这些令牌限制,并实现分页以防止检索调用溢出上下文窗口。
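“解析—校验—执行—错误反馈”这一循环可以用如下草图示意。函数表的内容和JSON调用格式均为本文假设,重点在于说明:运行时错误同样作为结果反馈给处理器,供其自我纠正:

```python
import json

# 假设的函数表:真实系统中这里是一组内存管理函数及其自然语言描述
FUNCTIONS = {"core_memory_append": lambda text: f"appended: {text}"}

def execute_completion(output: str) -> str:
    # 解析 LLM 生成的输出;校验通过则执行对应函数,
    # 任何解析/运行时错误都以文本形式返回,反馈给处理器
    try:
        call = json.loads(output)
        fn = FUNCTIONS[call["function"]]
        return fn(**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError) as err:
        return f"[error] {err!r}"
```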
2.4. Control flow and function chaining
In MemGPT, events trigger LLM inference: events are generalized inputs to MemGPT and can consist of user messages (in chat applications), system messages (e.g. main context capacity warnings), user interactions (e.g. an alert that a user just logged in, or an alert that they finished uploading a document), and timed events that are run on a regular schedule (allowing MemGPT to run ‘unprompted’ without user intervention). MemGPT processes events with a parser to convert them into plain text messages that can be appended to main context and eventually be fed as input into the LLM processor.
In MemGPT, events trigger LLM inference: events are generalized inputs to MemGPT and can consist of user messages (in chat applications), system messages (e.g. main context capacity warnings), user interactions (e.g. an alert that a user just logged in, or an alert that they finished uploading a document), and timed events that are run on a regular schedule (allowing MemGPT to run ‘unprompted’ without user intervention).
在MemGPT中,事件触发LLM推理:事件是MemGPT的通用输入,可以包括用户消息(在聊天应用中)、系统消息(例如,主上下文容量警告)、用户交互(例如,提醒用户刚刚登录的通知,或者他们完成了文件上传的通知),以及定期运行的定时事件(允许MemGPT在没有用户干预的情况下“自发”运行)。
MemGPT processes events with a parser to convert them into plain text messages that can be appended to main context and eventually be fed as input into the LLM processor.
MemGPT使用解析器处理事件,将它们转换为可以追加到主上下文、并最终作为输入馈送给LLM处理器的纯文本消息。
Many practical tasks require calling multiple functions in sequence, for example, navigating through multiple pages of results from a single query or collating data from different documents in main context from separate queries. Function chaining allows MemGPT to execute multiple function calls sequentially before returning control to the user. In MemGPT, functions can be called with a special flag that requests control be immediately returned to the processor after the requested function completes execution. If this flag is present, MemGPT will add the function output to main context and continue processor execution (as opposed to pausing processor execution). If this flag is not present (a yield), MemGPT will not run the LLM processor until the next external event trigger (e.g. a user message or scheduled interrupt).
Many practical tasks require calling multiple functions in sequence, for example, navigating through multiple pages of results from a single query or collating data from different documents in main context from separate queries.
许多实际任务需要顺序调用多个函数,例如,通过单一查询浏览多个结果页面,或者从主上下文中的不同文档中整理来自不同查询的数据。
Function chaining allows MemGPT to execute multiple function calls sequentially before returning control to the user.
函数链式调用允许MemGPT在将控制权交还给用户之前,顺序执行多个函数调用。
In MemGPT, functions can be called with a special flag that requests control be immediately returned to the processor after the requested function completes execution.
在MemGPT中,可以用一个特殊标志调用函数,该标志请求在请求的函数完成执行后立即将控制权交回给处理器。
If this flag is present, MemGPT will add the function output to main context and continue processor execution (as opposed to pausing processor execution).
如果存在这个标志,MemGPT会将函数输出添加到主上下文并继续运行处理器(而不是暂停处理器执行)。
If this flag is not present (a yield), MemGPT will not run the LLM processor until the next external event trigger (e.g. a user message or scheduled interrupt).
如果这个标志不存在(即yield),在下一个外部事件触发(例如用户消息或预定中断)之前,MemGPT将不会运行LLM处理器。
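这种带“继续执行”标志的函数链式调用控制流可以示意如下。`llm_step` 返回“(输出, 是否继续)”的约定为本文假设,对应上文的特殊标志与yield两种情况:

```python
def run_agent(events, llm_step):
    # llm_step(context) 返回 (输出, 继续标志):
    # 标志为真时,把函数输出追加进上下文并立即再次调用处理器(函数链式调用);
    # 标志为假时 yield,等待下一个外部事件触发
    context, outputs = [], []
    for event in events:
        context.append(event)
        while True:
            out, chain = llm_step(context)
            outputs.append(out)
            if not chain:
                break          # yield:控制权交还,直到下一个事件
            context.append(out)  # 输出进入主上下文,链式继续执行
    return outputs
```

例如,模型可以先连续翻阅两页检索结果,再生成最终回复,期间无需用户介入。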
3. Experiments
We assess MemGPT in two long-context domains: conversational agents and document analysis. For conversational agents, we expand the existing Multi-Session Chat dataset (Xu et al., 2021) and introduce two new dialogue tasks that evaluate an agent’s ability to retain knowledge across long conversations. For document analysis, we benchmark MemGPT on existing tasks from (Liu et al., 2023a) for question answering and key-value retrieval over lengthy documents. We also propose a new nested key-value retrieval task requiring collating information across multiple data sources, which tests the ability of an agent to collate information from multiple data sources (multi-hop retrieval). We publicly release our augmented MSC dataset, nested KV retrieval dataset, and a dataset of embeddings for 20M Wikipedia articles to facilitate future research. Our code for the benchmarks is available at https://research.memgpt.ai.
We assess MemGPT in two long-context domains: conversational agents and document analysis.
我们在两个长上下文领域评估了MemGPT:对话代理和文档分析。
For conversational agents, we expand the existing Multi-Session Chat dataset (Xu et al., 2021) and introduce two new dialogue tasks that evaluate an agent’s ability to retain knowledge across long conversations.
对于对话代理,我们扩展了现有的多会话聊天数据集(Xu等人,2021年),并引入了两个新的对话任务,以评估代理在长对话中保持知识的能力。
For document analysis, we benchmark MemGPT on existing tasks from (Liu et al., 2023a) for question answering and key-value retrieval over lengthy documents.
对于文档分析,我们在(Liu等人,2023a)的现有任务中对MemGPT进行了基准测试,这些任务涉及对长文档的问题回答和键值检索。
We also propose a new nested key-value retrieval task requiring collating information across multiple data sources, which tests the ability of an agent to collate information from multiple data sources (multi-hop retrieval).
我们还提出了一个新的嵌套键值检索任务,该任务需要从多个数据源整理信息,这测试了代理从多个数据源整理信息的能力(多跳检索)。
We publicly release our augmented MSC dataset, nested KV retrieval dataset, and a dataset of embeddings for 20M Wikipedia articles to facilitate future research.
我们公开发布了我们的增强型多会话聊天数据集、嵌套键值检索数据集以及2000万维基百科文章的嵌入数据集,以促进未来的研究。
Our code for the benchmarks is available at https://research.memgpt.ai.
我们的基准测试代码可在 https://research.memgpt.ai 上获取。
Implementation details. When discussing OpenAI models, unless otherwise specified ‘GPT-4 Turbo’ refers to the specific gpt-4-1106-preview model endpoint (context window of 128,000), ‘GPT-4’ refers to gpt-4-0613 (context window of 8,192), and ‘GPT-3.5 Turbo’ refers to gpt-3.5-turbo-1106 (context window of 16,385). In experiments, we run MemGPT with all baseline models (GPT-4, GPT-4 Turbo, and GPT-3.5) to show how the underlying model performance affects MemGPT’s performance.
Implementation details.
实现细节。
When discussing OpenAI models, unless otherwise specified ‘GPT-4 Turbo’ refers to the specific gpt-4-1106-preview model endpoint (context window of 128,000), ‘GPT-4’ refers to gpt-4-0613 (context window of 8,192), and ‘GPT-3.5 Turbo’ refers to gpt-3.5-turbo-1106 (context window of 16,385).
在讨论OpenAI模型时,除非另有说明,“GPT-4 Turbo”指的是特定的gpt-4-1106-preview模型端点(上下文窗口为128,000),“GPT-4”指的是gpt-4-0613(上下文窗口为8,192),“GPT-3.5 Turbo”指的是gpt-3.5-turbo-1106(上下文窗口为16,385)。
In experiments, we run MemGPT with all baseline models (GPT-4, GPT-4 Turbo, and GPT-3.5) to show how the underlying model performance affects MemGPT’s performance.
在实验中,我们使用所有基线模型(GPT-4、GPT-4 Turbo和GPT-3.5)运行MemGPT,以展示底层模型性能如何影响MemGPT的性能。
3.1. MemGPT for conversational agents
Conversational agents like virtual companions and personalized assistants aim to engage users in natural, long-term interactions, potentially spanning weeks, months, or even years. This creates challenges for models with fixed-length contexts, which can only reference a limited history of the conversation. An ‘infinite context’ agent should seamlessly handle continuous exchanges without boundary or reset. When conversing with a user, such an agent must satisfy two key criteria: (1) Consistency – The agent should maintain conversational coherence. New facts, preferences, and events mentioned should align with prior statements from both the user and agent. (2) Engagement – The agent should draw on long-term knowledge about the user to personalize responses. Referencing prior conversations makes dialogue more natural and engaging.
Conversational agents like virtual companions and personalized assistants aim to engage users in natural, long-term interactions, potentially spanning weeks, months, or even years.
对话代理,如虚拟伴侣和个性化助手,旨在与用户进行自然、长期的互动,可能跨越数周、数月甚至数年。
This creates challenges for models with fixed-length contexts, which can only reference a limited history of the conversation.
这对固定长度上下文的模型构成了挑战,因为它们只能引用有限的对话历史。
An ‘infinite context’ agent should seamlessly handle continuous exchanges without boundary or reset.
一个“无限上下文”代理应该能够无缝处理连续的交流,没有界限或重置。
When conversing with a user, such an agent must satisfy two key criteria:
与用户对话时,这样的代理必须满足两个关键标准:
(1) Consistency - The agent should maintain conversational coherence. New facts, preferences, and events mentioned should align with prior statements from both the user and agent.
(1) 一致性 - 代理应保持对话的连贯性。新提到的事实、偏好和事件应与用户和代理之前的陈述一致。
(2) Engagement - The agent should draw on long-term knowledge about the user to personalize responses. Referencing prior conversations makes dialogue more natural and engaging.
(2) 参与度 - 代理应利用对用户的长期了解来个性化回应。引用之前的对话可以使对话更加自然和吸引人。
We therefore assess our proposed system, MemGPT, on these two criteria: (1) Does MemGPT leverage its memory to improve conversation consistency? Can it remember relevant facts, preferences, and events from past interactions to maintain coherence? (2) Does MemGPT produce more engaging dialogue by taking advantage of memory? Does it spontaneously incorporate long-range user information to personalize messages? By evaluating on consistency and engagement, we can determine how well MemGPT handles the challenges of long-term conversational interaction compared to fixed-context baselines. Its ability to satisfy these criteria will demonstrate whether unbounded context provides meaningful benefits for conversational agents.
We therefore assess our proposed system, MemGPT, on these two criteria:
因此,我们根据这两个标准评估我们提出的系统MemGPT:
(1) Does MemGPT leverage its memory to improve conversation consistency? Can it remember relevant facts, preferences, and events from past interactions to maintain coherence?
(1) MemGPT是否利用其记忆来提高对话一致性?它能否记住过去互动中相关的事实、偏好和事件以保持连贯性?
(2) Does MemGPT produce more engaging dialogue by taking advantage of memory? Does it spontaneously incorporate long-range user information to personalize messages?
(2) MemGPT是否通过利用记忆产生更具吸引力的对话?它是否能够自发地整合长期用户信息来个性化消息?
By evaluating on consistency and engagement, we can determine how well MemGPT handles the challenges of long-term conversational interaction compared to fixed-context baselines.
通过评估一致性和参与度,我们可以确定MemGPT与固定上下文基线相比,处理长期对话互动的挑战能力如何。
Its ability to satisfy these criteria will demonstrate whether unbounded context provides meaningful benefits for conversational agents.
它满足这些标准的能力将展示无界上下文是否为对话代理提供了有意义的好处。
Dataset. We evaluate MemGPT and our fixed-context baselines on the Multi-Session Chat (MSC) dataset introduced by Xu et al. (2021), which contains multi-session chat logs generated by human labelers, each of whom was asked to play a consistent persona for the duration of all sessions. Each multi-session chat in MSC has five total sessions, and each session consists of roughly a dozen messages. As part of our consistency experiments, we created a new session (session 6) that contains a single question-answer response pair between the same two personas.
Dataset. We evaluate MemGPT and our fixed-context baselines on the Multi-Session Chat (MSC) dataset introduced by Xu et al. (2021), which contains multi-session chat logs generated by human labelers, each of whom was asked to play a consistent persona for the duration of all sessions.
数据集。我们在Xu等人(2021年)引入的多会话聊天(MSC)数据集上评估MemGPT和我们的固定上下文基线,该数据集包含由人类标注者生成的多会话聊天记录,每位标注者都被要求在所有会话中扮演一个一致的角色。
Each multi-session chat in MSC has five total sessions, and each session consists of roughly a dozen messages.
MSC中的每一组多会话聊天都有五个会话,每个会话大约包含十几个消息。
As part of our consistency experiments, we created a new session (session 6) that contains a single question-answer response pair between the same two personas.
作为我们一致性实验的一部分,我们创建了一个新的会话(会话6),其中包含相同两个角色之间的单个问答对。
3.1.1. DEEP MEMORY RETRIEVAL TASK (CONSISTENCY).
深度记忆检索任务(一致性)
We introduce a new ‘deep memory retrieval’ (DMR) task based on the MSC dataset designed to test the consistency of a conversational agent. In DMR, the conversational agent is asked a question by the user that explicitly refers back to a prior conversation and has a very narrow expected answer range. We generated the DMR question-answer (QA) pairs using a separate LLM that was instructed to write a question from one user to another that could only be answered correctly using knowledge gained from the past sessions (see Appendix for further details).
We introduce a new ‘deep memory retrieval’ (DMR) task based on the MSC dataset designed to test the consistency of a conversational agent.
我们引入了一个新的“深度记忆检索”(DMR)任务,基于MSC数据集设计,旨在测试对话代理的一致性。
In DMR, the conversational agent is asked a question by the user that explicitly refers back to a prior conversation and has a very narrow expected answer range.
在DMR中,用户向对话代理提出一个问题,该问题明确回顾了之前的对话,并且期望的答案范围非常狭窄。
We generated the DMR question-answer (QA) pairs using a separate LLM that was instructed to write a question from one user to another that could only be answered correctly using knowledge gained from the past sessions (see Appendix for further details).
我们使用一个单独的大型语言模型(LLM)生成了DMR问答(QA)对,该模型被指示编写一个由一位用户向另一位用户提出的问题,该问题只有利用从过去会话中获得的知识才能正确回答(详见附录)。
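A QA-pair generation prompt along these lines might be assembled as follows. This is a hypothetical sketch; the paper's exact instructions to the generating LLM are given in its Appendix, and the function and argument names here are ours:

```python
def build_dmr_prompt(past_sessions, persona_a, persona_b):
    # Concatenate prior sessions and ask the generating LLM for a question
    # that is only answerable from that history, plus its gold answer.
    transcript = "\n\n".join(past_sessions)
    return (
        f"You are {persona_a}, chatting with {persona_b}. "
        "Below are your past chat sessions.\n\n"
        f"{transcript}\n\n"
        "Write one question for them that can only be answered correctly "
        "using knowledge from the past sessions, then give the expected answer."
    )
```

The resulting string would be sent to the generating LLM, whose question/answer output forms one DMR test case.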
We evaluate the quality of the generated response against the ‘gold response’ using ROUGE-L scores (Lin, 2004) and an ‘LLM judge’, which is instructed to evaluate whether or not the generated response is consistent with the gold response (GPT-4 has been shown to have high agreement with human evaluators (Zheng et al., 2023)). In practice, we notice that the generated responses (from both MemGPT and the baselines) were generally more verbose than the gold responses. We use the ROUGE-L recall (R) metric to account for the verbosity of the generated agent replies compared to the relatively short gold answer labels.
We evaluate the quality of the generated response against the ‘gold response’ using ROUGE-L scores (Lin, 2004) and an ‘LLM judge’,
我们使用ROUGE-L分数(Lin, 2004)和一个“LLM裁判”,对照“标准回应”评估生成回应的质量,
which is instructed to evaluate whether or not the generated response is consistent with the gold response (GPT-4 has been shown to have high agreement with human evaluators (Zheng et al., 2023)).
该裁判被指导评估生成的回应是否与标准回应一致(GPT-4已被证明与人类评估者具有高度一致性)。
In practice, we notice that the generated responses (from both MemGPT and the baselines) were generally more verbose than the gold responses.
在实践中,我们注意到生成的回应(来自MemGPT和基线模型)通常比标准回应更加冗长。
We use the ROUGE-L recall (R) metric to account for the verbosity of the generated agent replies compared to the relatively short gold answer labels.
我们使用ROUGE-L召回率(R)指标,以兼顾生成的代理回应相对于较短的标准答案标签通常更为冗长这一情况。
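ROUGE-L recall is the length of the longest common subsequence (LCS) between the generated and gold token sequences, divided by the length of the gold answer. A minimal implementation could look like this (our own sketch of the metric, not the evaluation code used in the paper):

```python
def lcs_length(a, b):
    # standard dynamic-programming LCS over two token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(generated, gold):
    gen, ref = generated.lower().split(), gold.lower().split()
    # recall divides by the gold length, so a verbose generation is not
    # penalized as long as it contains the gold tokens in order
    return lcs_length(gen, ref) / len(ref) if ref else 0.0
```

Because the denominator is the gold length, a long generated reply that fully contains the short gold answer still scores 1.0, which is exactly why recall (rather than precision or F1) suits verbose agent replies.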
MemGPT utilizes memory to maintain coherence: Table 2 shows the performance of MemGPT vs the fixed-memory baselines. We compare MemGPT using different underlying LLMs, and compare against using the base LLM without MemGPT as a baseline. The baselines are able to see a lossy summarization of the past five conversations to mimic an extended recursive summarization procedure, while MemGPT instead has access to the full conversation history but must access it via paginated search queries to recall memory (in order to bring relevant memories into main context). In this task, we see that MemGPT clearly improves the performance of the underlying base LLM: there is a clear drop in both accuracy and ROUGE scores when going from MemGPT to the corresponding LLM baselines.
MemGPT utilizes memory to maintain coherence: Table 2 shows the performance of MemGPT vs the fixed-memory baselines.
MemGPT利用记忆来维持连贯性:表2显示了MemGPT与固定内存基线的性能对比。
We compare MemGPT using different underlying LLMs, and compare against using the base LLM without MemGPT as a baseline.
我们比较了使用不同底层LLMs的MemGPT,并将其与没有MemGPT的基础LLM作为基线进行比较。
The baselines are able to see a lossy summarization of the past five conversations to mimic an extended recursive summarization procedure, while MemGPT instead has access to the full conversation history but must access it via paginated search queries to recall memory (in order to bring relevant memories into main context).
基线能够看到一个过去五次对话的有损摘要,以模仿扩展的递归摘要过程,而MemGPT则可以访问完整的对话历史,但必须通过分页搜索查询来访问记忆(以便将相关记忆带入主上下文)。
In this task, we see that MemGPT clearly improves the performance of the underlying base LLM: there is a clear drop in both accuracy and ROUGE scores when going from MemGPT to the corresponding LLM baselines.
在这项任务中,我们发现MemGPT明显提高了底层基础LLM的性能:当从MemGPT转向相应的LLM基线时,准确性和ROUGE分数都有明显的下降。
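The paginated recall just described can be sketched as follows: the agent queries its full archived history but receives results one page at a time, so each page fits in the limited main context. The function and field names here are ours for illustration, not MemGPT's actual API:

```python
def paginated_search(recall_storage, query, page=0, page_size=5):
    # Search the full archived conversation for matching messages and
    # return one page of results small enough to fit in main context,
    # along with the total hit count so the agent knows to page further.
    hits = [m for m in recall_storage if query.lower() in m["text"].lower()]
    start = page * page_size
    return hits[start:start + page_size], len(hits)
```

The agent can issue follow-up calls with increasing `page` values until it finds the memory it needs, trading extra tool calls for access to an effectively unbounded history.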
3.1.2. CONVERSATION OPENER TASK (ENGAGEMENT).
对话开场任务(参与度)
In the ‘conversation opener’ task we evaluate an agent’s ability to craft engaging messages to the user that draw from knowledge accumulated in prior conversations. To evaluate the ‘engagingness’ of a conversation opener using the MSC dataset, we compare the generated opener to the gold personas: an engaging conversation opener should draw from one (or several) of the data points contained in the persona, which in MSC effectively summarize the knowledge accumulated throughout all prior sessions. We also compare to the human-generated gold opener, i.e., the first response in the following session. We report the CSIM scores of MemGPT’s openers in Table 3. We test several variations of MemGPT using different base LLMs.
In the ‘conversation opener’ task we evaluate an agent’s ability to craft engaging messages to the user that draw from knowledge accumulated in prior conversations.
在“对话开场”任务中,我们评估代理根据之前对话中积累的知识,制作吸引用户的个性化消息的能力。
To evaluate the ‘engagingness’ of a conversation opener using the MSC dataset, we compare the generated opener to the gold personas: an engaging conversation opener should draw from one (or several) of the data points contained in the persona, which in MSC effectively summarize the knowledge accumulated throughout all prior sessions.
为了使用MSC数据集评估对话开场的“吸引力”,我们将生成的开场与标准角色进行比较:一个有吸引力的对话开场应该利用角色中包含的一个(或几个)数据点,在MSC中,这有效地总结了所有之前会话中积累的知识。
We also compare to the human-generated gold opener, i.e., the first response in the following session.
我们还将其与人类生成的标准开场进行比较,即下一场会话中的第一条回应。
We report the CSIM scores of MemGPT’s openers in Table 3.
我们在表3中报告了MemGPT开场白的CSIM分数。
We test several variations of MemGPT using different base LLMs.
我们测试了使用不同基础LLMs的几个MemGPT变体。
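CSIM here is a cosine-similarity score between embeddings of the generated opener and the reference (persona or gold opener). Given precomputed embedding vectors, the core computation reduces to the following (our own sketch; the toy vectors in the test stand in for real sentence embeddings):

```python
import math

def csim(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

A score near 1.0 indicates the opener is semantically close to the persona facts or the human-written gold opener.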
MemGPT utilizes memory to increase engagement: As seen in Table 3, MemGPT is able to craft engaging openers that perform similarly to and occasionally exceed the hand-written human openers. We observe that MemGPT tends to craft openers that are both more verbose and cover more aspects of the persona information than the human baseline. Additionally, we can see that storing information in working context is key to generating engaging openers.
MemGPT utilizes memory to increase engagement: As seen in Table 3, MemGPT is able to craft engaging openers that perform similarly to and occasionally exceed the hand-written human openers.
MemGPT利用记忆来提高参与度:如表3所示,MemGPT能够制作出与人工编写的开场白表现相当、有时甚至更胜一筹的吸引人开场白。
We observe that MemGPT tends to craft openers that are both more verbose and cover more aspects of the persona information than the human baseline.
我们观察到MemGPT倾向于制作比人类基线更加冗长并且涵盖角色信息更多方面的开场白。
Additionally, we can see that storing information in working context is key to generating engaging openers.
此外,我们可以看到在工作上下文中存储信息对于生成吸引人的开场白至关重要。
3.2. MemGPT for document analysis
Document analysis also faces challenges due to the limited context windows of today’s transformer models. As shown in Table 1, both open and closed source models suffer from constrained context length (up to 128k tokens for OpenAI’s models). However, many documents easily surpass these lengths; for example, legal or financial documents such as Annual Reports (SEC Form 10-K) can easily pass the million-token mark. Moreover, many real document analysis tasks require drawing connections across multiple such lengthy documents. Anticipating these scenarios, it becomes difficult to envision blindly scaling up context as a solution to the fixed-context problem. Recent research (Liu et al., 2023a) also raises doubts about the utility of simply scaling contexts, since they find uneven attention distributions in large context models (the model is more capable of recalling information at the beginning or end of its context window, vs tokens in the middle). To enable reasoning across documents, more flexible memory architectures like MemGPT are needed.
Document analysis also faces challenges due to the limited context windows of today’s transformer models.
文档分析也面临着当今变换器模型有限上下文窗口的挑战。
As shown in Table 1, both open and closed source models suffer from constrained context length (up to 128k tokens for OpenAI’s models).
如表1所示,无论是开源还是闭源模型,都受到上下文长度限制(OpenAI模型最多可达128k个令牌)。
However, many documents easily surpass these lengths; for example, legal or financial documents such as Annual Reports (SEC Form 10-K) can easily pass the million-token mark.
然而,许多文档很容易超过这些长度;例如,法律或财务文件,如年度报告(SEC表格10-K),很容易超过百万个令牌的标记。
Moreover, many real document analysis tasks require drawing connections across multiple such lengthy documents.
此外,许多真实的文档分析任务需要在多个这样冗长的文档之间建立联系。
Anticipating these scenarios, it becomes difficult to envision blindly scaling up context as a solution to the fixed-context problem.
面对这些情况,很难设想仅靠盲目扩展上下文来解决固定上下文问题。
Recent research (Liu et al., 2023a) also raises doubts about the utility of simply scaling contexts, since they find uneven attention distributions in large context models (the model is more capable of recalling information at the beginning or end of its context window, vs tokens in the middle).
最近的研究(Liu等人,2023a)也对简单地扩展上下文的效用提出了疑问,因为他们发现大型上下文模型中的注意力分布不均(模型更能够回忆起在其上下文窗口开始或结束时的信息,而不是中间的令牌)。
To enable reasoning across documents, more flexible memory architectures like MemGPT are needed.
为了实现跨文档的推理,需要像MemGPT这样更灵活的内存架构。
3.2.1. MULTI-DOCUMENT QUESTION-ANSWERING.
多文档问答任务
To evaluate MemGPT’s ability to analyze documents, we benchmark MemGPT against fixed-context baselines on the retriever-reader document QA task from Liu et al. (2023a). In this task, a question is selected from the NaturalQuestions-Open dataset, and a retriever selects relevant Wikipedia documents for the question. A reader model (the LLM) is then fed these documents as input, and is asked to use the provided documents to answer the question. Similar to Liu et al. (2023a), we evaluate reader accuracy as the number of retrieved documents K increases.
To evaluate MemGPT’s ability to analyze documents, we benchmark MemGPT against fixed-context baselines on the retriever-reader document QA task from Liu et al. (2023a).
为了评估MemGPT分析文档的能力,我们将MemGPT与固定上下文基线在Liu等人(2023a)提出的检索器-阅读器文档问答任务上进行基准测试。
In this task, a question is selected from the NaturalQuestions-Open dataset, and a retriever selects relevant Wikipedia documents for the question.
在这个任务中,从NaturalQuestions-Open数据集中选取一个问题,然后检索器为这个问题选择相关的维基百科文档。
A reader model (the LLM) is then fed these documents as input, and is asked to use the provided documents to answer the question.
然后,阅读器模型(LLM)被输入这些文档,并被要求使用提供的文档回答问题。
Similar to Liu et al. (2023a), we evaluate reader accuracy as the number of retrieved documents K increases.
与Liu等人(2023a)类似,我们评估随着检索到的文档数量K的增加,阅读器的准确性。
In our evaluation setup, both the fixed-context baselines and MemGPT use the same retriever, which selects the top-K documents using similarity search (cosine distance) on OpenAI’s text-embedding-ada-002 embeddings. We use MemGPT’s default storage
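The retrieval step just described (top-K by cosine similarity over precomputed embeddings) can be sketched as follows. The vectors in the test are toy stand-ins for real text-embedding-ada-002 embeddings, and the function names are ours:

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_documents(query_emb, doc_embs, k=5):
    # rank document indices by similarity to the query, highest first
    ranked = sorted(range(len(doc_embs)),
                    key=lambda i: cosine(query_emb, doc_embs[i]),
                    reverse=True)
    return ranked[:k]
```

In practice a vector database would replace the brute-force scan, but the ranking criterion is the same.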
文章来源:
Author:司马他
link:https://www.congcong.us/post/memgpt-towards-llms-as-operating-systems.html