AI & ML Open Release

The first large-scale benchmark for LLM agents based on years of authentic, cross-domain user behavioral data rather than synthetic personas.

March 30, 2026

Original Paper

MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, Philip S. Yu

arXiv · 2603.25973

The Takeaway

Most "long-memory" benchmarks rely on toy tasks or short synthetic sessions. MemoryCD provides a realistic testbed for lifelong personalization, showing that current state-of-the-art models fail to maintain user context across disparate domains (e.g., shopping vs. healthcare) over time.

From the abstract

Recent advancements in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic …