Same Content, Different Representations: A Controlled Study for Table QA
Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data