cs.CL papers | Gist.Science

Interactive Benchmarks

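This paper proposes Interactive Benchmarks, a new framework that replaces conventional benchmarks, which suffer from problems such as saturation and subjectivity, by evaluating a model's reasoning ability through an interactive process under budget constraints. Experiments on logic and mathematical proofs and on strategic games show why evaluating model intelligence in interactive scenarios matters and how much room for improvement remains.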

Baoqing Yue, Zihan Zhu, Yifan Zhang + 3 more | 2026-03-06 | 💻 cs

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

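This paper proposes IF-RewardBench, a new meta-evaluation benchmark that overcomes the limitations of existing evaluation methods by assessing multiple responses in a listwise format, and demonstrates that it correlates more strongly with downstream task performance.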

Bosi Wen, Yilin Niu, Cunxiang Wang + 5 more | 2026-03-06 | 💻 cs

DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

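This paper proposes DARE, a lightweight retrieval model that fuses statistical data-distribution information with function metadata, together with RPKB, a large-scale R package knowledge base, and shows that they substantially improve LLM agents' code generation accuracy and statistical analysis task success rate in the R ecosystem.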

Maojun Sun, Yue Wu, Yifei Xie + 5 more | 2026-03-06 | 💻 cs

HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel

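This paper proposes HiMAP-Travel, a hierarchical multi-agent framework that enables constraint adherence and parallel execution in long-horizon travel planning, and shows that it substantially outperforms existing methods on the TravelPlanner benchmark.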

The Viet Bui, Wenjun Li, Yong Liu | 2026-03-06 | 💻 cs

Stacked from One: Multi-Scale Self-Injection for Context Window Extension

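This paper proposes SharedLLM, a new framework that stacks a single LLM layer as both compressor and decoder. Through multi-scale self-injection and efficient tree-structured information retrieval, it processes long contexts of more than 128K tokens accurately and efficiently after training on 8K-token data.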

Wei Han, Pan Zhou, Shuicheng Yan | 2026-03-06 | 💻 cs

TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings

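This paper proposes TSEmbed, a framework that resolves the task conflicts arising when multimodal large language models are adapted into universal embedding models. It combines mixture of experts (MoE) with low-rank adaptation (LoRA) and adds a new negative sampling method (EANS) that exploits expert activation patterns, achieving state-of-the-art performance on the MMEB benchmark and on real-world industrial datasets.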

Yebo Wu, Feng Liu, Ziwei Xie + 4 more | 2026-03-06 | 💻 cs

Privacy-Aware Camera 2.0 Technical Report

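This paper proposes Privacy-Aware Camera 2.0, a new privacy-preserving perception framework based on the "AI Flow" paradigm: edge devices convert raw images into mathematically irreversible abstract feature vectors, and the cloud performs behavior recognition and semantic reconstruction using a dynamic contour language, reconciling privacy protection with evidentiary value.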

Huan Song, Shuyu Tian, Ting Long + 5 more | 2026-03-06 | 💻 cs

Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

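This paper addresses "contextual inertia", the tendency of LLMs to cling to earlier reasoning in multi-turn dialogue, with RLSTA, a reinforcement learning method that uses the model's strong single-turn reasoning ability as the reward reference (anchor). It reports stable interaction and cross-domain generalization even without an external verifier.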

Xingwu Chen, Zhanqiu Zhang, Yiwen Guo + 1 more | 2026-03-06 | 💻 cs

Beyond Linear LLM Invocation: An Efficient and Effective Semantic Filter Paradigm

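To overcome the inefficiency of sequential evaluation in semantic filtering with large language models (LLMs), this paper proposes a new framework, Clustering-Sampling-Voting (CSV), which reduces the number of LLM calls to sublinear while maintaining high accuracy.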

Nan Hou, Kangfei Zhao, Jiadong Xie + 1 more | 2026-03-06 | 💻 cs

Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation

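This paper separates positional encoding from semantic embeddings in large language models and introduces the concept of an Attention Gravitational Field (AGF), which is empirically consistent with Newton's law of universal gravitation, and reports improved model optimization and interpretability.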

Edward Zhang | 2026-03-06 | 💻 cs

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

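This paper compares long-context LLMs with fact-based memory systems such as Mem0 and identifies an accuracy-cost trade-off: long-context models are better at factual recall, while memory systems are competitive on persona consistency and become cheaper once the number of conversation turns exceeds a certain threshold.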

Natchanon Pollertlam, Witchayut Kornsuwannawit | 2026-03-06 | 💬 cs.CL

Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

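This paper meta-analyzes 890 study results and identifies the limitations of LLMs in automated short-answer scoring (no correlation with difficulty, performance gaps between decoder and encoder models, tokenizer limitations, and racial bias in educational settings, among others), and calls for more appropriate system design.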

Michael Hardy | 2026-03-06 | 💬 cs.CL

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

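Focusing on systematic differences in gradient behavior that accompany the transition from "unfamiliar to familiar" during LLM training, this paper proposes GDS, a lightweight classifier based on gradient profiles in the FFN and attention modules, yielding a pre-training data detection method that overcomes the limitations of existing approaches and achieves high transferability and performance.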

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang + 2 more | 2026-03-06 | 💬 cs.CL

An Approach to Simultaneous Acquisition of Real-Time MRI Video, EEG, and Surface EMG for Articulatory, Brain, and Muscle Activity During Speech Production

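To capture brain, muscle, and articulator activity simultaneously during speech production, this paper proposes the simultaneous acquisition of real-time MRI, EEG, and surface EMG, along with a new artifact removal pipeline that suppresses their mutual interference.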

Jihwan Lee, Parsa Razmara, Kevin Huang + 16 more | 2026-03-06 | 🤖 cs.AI


Why Is RLHF Alignment Shallow? A Gradient Analysis

SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models

Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research
