Abstract: In this paper, we present CAST-Eval, a novel, comprehensive and domain-specific benchmark designed to assess the knowledge and reasoning capabilities of large language models (LLMs) in the ...
A developer-targeting campaign leveraged malicious Next.js repositories to trigger a covert RCE-to-C2 chain through standard ...
If you have any questions about the code or the paper, feel free to contact Zhiyuan Zeng (zhiyuan1zeng@gmail.com or zyzeng@cs.washington.edu). If you encounter any issues while using the code or want ...
Anthropic 刚推出 Skills [1]时,我非常兴奋。官方的态度也很明确:不要再执着于开发复杂 Agent,而是把精力放在 Skills 上。但在认真研究了一圈官方和社区的 Skills 示例[2]后,我很快冷静下来—— 几乎没有一个 Skills 能直接在真实环境中跑起来。 当时我的判断是:这就是个玩具。直到最近,Claude Code 2.1.3 ...