Eval JavaScript - Search News

CAST-Eval: A Domain-Specific Benchmark for Large Language Models in Civil Aviation Safety

Abstract: In this paper, we present CAST-Eval, a novel, comprehensive and domain-specific benchmark designed to assess the knowledge and reasoning capabilities of large language models (LLMs) in the ...

Microsoft

Developer-targeting campaign using malicious Next.js repositories

A developer-targeting campaign leveraged malicious Next.js repositories to trigger a covert RCE-to-C2 chain through standard ...

GitHub

EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

If you have any questions about the code or the paper, feel free to contact Zhiyuan Zeng (zhiyuan1zeng@gmail.com or zyzeng@cs.washington.edu). If you encounter any issues while using the code or want ...

Tencent News (腾讯网)

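Skills Can Finally Do Real Work for Me: Turning Ticket Analysis into a Reusable Skill

When Anthropic first launched Skills [1], I was very excited. The official stance was also clear: stop fixating on building complex Agents and put your effort into Skills instead. But after carefully going through the official and community Skills examples [2], I quickly cooled down: almost none of the Skills could run directly in a real environment. My verdict at the time was that this was just a toy. Until recently, Claude Code 2.1.3 ...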
