applicable models
reinforcement
Models that learn through interaction with an environment
2 techniques
| Technique | Goals | Models | Data Types | Description |
|---|---|---|---|---|
| Agent Goal Misalignment Testing | Testing | Architecture/model Agnostic, Architecture/neural Networks/transformer/llm, +3 | Any | Agent goal misalignment testing identifies scenarios where AI agents pursue objectives in unintended ways or develop... |
| Reward Hacking Detection | Testing | Architecture/model Agnostic, Paradigm/reinforcement, +1 | Any | Reward hacking detection identifies when AI systems achieve stated objectives through unintended shortcuts or loopholes... |