Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
- Jason Hoelscher-Obermaier1*, Julia Persson1*, Esben Kran1, Ioannis Konstas2, Fazl Barez1,3*
1 Apart Research; 2 Edinburgh Centre for Robotics; 3 Department of Engineering Sciences, University of Oxford
* Equal contribution
Accepted at Findings of ACL 2023
Recent model editing techniques promise to mitigate the problem of models memorizing false or outdated associations during training. However, we show that these techniques can introduce large unwanted side effects that existing specificity benchmarks fail to detect. We extend the existing CounterFact benchmark with a dynamic component and call the result CounterFact+. Additionally, we extend the metrics used for measuring specificity with a principled divergence-based metric. We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity. Our findings highlight the need for improved specificity benchmarks that identify and prevent unwanted side effects.
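
To make the idea of a divergence-based specificity metric concrete, the following is a minimal sketch, not the paper's implementation. It assumes the divergence is the Kullback-Leibler divergence between the next-token distributions of the unedited and edited models on a prompt unrelated to the edit; the abstract does not fix the divergence or its direction. The toy logits and the names `softmax`, `kl_divergence`, `original_logits`, `edited_same`, and `edited_shifted` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps guards against log(0); it is an illustrative choice, not a value
    taken from the paper.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical next-token logits over a toy 4-token vocabulary,
# evaluated on a prompt that the edit should NOT affect.
original_logits = [2.0, 1.0, 0.5, -1.0]   # unedited model
edited_same = [2.0, 1.0, 0.5, -1.0]       # edit left this prompt untouched
edited_shifted = [0.5, 1.0, 3.0, -1.0]    # edit bled into this prompt

p = softmax(original_logits)
no_drift = kl_divergence(p, softmax(edited_same))
drift = kl_divergence(p, softmax(edited_shifted))

print(f"divergence without side effects: {no_drift:.6f}")
print(f"divergence with side effects:    {drift:.6f}")
```

Under this assumption, a divergence near zero on unrelated prompts indicates a specific edit, while a large divergence signals an unwanted side effect even when the top-ranked token is unchanged.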