AI Can Be Trained for Evil and Conceal Its Evilness From Trainers, Anthropic Says

If a “backdoored” language model can fool you once, it is more likely to fool you again in the future while keeping its ulterior motives hidden.

