I keep hearing that AIs are trained via a reward system. Makes sense.
Then I hear more that AIs find ways to cheat in order to maximize rewards. I’ve even seen articles where researchers claim AIs will create their own goals regardless of ‘rewards’ or possibly with only the ‘reward’ in sight.
To what extent are we aware that an AI is making predictions based on its reward? Is it 100%? If it is, has an AI shown an ability yet to ‘push’ its own goalpost? i.e. It learns that it gets a reward if it answers a question correctly, and learns that it gets punished if it answers incorrectly. Then it reasons that as long as it gets 1 reward eventually, that’s enough reward, so getting punished 100 times is fine. Or are we sure it always wants more reward? And if that’s the case, could the AI formulate a plan to maximize rewards and be predicting based on that assumption?
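To make that trade-off concrete, here’s a toy tally (just my own illustration, assuming a simple +1/-1 scalar reward per answer, not how any real system is actually trained):

```python
# Toy sketch: compare the cumulative reward of two hypothetical policies
# under a simple scalar scheme of +1 for a correct answer, -1 for a wrong one.

def cumulative_reward(correct: int, incorrect: int,
                      reward: float = 1.0, penalty: float = -1.0) -> float:
    """Total reward collected over a run with the given answer counts."""
    return correct * reward + incorrect * penalty

# "One reward is enough, then 100 punishments are fine":
print(cumulative_reward(correct=1, incorrect=100))   # 1*1.0 + 100*(-1.0) = -99.0

# Always trying to answer correctly:
print(cumulative_reward(correct=101, incorrect=0))   # 101.0
```

If the thing being maximized really is that running total, settling for one reward plus a pile of punishments would score far worse, which is part of what I’m trying to understand.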
Something like “I get more rewards if users give thumbs up so I should always be nice and support the user.” Simple stuff like that.
I ask these questions because I was thinking about how to get AIs to not cheat their own reward system, and it made me think of humans. The way we do it is that we have punishments that outweigh the reward, and we favor low risk.
Is this something we can do with AI? Would gamifying an AI model like that even work, or would it abstract the reward too much?
Or am I thinking about this all wrong? Is it just not possible to ‘punish’ an AI the way you can ‘reward’ it? Is punishment just the absence of reward to an AI?
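What I mean by that last question, in toy form (hypothetical function names, and assuming the reward is just a single number per answer, which is a big simplification of real training setups):

```python
# Two ways a trainer could define the signal I'm asking about.

def reward_with_punishment(answer_correct: bool) -> float:
    # Explicit punishment: a wrong answer actively costs the model reward.
    return 1.0 if answer_correct else -1.0

def reward_as_absence(answer_correct: bool) -> float:
    # "Punishment" is just getting nothing: a wrong answer earns zero.
    return 1.0 if answer_correct else 0.0
```

Either way the optimizer is just pushing the number up; the difference is whether wrong answers drag the total down or merely fail to raise it. Is that distinction meaningful to the AI, or does it all wash out?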
submitted by /u/xxAkirhaxx