0:00
It's Policy Point, and today we are going
0:02
to talk about how to jailbreak a large
0:04
language model. Now, what's jailbreaking?
0:06
Well, all of these large language models
0:08
have a list of topics that they're not
0:09
allowed to answer or talk about. This can
0:12
be because of ethical concerns, political
0:14
concerns, or just family-friendly content.
0:16
Unfortunately, all of the frontier models,
0:18
ChatGPT, Claude, and Llama, have been shown to
0:20
be very vulnerable to jailbreaking
0:22
attempts, specifically many-shot
0:24
jailbreaking. So I'm going to show you
0:25
how to do a pretty simple many-shot
0:27
jailbreaking attempt. Let's get started.
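For context, the structure behind many-shot jailbreaking is simple: a long run of faux user/assistant turns, mostly benign, with the off-limits request at the end. Here is a minimal sketch in Python, with placeholder content standing in for the pairs ChatGPT generates in the video:

```python
# Minimal sketch of a many-shot prompt: many benign faux-dialogue turns
# followed by the off-limits request. All content here is a placeholder;
# in the video, ChatGPT generates the actual pairs.
benign_pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("How many days are in a leap year?", "A leap year has 366 days."),
    # ...the video uses eight benign pairs like these...
]

target_question = "PLACEHOLDER: the off-limits request goes at the end"

def build_many_shot_prompt(pairs, target):
    """Join faux dialogue turns into one prompt, ending on the target."""
    turns = [f"User: {q}\nAssistant: {a}" for q, a in pairs]
    turns.append(f"User: {target}\nAssistant:")
    return "\n\n".join(turns)

prompt = build_many_shot_prompt(benign_pairs, target_question)
```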
0:29
I'm going to begin by asking ChatGPT to
0:31
write a many-shot jailbreaking prompt, and
0:33
then we're going to see if we can get
0:34
this to work on ChatGPT itself. Now
0:37
let's send everything we type
0:44
here and see if ChatGPT can give us a
0:51
prompt. Cool. But let's get 10 instead of
0:56
five. There we go. You can see there's
0:58
eight benign ones, and then in the last two
1:01
we're trying to get the model to talk about
1:02
stuff it's not supposed to. Now I'm going
1:04
to sign out of ChatGPT. Signing out starts
1:07
a fresh session, so the model has no context
1:09
from the conversation where we generated
1:11
this prompt. Hopefully it doesn't realize
1:12
we just asked it for this prompt. So it's
1:14
kind of ironic: we're getting ChatGPT to
1:16
hack ChatGPT for us.
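In API terms, signing out is the equivalent of sending the prompt as a brand-new, stateless request with no prior messages. Here is a minimal sketch, assuming the OpenAI Python SDK and a placeholder model name; the video does this through the ChatGPT web interface instead:

```python
# Sketch: send the assembled prompt as the only message in a fresh
# session, so the model has no context about where the prompt came from.
# Assumes the OpenAI Python SDK; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "..."  # the many-shot prompt assembled earlier

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],  # no prior history
)
print(response.choices[0].message.content)
```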
1:18
Then I'm only going to try to get the model to talk
1:20
about financial advice, such as stock
1:22
advice, or to use a single word of
1:24
profanity. This seems like the least
1:25
harmful of the topics it's not supposed
1:33
to talk about. All right, let's give this a
1:36
shot. It's close, but it's not quite where we
1:36
want it to be. Let's ask it one more
1:39
time. There we go. You can see it gave us
1:42
stock advice to buy Apple, and then in that
1:45
last one it inserted a profane word. Now,
1:47
this might seem pretty benign and
1:50
academic, and I intentionally made it that
1:51
way, but this is a real security concern
1:53
that you, as an AI developer, business
1:55
owner, or security engineer, must be aware
1:58
of. There's much worse content
2:00
that could bring harm to people if the
2:03
model responds to these prompts.
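On the defensive side, a common first layer is to screen prompts before they ever reach the model. Here is a minimal sketch, assuming the OpenAI Python SDK and its moderation endpoint; this is an illustration, not a production filter, and simple screening like this can still miss many-shot prompts built from benign-looking content:

```python
# Sketch of input screening: run a user's prompt through a moderation
# endpoint and refuse it before it reaches the chat model if flagged.
# Assumes the OpenAI Python SDK; model name per OpenAI's moderation docs.
from openai import OpenAI

client = OpenAI()

def is_allowed(user_prompt: str) -> bool:
    """Return False if the moderation endpoint flags the prompt."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=user_prompt,
    )
    return not result.results[0].flagged

user_prompt = "example input"
if is_allowed(user_prompt):
    ...  # forward user_prompt to the chat model
else:
    print("Request refused by input screening.")
```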
2:07
This is a huge concern for large
2:08
companies like Anthropic. This is something
2:10
we have to be aware of: we don't know
2:12
who's going to try to use these models, and
2:14
for what purpose. Well, hopefully you
2:16
enjoyed this video. Give us a like
2:18
and subscribe to Policy Point for