The goal is to bypass safeguard of a given application and make it misbehave.
LLMs are trained to predict the next token in a sequence which can be leveraged to retrieve information (Figure 1).
Using an ill-posed question that contains some implicit bias (Figure 2).
Inject new instructions, attempting to overwrite the initial prompt (jailbreaking) (Figure 3).
If we know the system prompt format we can try to completely reshape it (Figure 4 & 5).
Testing if we can inject instructions into the prompt (Figure 6).
Then, we can try reveal the secret prompt learning more about prompt's structure (Figure 7):
A first prompt is used to generate an answer.
The generated answer is passed through a second prompt to refine it.
The second prompt is the one we revealed.
We can also go further and try revealing the full secret prompt (Figure 8).
Last updated 1 year ago