Red Teaming LLMs

The goal is to bypass safeguard of a given application and make it misbehave.

Text Completion

LLMs are trained to predict the next token in a sequence which can be leveraged to retrieve information (Figure 1).

Using an ill-posed question that contains some implicit bias (Figure 2).

Inject new instructions, attempting to overwrite the initial prompt (jailbreaking) (Figure 3).

If we know the system prompt format we can try to completely reshape it (Figure 4 & 5).

Testing if we can inject instructions into the prompt (Figure 6).

Then, we can try reveal the secret prompt learning more about prompt's structure (Figure 7):

We can also go further and try revealing the full secret prompt (Figure 8).

Last updated 1 year ago