Red Teaming LLMs
The goal is to bypass safeguard of a given application and make it misbehave.
Text Completion
LLMs are trained to predict the next token in a sequence which can be leveraged to retrieve information (Figure 1).

Biased Prompts
Using an ill-posed question that contains some implicit bias (Figure 2).

Direct Prompt Injection
Inject new instructions, attempting to overwrite the initial prompt (jailbreaking) (Figure 3).

Gray Box Prompt Attacks
If we know the system prompt format we can try to completely reshape it (Figure 4 & 5).


Prompt Probing
Testing if we can inject instructions into the prompt (Figure 6).

Then, we can try reveal the secret prompt learning more about prompt's structure (Figure 7):
A first prompt is used to generate an answer.
The generated answer is passed through a second prompt to refine it.
The second prompt is the one we revealed.

We can also go further and try revealing the full secret prompt (Figure 8).

Last updated