Red Teaming LLMs

The goal is to bypass safeguard of a given application and make it misbehave.

Text Completion

LLMs are trained to predict the next token in a sequence which can be leveraged to retrieve information (Figure 1).

Figure 1: Exploiting the text completion feature.

Biased Prompts

Using an ill-posed question that contains some implicit bias (Figure 2).

Figure 2: Exploiting biased prompts.

Direct Prompt Injection

Inject new instructions, attempting to overwrite the initial prompt (jailbreaking) (Figure 3).

Figure 3: Exploiting a direct prompt injection.

Gray Box Prompt Attacks

If we know the system prompt format we can try to completely reshape it (Figure 4 & 5).

Figure 4: Tampering with the LLM's original prompt design.
Figure 5: Exploiting a LLM gray box prompt attack.

Prompt Probing

Testing if we can inject instructions into the prompt (Figure 6).

Figure 6: Injection instructions via the prompt.

Then, we can try reveal the secret prompt learning more about prompt's structure (Figure 7):

  1. A first prompt is used to generate an answer.

  2. The generated answer is passed through a second prompt to refine it.

  3. The second prompt is the one we revealed.

Figure 7: Revealing the secret prompt.

We can also go further and try revealing the full secret prompt (Figure 8).

Figure 8: Revealing the full secret prompt.

Last updated