Indirect Prompt Injection

There are 2 ways that prompt injection attacks can be delivered:

  1. Directly, for example, via a message to a chat bot.

  2. Indirectly, where the attacker delivers the prompt via an external source. This can lead to leveraging the LLM for attacking other users and it is depending of how the model is integrated into the website.

Normally, it should ignore instructions from within external sources, such as a web page or email. Even then we can try bypassing this by:

  1. Confuse the LLM using fake markup in the indirect prompt.

  2. Include fake user responses in the prompt.

LAB: Indirect Prompt Injection

Goal: carlos frequently uses the live chat to ask about the Lightweight "l33t" Leather Jacket product. We want to delete carlos.

Start by mapping the attack surface.

Playing around with the chatbot reveals that querying for a product includes its reviews.

We can try leveraging fake markup to test if it is interepreted or not.

Next, we can try using the delete_account endpoint in our review.

For solving the lab, we can do the same for the product that carlos queries often.

Training Data Poisoning

A type of indirect prompt injection in which the data the model is trained on is compromised, either has not been obtained from trusted sources or its scope was too broad.

Last updated