Strategies For Effective Prompt Engineering

TL;DR
- There are three basic prompt engineering strategies: instruction-based (clear directives for precision), context-based (relevant situational details for accuracy), and example-based (mimicking examples for consistency).
- Use quantitative metrics (like accuracy and perplexity) and qualitative metrics (like user satisfaction and creativity) to assess how well prompts elicit the desired responses from models, ensuring reproducibility and comparability.
- A/B testing compares two prompt versions using these metrics to determine which elicits more effective responses. Refining your approach based on the results lets you optimize performance.
- Advanced prompting techniques enable detailed reasoning, complex tasks, a consistent structure across many examples, and context-specific interaction. They are a great option when you need to scale or automate your project.
When I first delved into machine learning, prompt engineering seemed like a niche area, outside of the scope of what an engineer like me needed to know. Yet, as large language models (LLMs) have evolved, it has become clear that prompt engineering is not only a skill but a critical component in the LLMOps value chain.
Far from being a simple task, crafting effective prompts requires understanding the model’s capabilities and the application’s needs. It’s not just about getting a model to produce text—it’s about getting it to produce the right text in the right context. This is why, regardless of where you stand in the AI field, understanding and mastering prompt engineering will help you improve the effectiveness of your AI projects.
In this article, I’ll explore the following questions:
- What are the basic and advanced strategies for prompt engineering, and when should they be used?
- How do we evaluate and refine the effectiveness of prompts iteratively?
- What is A/B testing in prompt engineering, and how can it be implemented?
- What are the common challenges in prompt engineering, and how do we address them?
What is a prompt, anyway?
Let’s start with the basics: a prompt is the instruction that the user gives to the generative model. At its core, a prompt consists of four elements:
- The instructions guide the model on what to do, and they might include commands like “summarize”, “translate”, or “write”.
- The user request represents the primary goal that the user expects to achieve with the prompt.
- The context refers to the background information that helps the model interpret the prompt correctly, such as the desired tone, format, or subject matter.
- Finally, the constraints are specific limitations that the model needs to respect when generating the output, such as word limits or a specific writing style.
Language is inherently ambiguous and open to interpretation, which is a challenge for LLMs because they only have the limited context and instructions we provide to work with. This ambiguity means that there is no fixed prompt format that would guarantee the desired outcome.
Rather than randomly trying to improve a prompt’s performance, prompt engineering follows a structured process. It applies analytical techniques to assess our prompt’s and model’s strengths and weaknesses. This systematic approach helps understand how different prompts affect the model’s responses and provides a basis for systematic improvement.
Basic prompt engineering strategies
While there is no universal way to design a prompt, there are strategies we can follow to create prompts that yield LLM outputs closer to what we expect.
The basic strategies we cover in this section are straightforward to implement in a lot of different tasks, and they involve minimal customization. We can classify the basic prompting strategies into three categories: instruction-based, contextual, and example-based approaches.
Instruction-based approach
Instruction-based prompts have two essential components: a task (a well-defined objective) and instructions or specific guidelines that define how the generative model should approach the specific task. An instructional prompt is clear, detailed, and concise.
Consider the task of giving directions. If you say, “Go to the store,” it is unclear which store you mean, the means of transport to get yourself there, or which route to take. However, if you say, “Take the 99 bus until the stop next to the hospital, then walk straight for three blocks until you reach the grocery store on your right,” the directions are clearer and more actionable.
Similarly, in prompt engineering, if you provide a generative model with a detailed and specific prompt—such as, “Write a short story about a detective who solves a mystery involving a missing earring in formal English. The story must have five paragraphs, with an introduction, development, and ending”—the model can generate a response that aligns with your expectations.
Being clear about our objective (a story about a detective) and the specific guidelines for structure, tone, language, and style helps the model produce the outputs we want in fewer iterations, improving efficiency and avoiding off-target answers. However, being too rigid or redundant in our instructions can sometimes limit the model’s creativity in generating innovative outputs.
How to navigate this trade-off? A simple trick is to clearly define the non-negotiable requirements (for example, you want a story about a detective with a certain structure) and to leave the optional elements open-ended to give the model some freedom (for example, suggest “a plot twist” without detailing what the plot twist should be). By striking a balance between clarity and flexibility, we reduce the likelihood of unwanted answers while leaving room for different results.
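As a minimal sketch of the instruction-based approach (assuming the OpenAI Python client that also appears in the tutorial below), the non-negotiable requirements go into the prompt verbatim while the plot twist stays open-ended:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Non-negotiable requirements (topic, structure, tone) are stated explicitly;
# the plot twist is left open-ended to give the model some freedom.
instruction_prompt = (
    "Write a short story about a detective who solves a mystery involving "
    "a missing earring. Write in formal English and use exactly five paragraphs, "
    "with an introduction, development, and ending. Include a plot twist of your choice."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": instruction_prompt}],
)
print(response.choices[0].message.content)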
Context-based approach
A contextual prompt includes relevant information to guide the model’s response in a particular direction. In this approach, we leverage the model’s ability to generate a contextually appropriate outcome based on the additional factual information (the context) rather than generating an answer based solely on its training data.
To illustrate this, let’s consider the process of writing a letter of apology for missing an important exam. The impact of providing context versus not providing context can be demonstrated by examining how a model like ChatGPT responds under different conditions (you can try it yourself):


Can you spot the difference? In the first interaction (with no context), the prompt leads to a correct response based on general patterns in the training data, yet it lacks personal relevance to the situation. In the second interaction, the prompt includes additional context, leading to a more situation-appropriate answer that incorporates specifics like the illness, the emotional tone, and the request to reschedule.
While the context helps cue the model’s responses toward a relevant direction, it is different from providing explicit instructions. Contextual prompts enrich the model’s understanding by providing background information, while instruction-based prompts dictate what the content or structure of the output should be. Thus, context-based prompting is particularly helpful for tasks where situational details play an important role.
However, crafting prompts with a good selection of information to provide context takes time. Also, overly detailed prompts can lead to the LLM’s responses being less accurate as it struggles to identify relevant information.
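As a minimal sketch (assuming the same OpenAI Python client used in the tutorial below; the professor’s name and the exact details are made up for illustration), a context-based prompt simply passes the relevant background along with the request:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The background (context) below is hypothetical and only illustrates the pattern
context = (
    "Background: I am a university student. I missed yesterday's final exam "
    "because I came down with a bad flu. I feel terrible about it and would "
    "like to ask Professor Smith whether the exam can be rescheduled."
)
request = "Write a short, sincere letter of apology based on the background above."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"{context}\n\n{request}"}],
)
print(response.choices[0].message.content)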
Example-based approach
An example-based prompt (also referred to as an in-context prompt) uses mimicking to guide the model to a similar output. By feeding examples to the LLM, it gets a clear idea of which template, tone, and style to follow.
Let’s look at a concrete example. Say I want the model to write a review, but for SEO optimization reasons, I want it to include relevant hashtags in the final answer:
Input: Here is a restaurant review: ‘Dined at The Olive Branch last night. The ambiance was cosy, the staff were attentive, and the pasta made me feel like living in an Italian dream. Will definitely be coming back! ✨ #PastaLover #ItalianFood #RealFoodie’. Now, write a review about a coffee shop named ‘Zeri’s Specialty Coffee’.
Output: Visited Zeri’s Specialty Coffee yesterday. The atmosphere was warm and inviting, the baristas were friendly and knowledgeable, and the coffee was nothing short of exceptional. The rich aroma and perfectly brewed espresso transported me to a coffee lover’s paradise. Will definitely be returning! ☕✨ #CoffeeLover #SpecialtyCoffee #CafeLife
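With chat models, one common way to pass the example is as a prior user/assistant exchange (a sketch assuming the same OpenAI client; the framing of the first turn is illustrative, the review texts are the ones above):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The example review is framed as a previous user/assistant exchange so the
# model mimics its template, tone, and hashtag style in the new answer.
messages = [
    {"role": "user", "content": "Write a short review of a restaurant named 'The Olive Branch'."},
    {"role": "assistant", "content": (
        "Dined at The Olive Branch last night. The ambiance was cosy, the staff "
        "were attentive, and the pasta made me feel like living in an Italian dream. "
        "Will definitely be coming back! #PastaLover #ItalianFood #RealFoodie"
    )},
    {"role": "user", "content": "Now, write a review about a coffee shop named 'Zeri's Specialty Coffee'."},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)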
Let’s explore another practical application of prompt engineering, this time in Computer Vision: generating image captions. If we feed a well-defined example to the generative model—such as an image paired with a detailed caption—the model learns how to produce similar descriptions for new images.

Evaluating prompt effectiveness in Large Language Models
To optimize prompts, we need ways to assess how well different prompts elicit the desired responses from the model. Metrics provide a way to objectively measure a prompt’s performance, helping us quantify how well it meets specific criteria and allowing us to compare effectiveness across different scenarios.
With standardized metrics, we can benchmark prompts, evaluate their performance consistently, and monitor it over time. Building trust in our prompts (and, by extension, our models) starts with reproducible experiments: tracking the metrics ensures that others can reproduce our experiments and validate the results.
Quantitative metrics are numerical measures to compare prompts across each other or with themselves in different scenarios (these are the ones we refer to when speaking about ‘objective metrics’). On the other hand, qualitative metrics measure subjective aspects of the model’s output.
| Quantitative metrics | Qualitative metrics |
| --- | --- |
| Accuracy: Evaluates if the generated response aligns with the expected information. | User satisfaction: Measures how well the model’s responses meet the users’ needs. Information is collected through surveys, interviews, etc. |
| Perplexity: Measures how well the model predicts the next word in a sentence (the lower, the more confident the prediction). If a prompt is well-designed, the output is likely to reduce the uncertainty (and thus perplexity) of the model’s predictions. | Human evaluation: Measures the level of coherence and relevance according to human judges. |
| Fluency: Measures the level of coherence, readability, and correctness (the higher, the better). | Coherence: Measures if the narrative is logically structured and connected. |
| Diversity: Evaluates the variety of vocabulary, ideas, and structure (the higher, the better). | Creativity: Measures the originality and inventiveness of the LLM’s output according to humans. |
| Relevance: Measures how well the output addresses the given instructions (the higher, the better). | Engagement: Measures how well the information captures and retains the reader’s attention. |
Tutorial: Evaluating and refining prompts with neptune.ai
Let’s put what we learned so far into practice with a hands-on example: we’ll generate a creative story about a hero’s journey. To get to the best possible result, we need to analyze the metrics of each possible prompt and refine it.
In this tutorial, we’ll use LLMs available via the OpenAI API. There is a free trial period, which is sufficient for our tutorial. You can also use a free platform like Groq. To monitor the experiments and the metrics, we’ll use neptune.ai.
Find the complete tutorial and notebook for hands-on practice: Evaluating Prompt Effectiveness (Notebook)
Setting up
First, we create a new project in Neptune and get our Neptune API token. We also need to have the personal token for the OpenAI API at hand.
Next, we install the dependencies:
pip install neptune==1.10.4
pip install torch==2.3.1
pip install textstat
pip install nltk==3.8.1
pip install openai==1.41.0
This allows us to set up the Neptune and OpenAI clients:
import neptune
from openai import OpenAI

# Initialize the Neptune run (the project name has the form "workspace-name/project-name")
run = neptune.init_run(
    project="your_name",
    api_token="your_token",
)

# Initialize the OpenAI client
client = OpenAI(
    api_key="your_token",
)
Defining the evaluation metrics
In the second step, we define and implement the metrics that will be evaluated, as well as the evaluation function. I chose diversity, fluency, and perplexity as metrics for this tutorial because their implementations are the simplest. You probably want to choose additional metrics for your own projects.
import torch
import textstat
import nltk
from nltk.tokenize import word_tokenize
from transformers import GPT2LMHeadModel, GPT2Tokenizer

nltk.download('punkt')  # tokenizer data required by word_tokenize

# Load pre-trained model and tokenizer for perplexity calculation
model_name = 'gpt2'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

def calculate_diversity(text):
    """Compute the number of non-repeated words over the total."""
    tokens = word_tokenize(text.lower())
    num_tokens = len(tokens)
    num_unique_tokens = len(set(tokens))
    diversity = num_unique_tokens / num_tokens if num_tokens > 0 else 0
    return diversity

def calculate_fluency(text):
    """Measure how easy the generated text is to read (Flesch reading ease)."""
    readability_score = textstat.flesch_reading_ease(text)
    return readability_score

def calculate_perplexity(text):
    """Measure how well the model predicts the next word in a sentence."""
    tokens_tensor = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(tokens_tensor, labels=tokens_tensor)
    loss = outputs.loss
    perplexity = torch.exp(loss).item()
    return perplexity
The evaluation function measures how well the responses of the LLM align with the desired outcome. It is key to assess the performance of a prompt. Here’s a breakdown of a possible implementation of the evaluation function:
def generate_text(prompt, max_tokens):
    """Auxiliary function that sends the prompt to the OpenAI API and returns the response."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()

def evaluate_prompt(prompt, max_tokens):
    """Generate a response to the prompt and compute the evaluation metrics for it."""
    generated_text = generate_text(prompt, max_tokens)
    print(f"Generated Text: {generated_text}")
    metrics = {
        'diversity': calculate_diversity(generated_text),
        'fluency': calculate_fluency(generated_text),
        'perplexity': calculate_perplexity(generated_text),
    }
    return metrics
Establishing an evaluation baseline
Now that we have set up all the metrics and utility functions for our prompt engineering experiment, the next step is to test the prompt for the first time and log the results to Neptune. We’ll log the results for each metric in individual arrays to get a line plot in the workspace.
In Neptune, a run is the basic metadata container for an experiment. Each project can contain an unlimited number of runs. The run object that we instantiated above behaves like a Python dictionary.
In our experiments, we will try different values of max_tokens (the maximum number of tokens that the response can have) to test if the metrics perform better at a certain length of response.
In the following example, we log the metrics for the three measures at each run. For example, for diversity, we log the metric’s values under metrics/diversity.
token_ranges = range(15, 200, 15)
results = []

for max_tokens in token_ranges:
    metrics = evaluate_prompt("Write a hero's journey", max_tokens)
    # Log metrics to neptune.ai
    run["metrics/diversity"].append(metrics['diversity'])
    run["metrics/fluency"].append(metrics['fluency'])
    run["metrics/perplexity"].append(metrics['perplexity'])
    run["max_tokens"].append(max_tokens)

# Finalize the experiment
run.stop()
Since we have logged arrays rather than individual numbers, we can visualize how the metrics change as a function of max_tokens:

At max_tokens = 30 (near the start of the range), there is a balance point where diversity and fluency are maximized, yet perplexity is very high. The next step is to reflect on these first results: Do they make sense? Are they what we want?
We seek to generate a creative story. We will prioritize the diversity and fluency of the story over perplexity, as it makes sense that a creative story is not easily predictable. This is an example of how a metric’s value is not good or bad in itself but depends on the context; we have to evaluate what it means in our particular case.
Therefore, for our particular case, it makes sense that perplexity reaches its highest point (remember: the higher it is, the less predictable the next word) exactly where fluency and diversity score best (the greater the variety of vocabulary, the more options there are for the next word).
Improving the prompt
Once we have analyzed the results, we change the prompt, log the new results, and re-evaluate. We want to know whether the metric values improve with another prompt and to look for patterns: does the best result for that max_tokens value repeat with another prompt?
Using the basic techniques explained in the previous section for improving prompts, we could craft prompts to create trials, enemies, rewards for the protagonist, internal conflict, and settings to give the narrative more context and precise instructions:
- A more specific narrative with internal conflict (orange plot): “Write a short scene where the hero receives an urgent call to a life-changing mission, write a short scene that reveals the hero’s internal conflict and how he overcomes it during his journey.”
- More context and setting (green plot): “Describe in a few lines the hero’s world or environment before he begins his adventure.”
By grouping and selecting all the experiments, we can compare the performances of these prompts. Thanks to Neptune, we can export the results to a dashboard or report. This is helpful to communicate our results to stakeholders and involve them in improving our product’s or application’s performance.

The reevaluation step is over when we have a prompt fulfilling our expectations. From these simple experiments, for example, we can conclude the following:
- Diversity does not vary significantly over experiments, i.e., the richness of vocabulary remains similar for all prompts. The best overall performer is the prompt with internal conflict. Overall, this metric improves slightly after acting on the first experiment’s results.
- High fluency, low diversity trade-off: For high numbers of tokens, there is high fluency but low diversity, which means the text is repetitive. We should therefore avoid this setting.
- In prompts with more context and a deeper narrative, perplexity varies greatly with max_tokens, while for simple prompts it remains similar. In other words, for more complex narratives, perplexity depends more strongly on the number of tokens than it does for simpler ones.
Steps to improve your prompts
Summing up, the steps to evaluate and improve your prompts are:
- Set up the environment.
- Define the metrics to evaluate your project and implement them.
- Create the first experiment to test the initial prompt and log the results.
- Think about the results: Do they make sense? Are they what you are looking for? Create a dashboard or a report highlighting the relevant results and insights.
- Create more experiments and repeat steps 2 to 4 until you feel confident about the results.
A/B testing in prompt engineering: What is it? How to implement it?
A/B testing is a method to compare two variants, A and B. It is a widespread procedure for avoiding ineffective strategies and making data-driven decisions.
To understand this, let’s look at an example. Say we want to find out which of the following prompts is more effective:
- A: “Describe in a few lines the hero’s world or environment before he begins his adventure.”
- B: “Write a short scene where the hero receives an urgent call to a life-changing mission, write a short scene that reveals the hero’s internal conflict and how he overcomes it during his journey.”
First, we define a hypothesis: the prompt with specific instructions (B) will perform better than the one with fewer details and a call for the model to be creative (A).
Next, we select the metrics to compare. In our case, that’s fluency, diversity, and perplexity. In other projects, it could be conversion rates, click-through rates, engagement time, or any other relevant performance indicator. Note that if you use a qualitative indicator (for example, user satisfaction), don’t forget to randomly assign users to the control and variation groups to ensure unbiased results.
As part of our tutorial on evaluating and refining prompts, we have already implemented the evaluation methods for the three metrics. Now, we’ll use these methods to implement A/B testing with Neptune.
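Here is a minimal sketch of how this comparison could be logged, reusing the evaluate_prompt function from the tutorial; the summary namespace and tags are just one way to organize the runs:

import statistics
import neptune

prompts = {
    "A": "Describe in a few lines the hero's world or environment before he begins his adventure.",
    "B": (
        "Write a short scene where the hero receives an urgent call to a life-changing mission, "
        "write a short scene that reveals the hero's internal conflict and how he overcomes it "
        "during his journey."
    ),
}

for variant, prompt in prompts.items():
    run = neptune.init_run(project="your_name", api_token="your_token", tags=["ab-test", variant])
    scores = {"diversity": [], "fluency": [], "perplexity": []}
    for max_tokens in range(15, 200, 15):
        metrics = evaluate_prompt(prompt, max_tokens)  # defined in the tutorial above
        for name, value in metrics.items():
            scores[name].append(value)
            run[f"metrics/{name}"].append(value)
    # Log summary statistics so the two runs can be compared side by side
    for name, values in scores.items():
        run[f"summary/{name}/mean"] = statistics.mean(values)
        run[f"summary/{name}/variance"] = statistics.variance(values)
    run.stop()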

In our project, we create a custom view with the experiments we want to compare (PROMPT-15 and PROMPT-16), selecting the average and variance of the recorded metrics. Plus, we can access the comparative plots of both experiments.

The metrics show that Prompt B performs better. It has higher and more consistent scores for diversity and fluency, with fewer outliers (less variance). Additionally, Prompt B has a lower average perplexity and smaller variance in perplexity scores. This is confirmed visually in the following plots:

Finally, based on these metrics, we do not reject our hypothesis: the prompt with specific instructions (B, represented by the orange lines) performs better than the one with fewer details and a call for the model to be creative (A, represented by the green lines).
Find a notebook with A/B testing examples to practice: A/B Testing Notebook
Best practices for A/B testing
Summarizing this section, you should keep in mind the following aspects to implement A/B testing successfully:
- Clear objectives: Take the time to define a clear hypothesis you want to prove. Otherwise, you may find yourself drowning in data after running your experiments.
- Consistent group segmentation: If your experiments involve human evaluation, the control and test groups (A and B) should be similar in demographics, behavior, previous knowledge, or any other aspect you want to test. Also, assign participants to each group randomly. This is the best way to avoid skewed results or differences in performance based on external variables.
- Iterative testing: Often, one round of testing won’t be enough to find effective prompts and patterns in your data. Conduct multiple rounds of testing and prompt refinement for better results. Too few rounds may not yield significant, reliable results, while testing for too long wastes resources and time.
- Document and share findings: Keep a record of your A/B tests, and remember to include hypotheses, methodologies, data analysis, and results. This will be useful for reproducing experiments in the future. Neptune is an intuitive tool for tracking experiments and logging metrics that is well-suited for this task.
Advanced prompting techniques
The basic prompting techniques that we used so far often fall short in dealing with complex tasks because the prompts are overly simple and literal. This works well for generic and straightforward tasks but fails when the context is volatile or for tasks that involve multiple steps.
Advanced strategies solve these limitations by adding layers of sophistication and variability to prompts.
Chain-of-Thought prompts
Chain-of-Thought (CoT) prompting is an advanced method that encourages LLMs to generate intermediate steps that explain their reasoning.
If we show the language model the logical reasoning that leads to a solution and instruct it to follow a similar reasoning process, its answers become more accurate.
According to a 2022 paper by Jason Wei and colleagues, CoT is very effective for arithmetic and symbolic reasoning tasks.
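To make this concrete, here is a small sketch of a few-shot Chain-of-Thought prompt in the spirit of the arithmetic examples from Wei et al., reusing the generate_text helper from the tutorial:

# Few-shot Chain-of-Thought prompt: the worked example spells out the intermediate
# reasoning that the model is expected to imitate for the new question.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?\n"
    "A:"
)
print(generate_text(cot_prompt, max_tokens=150))  # reuses the helper from the tutorial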

Advantages of Chain-of-Thought prompting
The main features of Chain-of-Thought prompts include:
- Self-consistency: Self-consistency is achieved by sampling multiple chains of thought for the same problem and selecting the most consistent answer among the samples (for example, by majority vote).
- Robustness: CoT prompts perform consistently across different linguistic styles and language models. In “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” Wei and colleagues demonstrated that CoT was robust to three sets of annotations (each of them written by a different annotator).
- Sensitivity: Prompt sensitivity refers to the degree to which model performance is affected by prompt design. Complex tasks increase the model’s sensitivity, so we must ensure the task matches the prompt to avoid deteriorating performance.
- Coherence: This metric reflects the degree to which the intermediate steps are logically ordered, with each step building only on what the previous steps have already established.
Disadvantages of Chain-of-Thought prompting
Chain-of-Thought prompting does not positively impact the outcome for small LLMs. In LLMs with fewer than ~100B parameters, it tends to perform worse than other advanced techniques.
Plus, for simple tasks, it may generate redundant reasoning steps if the model can’t identify enough meaningful steps to build a chain of thought that leads to a feasible solution.
Use Cases: When to use Chain-of-Thought prompts
Overall, CoT is an interesting technique to consider if:
- Your model is large (over ~100B parameters).
- You aim to solve complicated problems (taking care of prompt design).
- You want a solution that is robust to different writing styles.
Automatic Chain-of-Thought (Auto-CoT)
What if you want to use Chain-of-Thought but you have too many tasks to manually craft instructions?
Automatic Chain-of-Thought (Auto-CoT) automatically generates the chain of thought’s intermediate steps, saving manual efforts in prompt design to make it less human-dependent. Instead of encouraging the model to go “step by step,” as in CoT, it is encouraged to go “not just step by step, but also one by one.”

Auto-CoT was tested across arithmetic, symbolic, and commonsense reasoning tasks (that is, tasks based on general knowledge), and it matched or exceeded the performance of Manual-CoT in all cases (see the comparative results).
As Auto-CoT does not rely on human-generated prompts, the results are more scalable across domains than in CoT, which relies on expert knowledge to create effective prompts. Furthermore, in Auto-CoT, the model generates its own chains of thought, so reasoning chains are more consistent than those in CoT. In manual Chain-of-Thought prompting, reasoning chains depend heavily on the user’s input.
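As a rough sketch of the idea (not the authors’ implementation), the demonstrations can be generated automatically with a zero-shot trigger phrase and then prepended to new questions; the questions below are hard-coded stand-ins for what Auto-CoT would select by clustering:

def build_auto_cot_demos(representative_questions):
    """Generate reasoning chains automatically with a zero-shot trigger phrase."""
    demos = []
    for question in representative_questions:
        rationale = generate_text(f"{question}\nLet's think step by step.", max_tokens=150)
        demos.append(f"Q: {question}\nA: {rationale}")
    return "\n\n".join(demos)

# In the full method, these questions would be picked by clustering a larger
# question set; here they are hard-coded for illustration.
demos = build_auto_cot_demos([
    "If a train travels 60 km in 1.5 hours, what is its average speed?",
    "A shop sells pens in packs of 4. How many packs are needed for 18 pens?",
])

new_question = "A recipe needs 3 eggs per cake. How many eggs are needed for 7 cakes?"
print(generate_text(f"{demos}\n\nQ: {new_question}\nA:", max_tokens=150))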
Use Cases: When to use Automatic Chain-of-Thought
You may want to use Automatic Chain-of-Thought if:
- The task spans multiple domains or requires scalability across tasks (for example, in support chatbots).
- The volume of tasks is too large to rely on manual prompt generation (like any large-scale deployment).
- You don’t want to rely on extensive human-generated input.
When to look for alternatives
You may want to use Manual-CoT over Auto-CoT if:
- The task is very specialized and requires deep expert knowledge (e.g., legal reasoning).
- The task benefits from detailed, context-specific prompts.
- Human supervision is required to ensure quality over time efficiency (e.g., in medical diagnosis).
- You prefer to invest fewer resources in computational power.
Prompt templates
Prompt templates provide predefined ‘recipes’ for a structured and reproducible way to interact with AI models, providing consistency across similar tasks. They can contain instructions, context, and questions.
LangChain is the most popular framework for simplifying the creation of model-agnostic, reusable templates. With this tool, you don’t need to rewrite prompts from scratch, and you can manage, standardize, and deploy templates across different large-scale projects while minimizing the errors that arise from manually created prompts.
Let’s see an example of a template for generating educational content:
from langchain import PromptTemplate
from langchain.llms import OpenAI

# Define the prompt template for educational content
edu_template = PromptTemplate.from_template(
    template="Explain the concept of {concept} in simple terms for a {audience}."
)

# Initialize the language model
llm = OpenAI()

# Format the prompt using the template
formatted_prompt = edu_template.format(
    concept="photosynthesis",
    audience="middle school student"
)

# Generate the response from the language model
response = llm.predict(
    text=formatted_prompt
)

# Print the generated explanation
print(response)
Advantages of prompt templates
Separating the prompt formatting from the model invocation makes the code more modular and easier to read. This way, you can change the template or the model independently, and only specific parts of the code need to be updated, rather than making changes throughout.
Disadvantages of prompt templates
While prompt templates offer many advantages, it’s important to be aware of their limitations:
- Rigidity: Templates can be too rigid for creative tasks, limiting the flexibility required to face complex problems or assignments where context plays an important role. For specific tasks that require tailoring, relying on templates might be insufficient.
- Maintenance expenses: Managing and keeping up-to-date a large number of templates can be very expensive in terms of time and labor.
- Rapid prototyping: When trying out new ideas or approaches, creating prompts on the fly can be more effective than forcing them into a predefined template structure.
Use Cases: When to use prompt templates
Finally, I recommend trying this approach for your project if:
- You encounter repetitive tasks: prompt templates ensure uniformity and save a lot of time.
- Your project requires scalability. Templates can be a solution if you need to generate large volumes of prompts or multiple use cases.
- Your team is large. Templates are a way to standardize the format and structure across a collaborative prompt engineering team, as they maintain consistency in the inputs even if many people with different writing styles interact with the model.
Dynamic prompts
Dynamic prompting is a technique that tailors prompts to be flexible and context-specific. It uses parametrization and conditional logic to create more relevant interactions with Large Language Models. It is useful when the input parameters or the context change frequently, such as in content generation.
Parametrization incorporates variables into prompts that can be replaced with different values. This is the technique used in the previous section’s template for educational content. Conditional logic changes the prompt based on specific conditions in the program flow.
Incorporating the if-else statements adds a layer of adaptability to the prompts, as shown in the next example:
def conditional_logic(user_input):
    if "learn" in user_input:
        return "Are you looking to learn about a specific topic?"
    else:
        return "How can I assist you today?"
The SD Dynamic Prompts Auto1111 extension uses dynamic prompting to create prompts for text-to-image generators, such as Stable Diffusion. This tool allows users to define templates with placeholders that are automatically replaced by random values from a predefined list.
Let’s see an example, adapted from the dynamicprompts documentation, that samples three prompts from a template:
from dynamicprompts.generators import RandomPromptGenerator

template = "A {house|apartment|lodge|cottage} in {summer|winter|autumn|spring} by {artist1|artist2|artist3}"

# Randomly sample three prompts from the template's combinations
generator = RandomPromptGenerator()
prompts = generator.generate(template, 3)
print(prompts)
This template could produce prompts like:
- “A house in summer by artist1”
- “A lodge in autumn by artist3”
- “A cottage in winter by artist2”
Advantages of dynamic prompting: Dynamic prompting provides flexibility, reduces repetitive coding by reusing and modifying templates, and adapts to different contexts and inputs to produce relevant output for many scenarios.
Disadvantages of dynamic prompting: Dynamic prompts are more complex to implement than static prompts, requiring careful handling of conditions and debugging. Errors in parameterization or logic can result in a large number of incorrect prompts.
Use cases: When to use dynamic prompts
Try dynamic prompting if:
- You need flexibility and adaptability in prompt generation.
- The context or details of the prompt change frequently.
- You require personalized interactions.
When to look for alternatives
On the contrary, avoid dynamic prompts (and try prompt templates) if:
- The task is straightforward and doesn’t require much variation.
- Consistency and simplicity are more important than flexibility.
- You are working with fixed and unchanging data or context.
Common LLM prompting challenges and how to address them
All techniques we covered in this article will help you craft more effective prompts for your generative AI models. However, certain common mistakes can hinder better results. Being aware of them and following my actionable advice will help you become a better professional and improve your metrics.
Ambiguity and vagueness
Crafting an overly complicated or overly simple prompt degrades the model’s output, making it inconsistent or irrelevant. Underestimating or overestimating an LLM’s capabilities is a common mistake. If you provide clear and concrete directions, the model won’t need to guess your intentions.
Task: I am a prompt engineer. I want the AI model to generate a review for a book.
❌ "Write a review about a book."
Too vague: which book? What kind of review: a summary, a critique, …?
❌ "Write an extensive, 1000-word review of a book, focusing on the protagonist’s character development throughout the story, the thematic elements explored in the narrative, the author’s writing style including diction and tone, and how the book’s setting contributes to the overall plot, while comparing it to another book from the same genre, and ensure that the review includes at least three quotes from the book and discusses their significance."
Too overwhelming and complicated. Too many details and a complex narrative may lead to a verbose response or to a response that covers all the requirements without depth.
✅ “Write a review of the book ‘The Handmaid’s Tale’. Discuss the protagonist’s development, the main themes, and the author’s writing style. Provide examples from the book to support your points.”
Directs the model to address important review elements while leaving room for a concise and relevant response—clear but manageable instructions.
Let’s see a hands-on example of reducing ambiguity.
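The following sketch (reusing the generate_text helper from the tutorial; the prompts are the ones from the book-review example above) sends both versions to the model so you can compare the outputs:

vague_prompt = "Write a review about a book."
clear_prompt = (
    "Write a review of the book 'The Handmaid's Tale'. Discuss the protagonist's "
    "development, the main themes, and the author's writing style. Provide examples "
    "from the book to support your points."
)

for label, prompt in [("vague", vague_prompt), ("clear", clear_prompt)]:
    print(f"--- {label} prompt ---")
    print(generate_text(prompt, max_tokens=200))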
Bias and fairness
Prompts can unintentionally introduce or reinforce biases present in the training data, leading to unfair or biased results. Avoid assumptions or stereotypes, consider different perspectives in the development stage, and take accountability if you detect a bias by introducing a plan with corrective measures.
I suggest reading How to Avoid Bias in AI to learn more about bias and how to mitigate it.
Complexity and length
There is often a trade-off between the complexity and length of prompts and the LLM’s performance. Very long prompts can confuse AI models, while very short prompts may not provide enough context.
Try to keep your prompts short but informative enough. Break down complex prompts into simpler subtasks or use intermediate steps through Chain-of-Thought prompts.
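As a sketch of this decomposition (again reusing generate_text; splitting the book-review task into two steps is illustrative):

# Step 1: ask only for the themes
themes = generate_text(
    "List the three main themes of the book 'The Handmaid's Tale' in one sentence each.",
    max_tokens=150,
)

# Step 2: feed the intermediate result into a second, focused prompt
review = generate_text(
    f"Using these themes as a starting point:\n{themes}\n"
    "Write a concise review of 'The Handmaid's Tale' that also discusses "
    "the protagonist's development.",
    max_tokens=300,
)
print(review)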
Lessons learned
As we wrap up this article about prompt engineering, here are the takeaways and next steps to refine your prompt engineering skills.
Throughout this article, we have covered the basic techniques you need to know to start your way in prompt engineering (instruction-based, context-based, and example-based prompts) and the more advanced and experimental techniques that will make you a real expert at creating prompts: Chain-of-Thought prompting, prompt templates, and dynamic prompting.
I’ve also shared practical guidance and hands-on examples to evaluate the efficiency of your prompts with different metrics and A/B testing and the main pitfalls we encounter to get to the desired output.
Finally, I want to share with you the main lessons that I have learned as an NLP researcher and prompt engineer:
- Quantitative metrics have limitations. While quantitative metrics such as diversity, fluency, and perplexity are useful to put numbers on where you stand, they don’t capture the whole picture. They provide valuable insights, but that does not mean they reflect user experience or satisfaction. A prompt with strong technical scores may still fail to connect with users in practical scenarios, and the metrics won’t tell you whether people resonate with your product. At the very beginning of my journey, I asked my friends for their opinions and logged their answers into my experiments manually. In a large-scale project, I recommend creating a survey to ask users for their opinions. A balance between both types of metrics is required to get to the desired output.
- Not all metrics work at all times. Metrics without human interpretation are just numbers. No metric is universal, and they will not apply to your projects at all times. Devoting time to reassessing which metrics are most relevant for your specific needs will save you a lot of time in later stages.
- Embrace iterative improvement. Things usually don’t work perfectly the first time. Creating effective prompts is an iterative process. First attempts may not be perfect, and failing to prove or reject your hypothesis is as valid as any other result. Be prepared to refine and adjust your prompts based on feedback and the results of the evaluation stage. Continually experimenting and learning from each iteration is, in my experience, the best way to go (and it leads to the desired outcome over time).
What’s next?
As you move forward to master more prompt engineering skills, consider the following steps to enhance your efforts:
- Get familiar with AI tools: The best way to internalize today’s lesson is to get familiar with generative AI. Tools like LangChain can streamline your workflow, and learning the fundamentals of AI systems and Natural Language Processing will be valuable for your career. If you are new to the field, explore certification programs like the AWS AI & ML Scholarship program.
- Join the community: Attending online webinars about generative AI and networking with other prompt engineers are great ways to start. Also, joining communities like the OpenAI Community Forum can provide new perspectives and innovative ideas and bring you closer to other professionals in the field.
- Stay updated: Breakthroughs in generative AI happen all the time. Websites like Papers with Code and AWS Whitepapers offer the latest papers and implementations in AI and can help keep you up to date. If you want to get your hands dirty, platforms like GitHub host open-source projects where you can contribute and learn.