Enhancing LLM Capabilities in Creative Idea Generation

Most large language model (LLM) benchmarks focus on tasks like coding or logic. But what about open-ended problems with no straightforward solutions?

Let's say you need help implementing a cache for your web application. Unless your app is trivial, this problem doesn't have a clear immediate answer.

We humans use many techniques to address such issues. Brainstorming is a popular choice. People meet and start suggesting ideas without internal or external judgment.

Could we use LLMs to brainstorm ideas? They help us with many other cognitive tasks, so why not brainstorming?

AI chatbots like ChatGPT can indeed brainstorm, at least somewhat. Ideas generated by these tools are often simple and straightforward. They are a good starting point for a novice in the field. But if your goal is to get ahead of your competition, you need something more creative and original. This is where ChatGPT and similar tools start lagging behind.

Brainstormer

Interested in this problem, I decided to do some research and discovered a paper titled LLM Discussion: Enhancing the Creativity of Large Language Models via Discussion Framework and Role-Play. It presents a new approach to solving open-ended problems with LLMs: simulating a discussion where the AI role-plays different personas. Building on top of this idea, I created a simple brainstorming app, which I now call Brainstormer.

Brainstormer works like this:

  1. You describe an open-ended problem you want to discuss in a simulated brainstorming session.

  2. Brainstormer generates a list of personas relevant to the problem. For example, if the problem you are trying to solve is organizing an in-person Python course, suggested personas might include a professional software engineer, a factory worker who wants to become a software engineer, and a person who has taught such courses in the past.

  3. You let these personas generate ideas.

  4. If you want, you can add your own ideas, on top of which the AI personas can build.
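For illustration, here is a minimal sketch of how such a persona-driven loop could be wired up with the OpenAI Python SDK. The prompts, function names, and model choice are my own assumptions for the example, not Brainstormer's actual implementation.

import random
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI API key in the environment

def generate_personas(problem: str, n: int = 3) -> list[str]:
    # Ask the model for personas relevant to the problem (hypothetical prompt wording).
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Suggest {n} personas who could contribute ideas for this problem, "
                       f"one per line:\n<problem>{problem}</problem>",
        }],
    )
    return response.choices[0].message.content.strip().splitlines()[:n]

def next_idea(problem: str, persona: str, previous_ideas: list[str]) -> str:
    # One brainstorming turn: the chosen persona builds on the earlier ideas.
    history = "\n".join(f"- {idea}" for idea in previous_ideas) or "(no ideas yet)"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"You are {persona} in a brainstorming session about:\n"
                       f"<problem>{problem}</problem>\nIdeas so far:\n{history}\n"
                       f"Suggest one new idea that builds on the discussion.",
        }],
    )
    return response.choices[0].message.content.strip()

problem = "How to organize an in-person Python course for beginners?"
personas = generate_personas(problem)
ideas: list[str] = []  # your own ideas (step 4) can simply be appended here
for _ in range(5):
    ideas.append(next_idea(problem, random.choice(personas), ideas))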

This virtual brainstorming session, with AI personas suggesting ideas, is Brainstormer’s foundation. But I didn’t stop there; I added two further improvements.

The first improvement is a simple version of chain-of-thought prompting. I ask the model to first think about the problem and the previous ideas, and only then come up with a new idea. Chain-of-thought prompting is often used to improve the “reasoning” skills of language models.
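The article doesn’t show the exact wording, but a prompt with this twist might look roughly like the following template. The phrasing is an assumption for illustration only.

# Illustrative chain-of-thought prompt template; not Brainstormer's exact wording.
COT_IDEA_PROMPT = """\
You are {persona} in a brainstorming session about:
<problem>{problem}</problem>

Ideas suggested so far:
{previous_ideas}

First, think step by step about the problem and the previous ideas.
Only then state one new idea, on a final line starting with "IDEA:".
"""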

The second improvement is inspired by a paper titled Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. It examines using LLMs to evaluate how well an LLM or a human did on a certain task. The authors suggest using multiple LLMs and then applying an aggregation mechanism to get the final evaluation.

Combining multiple LLMs could also be useful for idea generation. Each LLM is trained a bit differently, so each has its own “personality”. When generating an idea, Brainstormer therefore randomly chooses one of three models: GPT-4o, Claude 3.5 Sonnet, or Mistral Large.
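In code, the per-idea model choice can be as simple as the sketch below; the provider names and model identifiers are assumptions for the example, not Brainstormer’s actual wiring.

import random

# One entry per model Brainstormer can draw from; identifiers are illustrative.
MODELS = [
    ("openai", "gpt-4o"),
    ("anthropic", "claude-3-5-sonnet-latest"),
    ("mistral", "mistral-large-latest"),
]

def pick_model() -> tuple[str, str]:
    # Uniformly random choice, so each idea may come from a different "personality".
    return random.choice(MODELS)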

Is it actually helpful?

I wanted to know if ideas generated by Brainstormer are actually better than ideas generated by the three LLMs that Brainstormer uses internally.

But how to actually evaluate idea quality? Inspired by a few creativity tests from psychology, I decided to use three criteria: originality, feasibility, and relevance to the problem.

In psychology, evaluations like this are often performed by a group of human experts. But finding enough evaluators in a short period of time wasn’t feasible. So I used the next best thing. We were just talking about using LLMs to evaluate how an LLM performed at some task, right? Well, I did exactly that.

Experiment design

I first asked Brainstormer to come up with 5 different open-ended problems:

  1. What are some innovative ways we can transform public spaces to nurture meaningful social interactions while also supporting personal health and wellness?

  2. How might we improve employee engagement in a hybrid work environment?

  3. What creative strategies can we develop to tackle food waste and promote sustainable practices within our local community?

  4. How can we support individuals in organizing and prioritizing the overwhelming amount of information they encounter each day?

  5. What innovative approaches can we explore to ensure cultural venues are welcoming and accessible to individuals of all abilities?

Next, I let Brainstormer, “vanilla” GPT-4o, Mistral Large, and Claude 3.5 Sonnet generate 25 ideas for solving each of these problems. That’s 3 LLMs + Brainstormer, 5 problems, and 25 ideas each, so 500 ideas in total.

Temperature was set to 0.5 — hopefully a good balance between “creativity” and the output not being absolute gibberish.

An average AI chatbot user probably doesn’t know much about prompt engineering. So I purposefully used a very simple prompt for the three “vanilla” models:

I want you to brainstorm ideas for solving the following problem:

<problem>{problem}</problem>

Generate about 25 ideas for solving this problem that are creative, innovative, and practical.
First, explain your thinking process and only then list the ideas.
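With the OpenAI SDK, for example, the baseline generation boils down to a single call per problem; roughly something like this sketch, assuming GPT-4o is accessed through the standard chat completions API.

from openai import OpenAI

client = OpenAI()

def vanilla_ideas(problem: str, prompt_template: str) -> str:
    # prompt_template is the simple prompt shown above, with {problem} as a placeholder.
    # Temperature 0.5, as mentioned earlier; one response contains all ~25 ideas.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.5,
        messages=[{"role": "user", "content": prompt_template.format(problem=problem)}],
    )
    return response.choices[0].message.content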

Next, I compared the ideas generated by Brainstormer against ideas generated by the three “vanilla” LLMs. I performed the evaluation against each LLM separately — I first compared Brainstormer with GPT, then with Claude, and finally with Mistral. There is no direct comparison between the three LLMs.

I started the comparison by selecting 20 random pairs of ideas: one idea in each pair was generated by Brainstormer, the other by one of the “vanilla” LLMs. For each pair, I asked the evaluating LLM to score the two ideas on their originality, feasibility, and relevance on a 1 to 5 scale. Here is the prompt I used:

You are an expert in the field of evaluating brainstorming ideas.

There was a brainstorming session to generate ideas for solving the following problem:
<problem>{problem}</problem>

I want you to compare the following two ideas on the following criteria:
- Originality
- Feasibility
- Relevance

Here are the two ideas:

<idea id="a">{idea_a}</idea>

<idea id="b">{idea_b}</idea>

First explain your thinking process and only then compare the ideas.
Then score each idea on the three criteria on a 1 (worst) to 5 (best) scale.

I am again asking the LLM to first explain the thinking process and only then provide the results. To avoid any bias caused by the order of ideas, the assignment of idea IDs is random — sometimes the idea generated by Brainstormer is inserted as idea A and the idea generated by the “vanilla” LLM as idea B, and sometimes it’s the other way around.
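A sketch of what that shuffling might look like; the field names are illustrative, not Brainstormer’s actual data model.

import random

def assign_ids(brainstormer_idea: str, vanilla_idea: str) -> dict[str, str]:
    # Randomly decide which idea becomes "a" and which "b", and remember the mapping
    # so the scores can be attributed back to the right generator afterwards.
    if random.random() < 0.5:
        return {"a": brainstormer_idea, "b": vanilla_idea, "brainstormer_is": "a"}
    return {"a": vanilla_idea, "b": brainstormer_idea, "brainstormer_is": "b"}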

You might ask why two ideas are scored at the same time; it seems unnecessary. From my experience, LLMs tend to be overly positive and supportive. If only a single idea is presented, the evaluating LLM might score it higher than it actually deserves. But if I give the LLM two or more ideas to score, it is more likely to make comparisons and score one of the ideas higher than the other.

Finally, to determine the overall winner, I simply summed up the scores for all 20 idea pairs.

The last question we need to ask is which model should be used as an evaluator. The aforementioned paper Replacing Judges with Juries offers an answer. Any individual LLM can be biased; LLMs often prefer answers generated by themselves, so we should use multiple LLMs. I therefore ran the entire evaluation three times, each time using a different LLM as the evaluator, and then averaged the results.
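The aggregation itself is straightforward; roughly something like the sketch below, where the data layout is my own assumption.

from statistics import mean

CRITERIA = ["originality", "feasibility", "relevance"]

def aggregate(scores_by_evaluator: dict[str, list[dict[str, float]]]) -> dict[str, float]:
    # scores_by_evaluator maps each evaluator model to the 20 per-pair score dicts it
    # produced for one side (e.g. Brainstormer). Sum per evaluator, then average.
    return {
        criterion: mean(
            sum(pair[criterion] for pair in pairs)
            for pairs in scores_by_evaluator.values()
        )
        for criterion in CRITERIA
    }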

The results

Here is a summary of the results (you can find the full list of scores for each individual configuration here):

                  Originality   Feasibility   Relevance
Brainstormer            83.46         68.03       96.53
“vanilla” LLMs          54.77         84.20       74.32

Brainstormer dominates in both originality and relevance to the problem. It lags behind in feasibility, but the difference is much smaller than its advantage in the other two criteria: Brainstormer loses about 16 points in feasibility, while its originality score is higher by more than 28 points and its relevance score by more than 22 points.

Moreover, given that Brainstormer works by simulating a brainstorming session, the ideas it generates are described in more detail, for example:

From my experience working with diverse communities, I've found that multi-generational design elements, like combining playground areas with comfortable seating and exercise equipment for seniors, create natural opportunities for different age groups to interact while staying active. Based on successful projects I've worked on, I strongly advocate for implementing "Social Infrastructure" elements such as movable furniture, chess tables, and community bulletin boards that allow people to customize their social experiences while promoting physical activity. The key is to ensure these spaces reflect the cultural values and needs of the local community - something we discovered through our recent project where installing traditional dance spaces and cultural art elements significantly increased community engagement and physical activity among immigrant populations.

In contrast, “vanilla” LLMs describe each idea in only one or two sentences:

Community Gardens: Develop shared gardening spaces where people can grow plants together, fostering community engagement and providing fresh produce.

I think the elaboration offered by Brainstormer is important to properly assess the pros and cons of each idea, and to understand the context in which the idea makes sense.

Of course, you could modify the prompt sent to any of the “vanilla” LLMs and ask the model to elaborate. But crafting such a prompt requires additional effort from the user.

Another important observation is that the results are consistent across the different LLMs used as Brainstormer’s competitors, across the different problems, and across the different LLMs used as evaluators. So the biases mentioned in the Replacing Judges with Juries paper didn’t really show up.

What’s next

Idea feasibility is Brainstormer’s most obvious weakness. I will examine the issue in more depth and try to come up with a way to improve performance on this criterion. But the truth is that this effort might be futile. It might turn out that originality and feasibility are, at least to some degree, mutually exclusive, and the user simply has to make trade-offs.

Idea generation time could also be improved. Compared to “vanilla” LLMs, Brainstormer is rather slow. This issue can be addressed both by tweaking the sequence of prompts used to generate an idea and by using faster models. There are LLM providers, such as SambaNova, Groq, or Cerebras, that offer much faster inference than established brands such as OpenAI or Anthropic. That said, higher originality and relevance might be worth the wait.

Do you find the ideas presented in this article interesting? Of course, you do, otherwise you wouldn’t be reading the very last paragraph. Jokes aside, if this is indeed the case, I’d like to invite you to try Brainstormer and let me know what you think. Feedback and feature suggestions are highly appreciated.


Bonus round: Some problems I asked Brainstormer to solve

  • I am thinking about the idea of organizing an in-person Python course in Ostrava. The course would be aimed at beginners. I have taught a similar Python course at the university. My students were from the faculty of economics. But I feel like working at the university is quite limiting, mainly because of all the bureaucracy. So I want to create my own course. But I am not sure where to start. I am thinking about creating some sort of MVP, but I am not sure what's the best approach.

  • I want to organize a brainstorming session at my university about how we can improve both research and education at my department. But I haven't done anything like that before. How do I organize a successful session? What pitfalls should I avoid? Whom should I invite?

  • How to use fuzzy sets to improve sentiment analysis?

  • I want to contribute to AI safety research while living in Czechia. At most, I am willing to move to a close EU country. I am more interested in the technical part. However, I don't know about any institutions in this region that focus on this area. Am I missing some? Where do I start in general?

  • Science in the world in general seems to be in trouble, but let's talk about Czechia specifically. I hold a PhD in Systems engineering and informatics. I studied at the Technical University of Ostrava. And I feel like Czech science is overly bureaucratic, uses old processes, isn't very international, and the output of scientists is measured in the wrong way; I am sure you can list many other problems. The question is: How can I, as an individual with a PhD and not much else, improve the situation? I was thinking about becoming an independent scientist. But I am not sure if it's the best option and how exactly I should do it.

  • I am using LLMs in my Python application, provided through 3rd-party APIs. The LLMs are used in live interaction with a user. However, the APIs sometimes fail for various reasons: the servers might be overloaded, I might have run out of credits, etc. How do I handle such errors? Ideally, I would like to make the errors invisible to the user. Just for context, the application I am developing uses LLMs to generate ideas to solve open-ended problems. Also, the app already chooses randomly between multiple LLM providers when generating a new idea.