Hi, I am Mofei!

I Turned a Word-Guessing Game into an AI Skill

I turned a small semantic word-guessing game into an AI skill that can run directly in the terminal. This piece is about why the format fits LLMs so well, and how I constrained the model into becoming a stable game host.

April 22, 2026 at 07:52 PM

A couple of days ago, I stumbled across a very small word-guessing game.

The rules were simple. The system hides a word, and you try to guess it. Each time you guess, it does not tell you the answer directly. It only gives you a percentage, like "your word is 72% close to the answer" or "this one is only 18%." Then you follow those percentages and slowly close in on the answer.

What I thought at that moment was not "this game is fun." What I thought was: this kind of thing feels very suitable for an LLM.

If you look at it carefully, the most important part of this game is not the interface, not the state management, and not even picking the hidden word. There is really only one hard part: when the user throws out a word, the system has to decide how close that word is to the hidden answer. And this kind of "closeness" is not string similarity, nor is it something traditional rules can describe well. It feels more like a fuzzy but mostly stable semantic sense in a human brain.

For example, if the answer is "ocean," then "whale" is usually closer than "astronaut," and "beach" is usually closer than "screwdriver." People make this kind of judgment almost by instinct. But if you ask a normal program to simulate it with a lot of hard rules, it quickly becomes heavy and awkward.

So I built this little thing.

How I turned this game into a skill

From the beginning, I did not plan to make a full mini game, and I did not want to start with a frontend page. The idea in my head was very direct: can this weird little thing become a skill, so people can also play it in the terminal?

So what I made in the end was an AI skill called guess-the-word-game. I even created a separate repo for this kind of thing, called weird-skills-lab. The name fits well: it is a place for small, slightly unserious skills that can still be played inside the terminal.

What I wanted to verify was also very specific: can this kind of gameplay be hosted directly by an AI? If yes, then it does not have to become a web game. A skill is enough. The user only needs to ask the model to use this skill and start playing. The AI then chooses a word, remembers it, takes input each round, scores semantic similarity, and handles the win-or-quit logic. The whole session can run like that.

Why this kind of game matches LLMs so well

If we break this game down, it has at least four requirements:

  • The system has to secretly choose one word and keep it unchanged during the whole session.
  • The user can guess one word or multiple words each turn.
  • If the guess is wrong, the system has to give a semantic closeness score that feels human.
  • Historical guesses need to be merged, deduplicated, sorted, and shown in a stable output format.

Of these four, points 1, 2, and 4 are not hard for a traditional program. The real trouble is point 3.
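Points 1, 2, and 4 really are just ordinary bookkeeping. A minimal sketch of that deterministic part (the names and the toy scoring stub are mine; in the skill, the LLM itself does the scoring):

```python
# A minimal sketch of the deterministic parts: one fixed hidden word per
# session, merge each turn's guesses into the history, dedupe, and show
# the top 10 by score. `score` is a stand-in for the LLM judge.

def update_board(history, guesses, score):
    """Merge new guesses into history (deduped) and return the top 10."""
    for word in guesses:
        word = word.strip().lower()
        if word and word not in history:  # duplicates keep their old score
            history[word] = score(word)
    return sorted(history.items(), key=lambda kv: -kv[1])[:10]

# toy judge so the sketch runs on its own
toy_scores = {"whale": 85, "beach": 70, "screwdriver": 5}
history = {}
board = update_board(history, ["whale", "screwdriver"],
                     lambda w: toy_scores.get(w, 0))
```

A few lines like these cover the whole mechanical side of the game. Everything interesting hides inside `score`.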

Should "whale" be 88% or 76% close to "ocean"? Is "beach" closer than "harbor"? Is an abstract word like "freedom" related to "bird" or not? This is not a very good problem for hard-coded rules.

Of course, you can build a word list, generate embeddings, calculate vector similarity, and then add another layer of rules to correct the results. But there is still another problem here: embeddings can solve part of the "are these two words related" question, but they do not always give the kind of human intuition this game actually wants.
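For reference, the core of that traditional path is short to sketch. The "embeddings" below are invented 3-number vectors purely for illustration; a real setup would get vectors from an actual embedding model:

```python
import math

# Invented 3-number "embeddings" purely for illustration; a real setup
# would fetch vectors from an embedding model instead.
vectors = {
    "ocean":   [0.9, 0.8, 0.1],
    "whale":   [0.8, 0.9, 0.2],
    "beach":   [0.6, 0.4, 0.3],
    "freedom": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def closeness(guess, answer):
    """Map cosine similarity in [-1, 1] onto the game's 0-100 scale."""
    return round((cosine(vectors[guess], vectors[answer]) + 1) / 2 * 100)
```

The math is easy; the hard part is that raw cosine similarity does not always line up with what a human host would feel, which is exactly the gap described above.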

For example, with "universe" and "alien," people usually feel they are obviously related. In this kind of game, the score would probably not be low. But if you only give this to embeddings, the result may not be stable in the way you want. It can calculate similarity, but that does not mean it can capture the feeling of "are you guessing in the right direction" inside this game.

So if your goal is just to make a playable first version, letting an LLM be the judge is much more convenient.

It feels like asking a friend with a lot of common sense in their head to host the game. This friend may not be perfectly objective every time, but usually has a pretty good intuition about whether one word is close to another. For a light game like this, that intuition is already enough.

And this host can already talk, so it is not only calculating a score. It can also handle multilingual input, fuzzy expressions, and even recognize whether the user wants to end the game. You can also write all of that with rules by yourself, but then it is easy for a "small fun thing" to slowly become "a product with many edge cases to handle carefully."

The hard part is keeping the model under control

This kind of thing looks very suitable for LLMs, but it also has an obvious risk: the model is too smart, and usually too eager to perform.

If you do not give it enough hard constraints, it can easily start hosting the game and also explaining too much, or even accidentally feeding the answer to you. Then the game is dead immediately.

So what I did later was put it inside a very narrow box.

The core rules I wrote in this skill look roughly like this:

- Secretly choose one hidden word before the game starts and keep it fixed for the entire session.
- If none of the guessed words is correct, score each guessed word from 0 to 100 using human-like semantic relatedness.
- Merge all historical guesses with the current turn.
- Remove duplicates.
- Sort all guesses by score, descending.
- Show at most the top 10 guesses.
- Do not explain the reasoning.
- Do not give hints.
- Do not reveal the answer.
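The win-or-quit side of those rules is, again, the easy part to pin down. A hypothetical sketch of it in ordinary code (the function and message wording are mine, not the skill's actual text):

```python
def handle_turn(state, user_input):
    """Return an end-of-game message, or None to fall through to
    scoring and the merged board. The answer is only ever revealed
    on a win or an explicit quit."""
    text = user_input.strip().lower()
    if text in {"quit", "give up"}:
        return f'You gave up. The word was "{state["answer"]}".'
    if text == state["answer"]:
        return "Correct! You found the word."
    return None  # wrong guess: score it, never hint at the answer

state = {"answer": "ocean"}
```

The constraints in the list exist precisely so the model behaves like this function: terse, format-locked, and silent about the answer until the game actually ends.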

These constraints are actually more important than the sentence "make a word guessing game." The hard part here is not piling up functions. The hard part is making the model keep playing the same stable host under the same rules across many rounds of conversation. You need it to remember that the hidden word cannot change. You need the output format to stay stable every round. And you need to stop it from becoming too talkative. These are not the usual difficulties in traditional business code, but in skill design they become the main topic.

At this point, guess-the-word-game was not only a mini game for me. It also became a small prompt engineering experiment.

It does not have to be built with an LLM, but the first version is very suitable for an LLM

If I answer the original question strictly, then no, this kind of game is not something that can only be built with an LLM.

You can absolutely take a more traditional path: word list, vector representation, similarity calculation, and then another correction layer on top. If you really want to make it into a formal product with high fairness requirements and repeatable results, that path may actually be more reliable.

But what I was thinking about at that time was not how to make the most complete and most engineered version first. I wanted to verify as quickly as possible whether this gameplay worked at all. Under that goal, the advantage of the LLM becomes very obvious. It almost helped me skip the hardest layer: how to make "semantic closeness" feel roughly like human intuition in the first version.

So this game is not something that can "only" be built with an LLM. But if the goal is just to make the first version quickly, using an LLM is much easier.

Later I started thinking that maybe agents in the terminal can also play these small things

The more I thought about it, the more I felt that if agents already live inside the terminal, this environment might also hold some small playful things like this.

Before, when people talked about mini games, the first reaction was usually still web, app, or at least some kind of interface. But when you put it inside an agent, things suddenly become much simpler. It already lives inside a conversation. It already remembers context. It already takes turn-based input and gives feedback. Things like word guessing, Q&A, role play, or even some very light interactive gameplay can naturally fit into this shell.

After thinking about it like this, guess-the-word-game stopped feeling like only one small skill. It was also helping me verify something else: now that people are already interacting with agents in the terminal, can this space also hold things that are not that serious, but still genuinely fun?

That is also why I opened the weird-skills-lab repo. It is not for making a lot of "useful" tools. I just want to see whether this kind of environment can also hold some strange little skills.

This time I only made the smallest version first

This time I did not think about a leaderboard, sharing features, a nice UI, or a database. I only focused on the smallest few things first:

  1. Keep one hidden word fixed, and make sure it does not drift across multiple rounds.
  2. Lock the output format so the model cannot improvise too much.
  3. Prepare a few test words that I can judge by myself, and check whether the semantic ranking is roughly human.
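Step 3 is the one that needs a human eye, but even it can be semi-automated: fix an intuitive ordering of test guesses for a hidden word and check that the judge's scores respect it. A sketch with a toy judge standing in for the model:

```python
# A toy sanity check: given a hidden word and my own intuitive ordering
# of test guesses, verify the judge's scores are strictly decreasing.
# `toy_judge` stands in for the model's 0-100 scoring.

def ranking_is_human(judge, answer, ordered_guesses):
    scores = [judge(word, answer) for word in ordered_guesses]
    return all(a > b for a, b in zip(scores, scores[1:]))

toy = {"whale": 85, "beach": 70, "screwdriver": 5}
toy_judge = lambda word, answer: toy.get(word, 0)

ok = ranking_is_human(toy_judge, "ocean", ["whale", "beach", "screwdriver"])
```

If the model's rankings pass a handful of checks like this, the semantic sense is probably "roughly human" enough for a first version.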

As long as these three things work, the gameplay is probably already playable. The rest of the engineering work can come later. This is also how I built guess-the-word-game this time. It is small enough, so I can verify the idea very directly.

In the end

After finishing this skill, the thought I kept was very simple.

I just happened to see this game, and then I tried it to see whether it could be placed inside the terminal, inside a skill, and inside an agent-style use case.

After this experiment, I feel LLM games are a pretty good direction. They do not always need to become formal products. Sometimes making a small skill like this, so people can casually play with it, is already interesting enough.

If this post gives someone a little inspiration, that is enough for me.

If you want to try it yourself, or just want to see how far these strange little skills can go, I open-sourced the project on GitHub: https://github.com/zmofei/weird-skills-lab.

This post is just my perspective; your input will make it richer!
