Reinforcement learning is how a computer learns by trial and error, kind of like training a dog with treats.

Imagine you’re learning a video game you’ve never played. Nobody hands you a rulebook. You just start pressing buttons. Some moves earn points (good!) and some get you killed (bad!). Over time, you figure out which actions lead to rewards and which lead to disaster, and you start playing better. You’re not memorizing instructions; you’re learning from the consequences of what you try.

Reinforcement learning works the same way for a computer program, which is usually called an “agent.” The setup looks like this:

  • The agent is the learner (the player).
  • The environment is the world it acts in (the game).
  • The agent takes an action, and the environment responds with a new situation plus a reward (points up or down).
  • The agent’s goal is to rack up the most reward over time.
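To make that loop concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: `CorridorEnv` is a hypothetical five-cell hallway where the goal sits at the far right, and the reward numbers (+1.0 for reaching the goal, -0.1 for every other step) are arbitrary choices, not a standard.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a short hallway with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode at the left end and report the starting situation.
        self.position = 0
        return self.position

    def step(self, action):
        # action 1 moves right, anything else moves left; walls stop the agent.
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        # Assumed reward scheme: +1.0 at the goal, a small penalty otherwise.
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode: the agent acts, the environment answers with state and reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent picks an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward                   # reward piles up over time
        if done:
            break
    return total_reward


def random_policy(state):
    # An agent that has learned nothing yet: it just presses buttons.
    return random.choice([0, 1])


def always_right(state):
    # An agent that already knows the trick for this hallway.
    return 1


print("random agent:", run_episode(CorridorEnv(), random_policy))
print("always right:", run_episode(CorridorEnv(), always_right))
```

The always-right agent reaches the goal in four steps and ends with a total of about 0.7, while the random agent usually wanders and scores lower. Learning, in this picture, means moving from the first kind of policy toward the second.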

The tricky part is balancing two instincts. Sometimes you want to exploit what you already know works (use the move that always scores). Other times you should explore something new that might work even better. If the agent only exploits, it can get stuck on a so-so strategy and never find a better one. If it only explores, it never cashes in on what it has learned. So the agent has to mix the two.
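One simple and widely used way to mix exploring and exploiting is called epsilon-greedy: most of the time pick the option that looks best so far, but with a small probability (epsilon) pick at random. The sketch below tries it on a made-up slot-machine problem. The function name `epsilon_greedy_bandit`, the payout averages, and the noise level are all assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Play a hypothetical slot machine with several arms using epsilon-greedy.

    true_means holds the hidden average payout of each arm (the agent never sees it).
    Returns the agent's payout estimates and how often it pulled each arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # the agent's current guess for each arm
    counts = [0] * n_arms        # how many times each arm was tried

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm, even one that looks bad.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Assumed payout model: the arm's hidden average plus random noise.
        reward = rng.gauss(true_means[arm], 1.0)

        # Update the running average for the arm that was pulled.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.0, 1.0, 3.0])
print("pulls per arm:", counts)
print("estimated payouts:", [round(e, 2) for e in estimates])
```

With these made-up payouts, the third arm should end up with by far the most pulls, because exploration finds it and exploitation keeps returning to it. Setting `epsilon=0` removes exploration entirely, so the agent only ever commits to what it has already seen.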

One more wrinkle: rewards are often delayed. In chess, a move you make early might be why you win twenty moves later. The agent has to learn that some actions pay off way down the line, not instantly. It also means a single bad decision may not look bad until much later. Learning to connect those dots, often called the credit assignment problem, is a big part of the challenge.
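A common way to handle delayed rewards is to give each step credit for everything that comes after it, with later rewards counting a little less. The sketch below shows that calculation, usually called the discounted return. The function name `discounted_returns` and the discount value 0.9 are illustrative choices, not fixed rules.

```python
def discounted_returns(rewards, gamma=0.9):
    """For each step, add up its reward plus all future rewards, shrunk by gamma per step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step can reuse the total already computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A four-move game where only the final move is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
print([round(g, 3) for g in discounted_returns(rewards)])
# roughly [0.729, 0.81, 0.9, 1.0]
```

Even though the first three moves earned nothing at the time, each one ends up with a positive score because it led to the win. That is how an early move can get some of the credit for a reward that arrives much later.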

This is how computers learned to beat world champions at Go, how robots learn to walk, and part of how AI chatbots get fine-tuned to give better answers.