Reinforcement Learning
Watch First
Learning Objectives
By the end of this lesson, you will be able to:
- Explain reinforcement learning as sequential decision-making under feedback.
- Define agent, environment, state, action, reward, policy, value, and return.
- Implement a small tabular Q-learning loop.
- Decide when RL is appropriate and when supervised learning or rules are safer.
Agent-Environment Loop
Reinforcement learning is about learning what to do through interaction.
In supervised learning, the model learns from labeled examples. In RL, an agent takes actions, observes rewards, and improves a policy over time.
Do not test risky RL behavior directly on real users. Start with simulation, offline evaluation, guardrails, and human review.
Core Concepts
| Concept | Meaning | Flow Research-style example |
|---|---|---|
| Agent | Learner that chooses actions | recommendation policy |
| Environment | System the agent interacts with | learning platform |
| State | Current situation | learner progress and recent activity |
| Action | Choice available to the agent | recommend lesson, ask mentor, wait |
| Reward | Feedback signal | completion, mastery, healthy engagement |
| Policy | Rule for choosing actions | `pi(a |
| Value | Expected future return | how promising a state or action is |
At timestep t:
The goal is to learn a policy that maximizes expected cumulative reward.
Return and Discounting
The return from timestep t is:
where gamma is the discount factor.
gammanear0values immediate reward.gammanear1values long-term outcomes.
In learning systems, long-term mastery may matter more than immediate clicks. That means reward design is not just math; it encodes product values.
Value Functions
The state-value function estimates how good it is to be in a state:
The action-value function estimates how good it is to take action a in state s:
Many RL algorithms learn or approximate Q(s, a).
Exploration vs. Exploitation
The agent must balance:
- exploitation: choose the action that currently looks best,
- exploration: try actions that might teach the agent something new.
Epsilon-greedy action selection is a simple strategy:
Exploration is useful in simulation. In real products, exploration needs safety constraints.
Q-Learning
Q-learning updates an estimate of action value:
where:
alphais the learning rate,gammais the discount factor,s'is the next state.
Q-Learning From Scratch
This tiny environment is a one-dimensional path. The agent starts at position 0 and gets reward when it reaches the final position.
import numpy as np
n_states = 6
n_actions = 2 # 0=left, 1=right
terminal_state = n_states - 1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(42)
alpha = 0.2
gamma = 0.95
epsilon = 0.2
episodes = 500
def step(state, action):
if action == 1:
next_state = min(state + 1, terminal_state)
else:
next_state = max(state - 1, 0)
reward = 1.0 if next_state == terminal_state else -0.01
done = next_state == terminal_state
return next_state, reward, done
for episode in range(episodes):
state = 0
done = False
while not done:
if rng.random() < epsilon:
action = rng.integers(n_actions)
else:
action = np.argmax(Q[state])
next_state, reward, done = step(state, action)
best_next = np.max(Q[next_state])
td_target = reward + gamma * best_next
td_error = td_target - Q[state, action]
Q[state, action] += alpha * td_error
state = next_state
print(np.round(Q, 2))
print("best actions:", np.argmax(Q, axis=1))
The learned policy should prefer moving right toward the terminal reward.
When RL Is Appropriate
Use RL when:
- actions affect future states,
- feedback arrives through rewards over time,
- you can simulate safely,
- static labels do not capture the decision problem,
- the cost of mistakes is controlled.
Avoid RL when:
- supervised labels are available and sufficient,
- the reward is easy to game,
- real-world exploration could harm users,
- the environment is too expensive or unsafe to simulate,
- a rule-based policy is easier and more reliable.
Flow Research-Style Use Cases
| Use case | Agent | State | Action | Reward |
|---|---|---|---|---|
| Learning path optimization | recommender | learner progress | next lesson | mastery and completion |
| Mentor triage | assistant | learner activity | notify mentor or wait | helpful intervention |
| Governance workflow | proposal router | proposal state | assign review path | timely, fair resolution |
Reward design is the hardest part. A reward that only tracks engagement can accidentally optimize for addictive behavior instead of learning.
Offline and Safe RL
For human-facing systems, start with:
- logged historical data,
- simulators,
- conservative policies,
- human-in-the-loop approval,
- shadow mode before actioning recommendations.
Practical Exercises
Exercise 1: Define an RL Problem
Choose a sequential process and define agent, environment, state, action, reward, and safety constraints.
Exercise 2: Modify Q-Learning
Change the reward, discount factor, and epsilon in the example. Observe how the learned policy changes.
Exercise 3: Reward Audit
Write a note explaining how your reward could be gamed and how you would guard against that.
Self-Assessment
Rate yourself from 1 to 5:
- I can explain how RL differs from supervised learning.
- I can read the return and Q-learning update equations.
- I can run a minimal Q-learning loop.
- I can identify RL safety risks before deployment.
Further Reading
Next Steps
Next, move into research practice. Advanced ML work is not just knowing architectures; it is learning how to replicate, evaluate, and extend research responsibly.