<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Reinforcement Learning on Learn by Tanhdev</title><link>https://learn.tanhdev.com/tags/reinforcement-learning/</link><description>Recent content in Reinforcement Learning on Learn by Tanhdev</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 25 May 2026 08:00:00 +0700</lastBuildDate><atom:link href="https://learn.tanhdev.com/tags/reinforcement-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Preference Alignment: The DPO, KTO, and GRPO Algorithms</title><link>https://learn.tanhdev.com/series/slm-playbook/part-5-preference-alignment/</link><pubDate>Mon, 25 May 2026 08:00:00 +0700</pubDate><guid>https://learn.tanhdev.com/series/slm-playbook/part-5-preference-alignment/</guid><description>Learn how reinforcement learning is used to align LLMs. Compare DPO and KTO, and unpack DeepSeek's GRPO algorithm, which saves 50% of GPU VRAM because it does not need a critic model.</description></item></channel></rss>