
Machine Learning and Deep Learning Series, Part 3: Reinforcement Learning (1), Introduction to Reinforcement Learning


Introduction to Reinforcement Learning

David Silver of DeepMind once summed it up in a formula:

[Formula image not preserved.]

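After Master, the upgraded version of AlphaGo developed by DeepMind, defeated the world Go champion, the reinforcement learning ideas behind it attracted wide attention. They also drew in many curious readers who want to find out why reinforcement learning is so powerful. This column is built mainly around two courses: the UCL reinforcement learning video lectures given by David Silver, a principal author of **the famous Go program Master**, and the deep reinforcement learning course by Professor Hung-yi Lee (李宏毅) of National Taiwan University. It gives a fairly systematic and comprehensive introduction to the main ideas and algorithms of reinforcement learning.
Recommended textbooks: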


  An Introduction to Reinforcement Learning, Sutton and Barto, 1998

  Algorithms for Reinforcement Learning, Szepesvari, 2009


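1. Concepts

Reinforcement learning appears in different forms, and under different names, in different fields: neuroscience, psychology, computer science, engineering, mathematics, economics, and others.


Reinforcement learning is one branch of machine learning, alongside supervised learning and unsupervised learning.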


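Characteristics of reinforcement learning:


There is no supervised data, only a reward signal.

The reward signal is not necessarily immediate. It is often delayed, sometimes by a long time.

Time (sequence) is an important factor.

The agent's current actions affect the data it receives afterwards.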


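Reinforcement learning has a wide range of applications, such as helicopter stunt flying (covered in the final part of Andrew Ng's machine learning course), classic games, investment management, power station control, and teaching robots to walk like humans.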


2. The Reinforcement Learning Problem
(1) Reward

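A reward $R_{t}$ is a scalar feedback signal that indicates how well the agent is doing at time step $t$. The agent's job is to maximize cumulative reward.

Reinforcement learning rests on the "reward hypothesis": the goal of any problem can be described as the maximization of cumulative reward.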


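(2) Sequential Decision Making

Goal: select a sequence of actions that maximizes total future reward.


The sequence of actions may be long.


Rewards may be, and usually are, delayed.


It is sometimes better to sacrifice immediate (short-term) reward to gain more long-term reward, which is what makes the approach admirably far-sighted.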
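The trade-off between immediate and long-term reward can be sketched with a toy two-step problem. Everything here is a made-up illustration: the action names, the reward table `REWARDS`, and the helper functions are assumptions, not part of the course material.

```python
from itertools import product

# Hypothetical two-step problem: REWARDS[first_action][second_action] holds
# the (immediate, delayed) rewards. All numbers are invented for illustration.
REWARDS = {
    "grab": {"wait": (5.0, 0.0), "work": (5.0, 1.0)},
    "invest": {"wait": (-2.0, 0.0), "work": (-2.0, 12.0)},
}


def total_reward(first: str, second: str) -> float:
    """Cumulative reward of a two-action sequence."""
    immediate, delayed = REWARDS[first][second]
    return immediate + delayed


def greedy_first_action() -> str:
    """Pick the first action by immediate reward only (short-sighted)."""
    return max(REWARDS, key=lambda a: max(r[0] for r in REWARDS[a].values()))


def best_sequence() -> tuple:
    """Pick the action sequence with the largest cumulative reward."""
    sequences = product(REWARDS, ["wait", "work"])
    return max(sequences, key=lambda seq: total_reward(*seq))


print(greedy_first_action())  # the short-sighted choice
print(best_sequence())        # the far-sighted choice
```

In this table the short-sighted rule picks the action with the higher immediate payoff, while the sequence with the highest cumulative reward starts with an action that costs reward up front.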


(3) Agent & Environment

At time step $t$:
The agent:

  receives an observation $O_{t}$ of the environment,
  executes an action $A_{t}$,
  receives a reward signal $R_{t}$ from the environment.

The environment:

  receives the agent's action $A_{t}$,
  updates its own information and lets the agent obtain the next observation $O_{t+1}$,
  gives the agent a reward signal $R_{t+1}$.
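The interaction described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the coin-guessing environment, the `RepeatAgent` rule, and all class and method names are hypothetical and chosen only to show the order of the exchange.

```python
import random


class CoinEnvironment:
    """Toy environment (hypothetical): the agent guesses a hidden coin face."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._coin = self._rng.choice(["heads", "tails"])  # hidden environment state

    def first_observation(self) -> str:
        return "start"

    def step(self, action: str):
        """Receive A_t, update the environment, emit O_{t+1} and R_{t+1}."""
        reward = 1.0 if action == self._coin else 0.0
        observation = self._coin  # reveal the face the agent was guessing
        self._coin = self._rng.choice(["heads", "tails"])  # update hidden state
        return observation, reward


class RepeatAgent:
    """Toy agent (hypothetical): repeats the last coin face it observed."""

    def act(self, observation: str) -> str:
        return observation if observation in ("heads", "tails") else "heads"


def run(steps: int = 5):
    env, agent = CoinEnvironment(), RepeatAgent()
    observation, history = env.first_observation(), []
    for _ in range(steps):
        action = agent.act(observation)              # agent: O_t -> A_t
        next_observation, reward = env.step(action)  # environment: A_t -> O_{t+1}, R_{t+1}
        history.append((observation, action, reward))
        observation = next_observation
    return history


for record in run():
    print(record)
```

Each printed record is one (observation, action, reward) step, so the list returned by `run` is a small example of the history discussed next.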

(3) History & State
History

The history is the sequence of observations, actions, and rewards:





$$H_{t} = O_{1}, R_{1}, A_{1}, \dots, O_{t-1}, R_{t-1}, A_{t-1}, O_{t}, R_{t}, A_{t}$$


State

The state is the information used to determine what happens next. It is a function of the history:

$$S_{t} = f(H_{t})$$


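Environment state

This is the environment's private representation: all the data the environment uses to determine the next observation and reward. It is usually not fully visible to the agent, so the agent often does not know every detail of the environment state. Even when the environment state is fully visible, it may contain information that is irrelevant to the agent.


Agent state

This is the agent's internal representation: all the information the agent can use to decide its future actions. The agent state is the information that reinforcement learning algorithms work with, and it can be any function of the history: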





$$S^{a}_{t} = f(H_{t})$$


Information state

An information state contains all the useful information from the history. It is also called a Markov state.


Markov Property
A state $S_{t}$ is Markov if and only if:




$$P[S_{t+1} \mid S_{t}] = P[S_{t+1} \mid S_{1}, S_{2}, \dots, S_{t}]$$

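In other words, once the information state is known, the entire history can be thrown away: the information state at time $t$ is all that is needed. For example, the environment state is Markov, because it contains all the information the environment uses to determine the next observation and reward. Likewise, the (complete) history $H_{t}$ is Markov.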
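The Markov property can be checked exactly on a small chain. The sketch below is an illustration under assumptions: the two weather states and the transition probabilities are invented, and `conditional` is a hypothetical helper that enumerates every history rather than sampling.

```python
from fractions import Fraction
from itertools import product

# Hypothetical two-state chain; the probabilities are made up for illustration.
STATES = ("sunny", "rainy")
START = {"sunny": Fraction(1, 2), "rainy": Fraction(1, 2)}
TRANSITION = {
    "sunny": {"sunny": Fraction(4, 5), "rainy": Fraction(1, 5)},
    "rainy": {"sunny": Fraction(2, 5), "rainy": Fraction(3, 5)},
}


def path_probability(path) -> Fraction:
    """Exact probability of a state sequence S_1, ..., S_n under the chain."""
    prob = START[path[0]]
    for prev, nxt in zip(path, path[1:]):
        prob *= TRANSITION[prev][nxt]
    return prob


def conditional(next_state, given_suffix, length=3) -> Fraction:
    """P[S_{length+1} = next_state | the history ends with given_suffix],
    computed by enumerating every history of the given length."""
    num = den = Fraction(0)
    for history in product(STATES, repeat=length):
        if history[-len(given_suffix):] != tuple(given_suffix):
            continue
        den += path_probability(history)
        num += path_probability(history + (next_state,))
    return num / den


# Conditioning on S_3 alone versus on the full history (S_1, S_2, S_3).
print(conditional("rainy", ["sunny"]))
print(conditional("rainy", ["rainy", "rainy", "sunny"]))
```

Both calls return the same probability, because in this chain the next state depends only on the current one, so the earlier states add no information.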
Example: the Markov property


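Consider three sequences of events experienced by a rat. The first two end with the rat receiving an electric shock and a piece of cheese, respectively. Compare the three sequences and decide: at the end of the third sequence, does the rat get the shock or the cheese?


If the agent state is the last three events in the sequence (not counting the shock or the cheese; the same applies below), what is the outcome of sequence 3? (Answer: electric shock.)


If the agent state is the number of times the light, the bell, and the lever each occurred, what is the outcome of sequence 3? (Answer: cheese.)
If the agent state is the complete sequence of events, what is the outcome? (Answer: unknown.)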
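The three answers above can be reproduced with a short sketch. The figure showing the event sequences is not in the text, so the sequences below are an assumption, written down as I recall them from the lecture slide. The function names are hypothetical too.

```python
from collections import Counter

# Assumed event sequences (the original figure is missing from the text).
SEQ1 = ["light", "light", "lever", "bell"]  # ended with an electric shock
SEQ2 = ["bell", "light", "lever", "lever"]  # ended with cheese
SEQ3 = ["lever", "light", "lever", "bell"]  # outcome to be predicted
EXPERIENCE = [(SEQ1, "shock"), (SEQ2, "cheese")]


def last_three(seq):
    """Agent state = the last three events."""
    return tuple(seq[-3:])


def event_counts(seq):
    """Agent state = how many times each event occurred."""
    return frozenset(Counter(seq).items())


def full_sequence(seq):
    """Agent state = the complete sequence."""
    return tuple(seq)


def predict(state_fn, seq):
    """Return the outcome seen earlier with the same agent state, else 'unknown'."""
    for past_seq, outcome in EXPERIENCE:
        if state_fn(past_seq) == state_fn(seq):
            return outcome
    return "unknown"


for fn in (last_three, event_counts, full_sequence):
    print(fn.__name__, "->", predict(fn, SEQ3))
```

The same experience leads to three different predictions, which is the point of the example: what the agent expects depends on which function of the history it uses as its state.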


(4) Fully Observable Environments

The agent directly observes the environment state. In this case:


agent's observation = agent state = environment state


Formally, this kind of problem is a Markov decision process (MDP).


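(5) Partially Observable Environments

The agent observes the environment indirectly. Some examples:


A robot with a camera cannot tell its absolute position from what it sees of its surroundings. It has to estimate its own absolute position, which is one of the most important features of the environment state.
A trader sees only the current trading prices.
A poker player sees only their own cards and the cards already played, not the state of the whole environment (which includes the opponents' cards).
In this case:


agent state ≠ environment state


Formally, this kind of problem is a partially observable Markov decision process (POMDP). The agent must construct its own state representation, for example by remembering the complete history: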





$$S^{a}_{t} = H_{t}$$


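This approach is rather primitive and naive. There are other options, for example:


  Beliefs of environment state: although the agent does not know what the environment state actually is, it can use its experience (data) so far and take a probability distribution over the environment states it knows of as its agent state at the current time step.

  Recurrent neural network: no probabilities are needed. The agent's existing state and its observation at the current time step are fed into a recurrent neural network (RNN), which outputs the new agent state.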
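The formulas that accompanied these two options were images and are not in the text. As a hedged reconstruction in the notation of David Silver's lecture, they are usually written as shown below. The environment-state symbol $S^{e}_{t}$, the candidate states $s^{1}, \dots, s^{n}$, the nonlinearity $\sigma$, and the weight matrices $W_{s}$ and $W_{o}$ are assumptions introduced here.

$$S^{a}_{t} = \left(P[S^{e}_{t} = s^{1}], \dots, P[S^{e}_{t} = s^{n}]\right)$$

$$S^{a}_{t} = \sigma\left(S^{a}_{t-1} W_{s} + O_{t} W_{o}\right)$$

The first line is the belief state: a vector of probabilities, one per candidate environment state. The second is the recurrent update: the new agent state is computed from the previous agent state and the current observation.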


3. Major Components of an RL Agent

An RL agent may include one or more of the following three components:


(1) Policy

A policy is the mechanism that determines the agent's behavior. It is a mapping from state to action, and it can be deterministic or stochastic.


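(2) Value Function

A value function is a prediction of future reward, used to evaluate how good or bad the current state is. Faced with two different states, the agent can use their values to compare the final reward each is likely to yield, which in turn guides its choice of action, that is, its policy. A value function is always defined with respect to a particular policy: the same state can have different values under different policies. The value function under a given policy is written as: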
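The formula itself was an image and is not in the text. As a hedged reconstruction in the notation of David Silver's lecture, the value of state $s$ under policy $\pi$ is usually written as shown below. The discount factor $\gamma$ is an assumption introduced here, since the text does not mention discounting.

$$v_{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_{t} = s\right]$$

The subscript $\pi$ on the expectation makes the dependence on the policy explicit: the rewards are those collected when the agent follows $\pi$ from state $s$ onward.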


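(3) Model

A model is the agent's representation of the environment. It captures how the agent thinks the environment works, and the agent wants it to mimic how the environment interacts with the agent.


A model has to address at least two questions. The first is the state transition probability, that is, predicting the probability of each possible next state:


The second is predicting the immediate reward that may be received: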
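Both formulas were images and are not in the text. As a hedged reconstruction in the notation of David Silver's lecture, the transition model $\mathcal{P}$ and the reward model $\mathcal{R}$ are usually written as shown below. The symbols $\mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{a}_{s}$ are assumptions introduced here.

$$\mathcal{P}^{a}_{ss'} = P[S_{t+1} = s' \mid S_{t} = s, A_{t} = a]$$

$$\mathcal{R}^{a}_{s} = \mathbb{E}[R_{t+1} \mid S_{t} = s, A_{t} = a]$$

The first gives the probability of landing in state $s'$ after taking action $a$ in state $s$. The second gives the expected immediate reward for that same state and action.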


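A model is not required to build an agent. In many reinforcement learning algorithms the agent does not try to build (or rely on) a model.


Note: the term "model" applies only to the agent. The actual mechanism by which the environment operates is not called a model but the dynamics of the environment. It determines exactly the agent's next state and the immediate reward received.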


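4. Categorizing RL Agents

To solve a reinforcement learning problem, an agent can combine several tools: for example, it can estimate the value of states, or it can estimate a policy directly. These are the tools in the agent's toolbox. Classified by which tools they contain, agents fall into three categories:


  Value Based: the agent has a value function for states but no explicit policy function. The policy is derived indirectly from the value function.
  Policy Based: the agent's actions are produced directly by a policy function, and the agent does not maintain a value estimate for each state.
  Actor-Critic: the agent has both a value function and a policy function, and the two work together to solve the problem.

In addition, agents fall into two broad classes depending on whether they build a model of the environment dynamics:

Model Free: the agent does not try to understand how the environment works and focuses only on the value function and/or the policy function.
Model Based: the agent tries to build a model describing how the environment operates, and uses it to guide updates to the value or policy function.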
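How a value-based agent obtains its policy indirectly can be sketched as follows. This is a minimal illustration under assumptions: it uses made-up action-value estimates per state rather than the state values mentioned above, and `greedy_policy` is a hypothetical helper.

```python
# Hypothetical action-value estimates Q(s, a); the numbers are invented.
ACTION_VALUES = {
    "s1": {"left": 0.2, "right": 0.7},
    "s2": {"left": 0.5, "right": 0.1},
}


def greedy_policy(action_values):
    """Derive a policy indirectly: in each state pick the highest-valued action."""
    return {
        state: max(values, key=values.get)
        for state, values in action_values.items()
    }


print(greedy_policy(ACTION_VALUES))
```

A policy-based agent would instead store and update the state-to-action mapping directly, and an actor-critic agent would keep both the mapping and the value estimates.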


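Many of the figures and formulas in this column come from David Silver's UCL reinforcement learning video lectures and Professor Hung-yi Lee's deep reinforcement learning course at National Taiwan University. Thanks and respect to these classic courses!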


