
title	chunk	source	category	tags	date_saved	instance
AIXI	2/2	https://en.wikipedia.org/wiki/AIXI	reference	science, encyclopedia	2026-05-05T14:37:17.710025+00:00	kb-cron

Intuitively, in the definition above, AIXI considers the sum of the total reward over all possible "futures" up to $m-t$ time steps ahead (that is, from $t$ to $m$), weighs each of them by the complexity of programs $q$ (that is, by $2^{-\textrm{length}(q)}$) consistent with the agent's past (that is, the previously executed actions, $a_{<t}$, and received percepts, $e_{<t}$) that can generate that future, and then picks the action that maximizes expected future rewards. The components of this definition are broken down below.

$o_t r_t$ is the "percept" (which consists of the observation $o_t$ and reward $r_t$) received by the AIXI agent at time step $t$ from the environment (which is unknown and stochastic). Similarly, $o_m r_m$ is the percept received by AIXI at time step $m$ (the last time step where AIXI is active).

$r_t + \ldots + r_m$ is the sum of rewards from time step $t$ to time step $m$, so AIXI needs to look into the future to choose its action at time step $t$.

$U$ denotes a monotone universal Turing machine, and $q$ ranges over all (deterministic) programs on the universal machine $U$, which receives as input the program $q$ and the sequence of actions $a_1 \dots a_m$ (that is, all actions), and produces the sequence of percepts $o_1 r_1 \ldots o_m r_m$. The universal Turing machine $U$ is thus used to "simulate" or compute the environment responses or percepts, given the program $q$ (which "models" the environment) and all actions of the AIXI agent: in this sense, the environment is "computable" (as stated above). Note that, in general, the program which "models" the current and actual environment (where AIXI needs to act) is unknown because the current environment is also unknown.

$\textrm{length}(q)$ is the length of the program $q$ (which is encoded as a string of bits). Note that $2^{-\textrm{length}(q)} = \frac{1}{2^{\textrm{length}(q)}}$. Hence, in the definition above, $\sum_{q:\;U(q,a_1\ldots a_m)=o_1 r_1\ldots o_m r_m} 2^{-\textrm{length}(q)}$ should be interpreted as a mixture (in this case, a sum) over all computable environments (which are consistent with the agent's past), each weighted by its complexity $2^{-\textrm{length}(q)}$. Note that $a_1 \ldots a_m$ can also be written as $a_1 \ldots a_{t-1} a_t \ldots a_m$, and $a_1 \ldots a_{t-1} = a_{<t}$ is the sequence of actions already executed in the environment by the AIXI agent. Similarly, $o_1 r_1 \ldots o_m r_m = o_1 r_1 \ldots o_{t-1} r_{t-1} o_t r_t \ldots o_m r_m$, and $o_1 r_1 \ldots o_{t-1} r_{t-1}$
is the sequence of percepts produced by the environment so far.

Putting all these components together: at time step $t$, AIXI chooses the action $a_t$ at which the function $\sum_{o_t r_t} \ldots \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \sum_{q:\;U(q,a_1\ldots a_m)=o_1 r_1\ldots o_m r_m} 2^{-\textrm{length}(q)}$ attains its maximum.
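
The alternating maximization over actions and summation over percepts can be sketched in Python for a tiny, finite stand-in for the program class. Everything concrete here is an assumption made for illustration: the binary action and percept alphabets, the three toy programs and their bit lengths, and the function names. The real expression sums over all programs on $U$ and is incomputable.

```python
from itertools import product

ACTIONS = (0, 1)
PERCEPTS = tuple(product((0, 1), (0, 1)))  # all (observation, reward) pairs

# Hypothetical toy programs standing in for q; each maps actions to percepts.
def echo(actions):
    return tuple((a, a) for a in actions)

def constant(actions):
    return tuple((0, 0) for _ in actions)

def alternate(actions):
    return tuple((i % 2, 1 if a == i % 2 else 0) for i, a in enumerate(actions))

PROGRAMS = ((echo, 3), (constant, 2), (alternate, 5))  # (program, assumed bit length)

def weight(actions, percepts):
    """Innermost sum: 2^-length(q) over programs that output exactly these percepts."""
    return sum(2.0 ** -n for q, n in PROGRAMS if q(actions) == percepts)

def after_action(actions, percepts, t, m):
    """Sum over the next percept, given that the latest action is already fixed."""
    return sum(value(actions, percepts + (e,), t, m) for e in PERCEPTS)

def value(actions, percepts, t, m):
    """Alternate max over actions and sum over percepts until step m is reached."""
    if len(actions) == m:
        future_reward = sum(r for _, r in percepts[t - 1:])  # r_t + ... + r_m
        return future_reward * weight(actions, percepts)
    return max(after_action(actions + (a,), percepts, t, m) for a in ACTIONS)

def best_action(actions, percepts, m):
    """Pick a_t for t = len(actions) + 1; requires len(actions) < m."""
    t = len(actions) + 1
    scores = {a: after_action(actions + (a,), percepts, t, m) for a in ACTIONS}
    return max(scores, key=scores.get), scores
```

For an empty history and lifetime `m = 2`, `best_action((), (), 2)` picks action `1` in this toy setup, because the short `echo` program rewards it at both steps and carries more weight than the longer `alternate` program. The cost of the recursion grows exponentially with `m - t`, which is one reason the sketch is limited to very small horizons.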

=== Parameters ===

The parameters to AIXI are the universal Turing machine $U$ and the agent's lifetime $m$, which need to be chosen. The latter parameter can be removed by the use of discounting.
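
One way to see why discounting can stand in for a fixed lifetime $m$: if rewards are bounded and weighted by a summable discount sequence, the total stays finite however far ahead the agent looks. The sketch below assumes geometric discounting with a factor `gamma`, which is one common choice and not something the text above specifies; the name `discounted_return` is likewise illustrative.

```python
from typing import Iterable

def discounted_return(rewards: Iterable[float], gamma: float = 0.5) -> float:
    """Sum of gamma**k * r_k over the rewards, with k counted from 0."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With rewards in [0, 1], the total is bounded by 1 / (1 - gamma),
# no matter how long the reward sequence is.
short = discounted_return([1, 1, 1])      # 1 + 0.5 + 0.25
long_run = discounted_return([1] * 20)    # still below the bound of 2
```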

== Optimality ==

AIXI's performance is measured by the expected total reward it receives. AIXI has been proven to be optimal in the following ways.

* Pareto optimality: there is no other agent that performs at least as well as AIXI in all environments while performing strictly better in at least one environment.
* Balanced Pareto optimality: like Pareto optimality, but considering a weighted sum of environments.
* Self-optimizing: a policy $p$ is called self-optimizing for an environment $\mu$ if the performance of $p$ approaches the theoretical maximum for $\mu$ when the length of the agent's lifetime (not time) goes to infinity. For environment classes where self-optimizing policies exist, AIXI is self-optimizing.

It was later shown by Hutter and Jan Leike that balanced Pareto optimality is subjective and that any policy can be considered Pareto optimal, which they describe as undermining all previous optimality claims for AIXI.

However, AIXI does have limitations. It is restricted to maximizing rewards based on percepts as opposed to external states. It also assumes it interacts with the environment solely through action and percept channels, preventing it from considering the possibility of being damaged or modified. Colloquially, this means that it doesn't consider itself to be contained by the environment it interacts with. It also assumes the environment is computable.
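
The Pareto condition in the list above can be checked mechanically when policies and environments are finite in number. The following sketch does that for a made-up table of expected total rewards; the policy names, the numbers, and the helper functions are all hypothetical and serve only to make the definition concrete.

```python
from typing import Dict, Sequence

def dominates(v_a: Sequence[float], v_b: Sequence[float]) -> bool:
    """True if A is at least as good as B everywhere and strictly better somewhere."""
    pairs = list(zip(v_a, v_b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)

def is_pareto_optimal(name: str, table: Dict[str, Sequence[float]]) -> bool:
    """True if no other policy in the table dominates the named policy."""
    return not any(
        dominates(v, table[name]) for other, v in table.items() if other != name
    )

# Made-up expected total rewards of three policies in three environments.
VALUES = {
    "p1": (0.9, 0.4, 0.7),
    "p2": (0.8, 0.4, 0.6),  # never better than p1, worse in two environments
    "p3": (0.2, 0.9, 0.1),  # worse than p1 in two environments, better in one
}
```

In this table `p1` and `p3` are Pareto optimal and `p2` is not. That `p3` qualifies while doing poorly in two of the three environments suggests how little the criterion can constrain a policy on its own.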

== Computational aspects ==

Like Solomonoff induction, AIXI is incomputable. However, there are computable approximations of it. One such approximation is AIXItl, which performs at least as well as the provably best time $t$ and space $l$ limited agent. Another approximation to AIXI with a restricted environment class is MC-AIXI (FAC-CTW) (which stands for Monte Carlo AIXI FAC-Context-Tree Weighting), which has had some success playing simple games such as partially observable Pac-Man.

== See also ==

* Gödel machine

== References ==

"Universal Algorithmic Intelligence: A mathematical top->down approach", Marcus Hutter, arXiv:cs/0701125; also in Artificial General Intelligence, eds. B. Goertzel and C. Pennachin, Springer, 2007, ISBN 9783540237334, pp. 227–290, doi:10.1007/978-3-540-68677-4_8.
