title	chunk	source	category	tags	date_saved	instance
AIXI	1/2	https://en.wikipedia.org/wiki/AIXI	reference	science, encyclopedia	2026-05-05T14:37:17.710025+00:00	kb-cron

AIXI is a theoretical mathematical formalism for artificial general intelligence. It combines Solomonoff induction with sequential decision theory. AIXI was first proposed by Marcus Hutter in 2000, and several results regarding AIXI are proved in Hutter's 2005 book Universal Artificial Intelligence.

AIXI is a reinforcement learning (RL) agent. It maximizes the expected total reward received from the environment. Intuitively, it simultaneously considers every computable hypothesis (or environment). In each time step, it looks at every possible program and evaluates how much reward that program generates depending on the next action taken. The promised rewards are then weighted by the subjective belief that this program constitutes the true environment. This belief is computed from the length of the program: longer programs are considered less likely, in line with Occam's razor. AIXI then selects the action that has the highest expected total reward in the weighted sum of all these programs.
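The length-based weighting described above can be illustrated with a minimal Python sketch. It is not AIXI itself, which ranges over all programs and is incomputable; it assumes a hand-written class of three hypothetical "programs", each with a made-up length in bits and a fixed reward per action, and looks only one step ahead. The names `PROGRAMS`, `occam_weight`, `weighted_reward` and `best_action` are illustrative and do not come from the source.

```python
from fractions import Fraction

# Hypothetical toy "programs": (name, length in bits, reward promised per action).
# These stand in for the infinite class of all programs; the lengths are invented.
PROGRAMS = [
    ("left_pays", 2, {"left": 1, "right": 0}),
    ("right_pays", 3, {"left": 0, "right": 1}),
    ("right_pays_double", 5, {"left": 0, "right": 2}),
]


def occam_weight(length_bits):
    """Unnormalised belief 2^-length: longer programs count for less."""
    return Fraction(1, 2 ** length_bits)


def weighted_reward(action):
    """Sum over programs of (belief in the program) * (reward it promises)."""
    return sum(occam_weight(n) * rewards[action] for _, n, rewards in PROGRAMS)


def best_action(actions=("left", "right")):
    """Pick the action with the highest weighted reward."""
    return max(actions, key=weighted_reward)
```

With these assumed numbers, `weighted_reward("left")` is 1/4 and `weighted_reward("right")` is 1/8 + 2/32 = 3/16, so `best_action()` returns `"left"`: the shortest program dominates even though a longer one promises a larger reward.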

== Etymology ==
According to Hutter, the word "AIXI" can have several interpretations. AIXI can stand for AI based on Solomonoff's distribution, denoted by $\xi$ (which is the Greek letter xi), or, e.g., it can stand for AI "crossed" (X) with induction (I). There are other interpretations.

== Definition ==
AIXI is a reinforcement learning agent that interacts with some stochastic and unknown but computable environment $\mu$. The interaction proceeds in time steps, from $t=1$ to $t=m$, where $m \in \mathbb{N}$
is the lifespan of the AIXI agent. At time step $t$, the agent chooses an action $a_t \in \mathcal{A}$ (e.g. a limb movement) and executes it in the environment, and the environment responds with a "percept" $e_t \in \mathcal{E} = \mathcal{O} \times \mathbb{R}$, which consists of an "observation" $o_t \in \mathcal{O}$ (e.g., a camera image) and a reward $r_t \in \mathbb{R}$, distributed according to the conditional probability $\mu(o_t r_t \mid a_1 o_1 r_1 \ldots a_{t-1} o_{t-1} r_{t-1} a_t)$, where $a_1 o_1 r_1 \ldots a_{t-1} o_{t-1} r_{t-1} a_t$
is the "history" of actions, observations and rewards. The environment $\mu$ is thus mathematically represented as a probability distribution over "percepts" (observations and rewards) that depends on the full history, so there is no Markov assumption (as opposed to other RL algorithms). Note again that this probability distribution is unknown to the AIXI agent. Furthermore, note again that $\mu$ is computable, that is, the observations and rewards received by the agent from the environment $\mu$
can be computed by some program (which runs on a Turing machine), given the past actions of the AIXI agent. The only goal of the AIXI agent is to maximize $\sum_{t=1}^{m} r_t$, that is, the sum of rewards from time step 1 to $m$. The AIXI agent is associated with a stochastic policy $\pi : (\mathcal{A} \times \mathcal{E})^{*} \rightarrow \mathcal{A}$, which is the function it uses to choose actions at every time step, where $\mathcal{A}$ is the space of all possible actions that AIXI can take and $\mathcal{E}$
is the space of all possible "percepts" that can be produced by the environment. The environment (or probability distribution) $\mu$ can also be thought of as a stochastic policy (which is a function): $\mu : (\mathcal{A} \times \mathcal{E})^{*} \times \mathcal{A} \rightarrow \mathcal{E}$, where the $*$
is the Kleene star operation. In general, at time step $t$ (which ranges from 1 to $m$), AIXI, having previously executed actions $a_1 \dots a_{t-1}$ (which is often abbreviated in the literature as $a_{<t}$) and having observed the history of percepts $o_1 r_1 \ldots o_{t-1} r_{t-1}$ (which can be abbreviated as $e_{<t}$), chooses and executes in the environment the action, $a_t$, defined as follows:

$$a_t := \arg\max_{a_t} \sum_{o_t r_t} \ldots \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)}$$

or, using parentheses, to disambiguate the precedences:

$$a_t := \arg\max_{a_t} \left( \sum_{o_t r_t} \ldots \left( \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \left( \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)} \right) \right) \right)$$
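The nested max/sum structure of this definition can be made concrete with a small Python sketch. It rests on strong simplifying assumptions that are not part of the source: the sum over all programs is replaced by two hand-written deterministic toy programs with invented lengths, and the history before time $t$ is empty. The names `expectimax`, `q_short`, `q_long`, `TOY_PROGRAMS` and `ACTIONS` are hypothetical. Each toy program maps the actions taken so far to the current percept (observation, reward), playing the role of $U(q, a_1 \ldots a_k)$.

```python
from collections import defaultdict
from fractions import Fraction


def q_short(actions):
    """Toy program: pays 1 whenever the latest action is 'stay'."""
    return ("calm", 1 if actions[-1] == "stay" else 0)


def q_long(actions):
    """Toy program: pays 5 only on the second of two consecutive 'go' actions."""
    return ("calm", 5 if actions == ("go", "go") else 0)


# (program, 2^-length) pairs; the lengths 1 and 2 bits are invented for illustration.
TOY_PROGRAMS = [(q_short, Fraction(1, 2 ** 1)), (q_long, Fraction(1, 2 ** 2))]
ACTIONS = ("stay", "go")


def expectimax(actions, live, reward_so_far, steps_left, action_set):
    """Evaluate the nested max/sum; return (value, best next action).

    live holds the (program, weight) pairs still consistent with the history.
    """
    if steps_left == 0:
        # Innermost term: [r_t + ... + r_m] times the total weight of consistent programs.
        return reward_so_far * sum(w for _, w in live), None
    best_value, best_action = None, None
    for a in action_set:  # max over the next action
        extended = actions + (a,)
        groups = defaultdict(list)  # sum over percepts: group programs by their output
        for q, w in live:
            groups[q(extended)].append((q, w))
        total = Fraction(0)
        for (_obs, reward), consistent in groups.items():
            value, _ = expectimax(extended, consistent, reward_so_far + reward,
                                  steps_left - 1, action_set)
            total += value
        if best_value is None or total > best_value:
            best_value, best_action = total, a
    return best_value, best_action
```

With lifespan $m = 2$, `expectimax((), TOY_PROGRAMS, 0, 2, ACTIONS)` returns `(Fraction(5, 4), 'go')`: giving up the immediate reward is worthwhile because the longer program's delayed payoff, weighted by 1/4, outweighs the shorter program's two small rewards. With a lifespan of 1 the same call returns `(Fraction(1, 2), 'stay')`. Percepts that no program produces carry zero weight and are simply skipped, and the weights are left unnormalised because a constant factor does not change the arg max.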
