| title | chunk | source | category | tags | date_saved | instance |
| --- | --- | --- | --- | --- | --- | --- |
| Pattern recognition | 2/4 | https://en.wikipedia.org/wiki/Pattern_recognition | reference | science, encyclopedia | 2026-05-05T06:38:04.628167+00:00 | kb-cron |

Many common pattern recognition algorithms are probabilistic in nature, in that they use statistical inference to find the best label for a given instance. Unlike other algorithms, which simply output a "best" label, probabilistic algorithms often also output the probability of the instance being described by the given label. In addition, many probabilistic algorithms output a list of the N-best labels with associated probabilities, for some value of N, instead of simply a single best label. When the number of possible labels is fairly small (e.g., in the case of classification), N may be set so that the probabilities of all possible labels are output. Probabilistic algorithms have many advantages over non-probabilistic algorithms:

- They output a confidence value associated with their choice. (Note that some other algorithms may also output confidence values, but in general, only for probabilistic algorithms is this value mathematically grounded in probability theory. Non-probabilistic confidence values can in general not be given any specific meaning, and can only be used to compare against other confidence values output by the same algorithm.)
- Correspondingly, they can abstain when the confidence of choosing any particular output is too low.
- Because of the probabilities output, probabilistic pattern-recognition algorithms can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation.
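
As a rough illustration of the N-best output and the abstention behaviour described above, the following Python sketch ranks labels by probability and declines to answer when the top probability falls below a threshold. The function names `n_best` and `predict_or_abstain`, the threshold values, and the example probabilities are assumptions made for illustration only; they are not taken from any particular library.

```python
from typing import Dict, List, Optional, Tuple


def n_best(probs: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n most probable labels, highest probability first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:n]


def predict_or_abstain(probs: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most probable label, or None (abstain) if it is too uncertain."""
    label, p = n_best(probs, 1)[0]
    return label if p >= threshold else None


# Hypothetical label probabilities for a single instance (illustrative values).
posterior = {"cat": 0.55, "dog": 0.40, "bird": 0.05}

top_two = n_best(posterior, 2)                   # the two most probable labels
confident = predict_or_abstain(posterior, 0.5)   # 0.55 >= 0.5, so a label is returned
cautious = predict_or_abstain(posterior, 0.8)    # 0.55 < 0.8, so the result is None
```

Because the values are probabilities, the threshold has a direct interpretation (for example, "answer only when at least 80% sure"), which is not generally true of the uncalibrated scores produced by non-probabilistic methods.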

=== Number of important feature variables ===

Feature selection algorithms attempt to directly prune out redundant or irrelevant features. A general introduction to feature selection, which summarizes approaches and challenges, has been given. Because of its non-monotonic character, feature selection is an optimization problem in which, given a total of $n$ features, the powerset consisting of all $2^{n}-1$ non-empty subsets of features needs to be explored. The Branch-and-Bound algorithm does reduce this complexity but is intractable for medium to large values of the number of available features $n$.
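
To make the size of this search space concrete, here is a minimal Python sketch of brute-force feature selection that enumerates every non-empty subset. The feature names, the `toy_score` function, and its weights are hypothetical; in practice the score would be something like cross-validated accuracy, which is far more expensive to evaluate.

```python
from itertools import combinations
from typing import Callable, Iterator, Sequence, Tuple


def all_feature_subsets(features: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty subset of the features (2**n - 1 in total)."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)


def exhaustive_selection(
    features: Sequence[str], score: Callable[[Tuple[str, ...]], float]
) -> Tuple[str, ...]:
    """Return the highest-scoring subset by evaluating all of them."""
    return max(all_feature_subsets(features), key=score)


features = ["f1", "f2", "f3", "f4"]
subsets = list(all_feature_subsets(features))
print(len(subsets))  # 2**4 - 1 = 15

# Hypothetical score: reward informative features, penalize subset size.
usefulness = {"f1": 1.0, "f3": 0.5}


def toy_score(subset: Tuple[str, ...]) -> float:
    return sum(usefulness.get(f, 0.0) for f in subset) - 0.1 * len(subset)


best = exhaustive_selection(features, toy_score)
```

With 4 features there are only 15 candidates, but the count doubles with each added feature (30 features already give over a billion subsets), which is why exhaustive search is impractical beyond small $n$.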

Techniques to transform the raw feature vectors (feature extraction) are sometimes used prior to application of the pattern-matching algorithm. Feature extraction algorithms attempt to reduce a large-dimensionality feature vector into a smaller-dimensionality vector that is easier to work with and encodes less redundancy, using mathematical techniques such as principal components analysis (PCA). The distinction between feature selection and feature extraction is that the resulting features after feature extraction has taken place are of a different sort than the original features and may not easily be interpretable, while the features left after feature selection are simply a subset of the original features.
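
The distinction drawn above can be sketched in a few lines of Python. Feature selection keeps some original columns unchanged, while feature extraction (here, projection onto the first principal component) produces a new feature that mixes all of them. This is a simplified, standard-library-only illustration: the data values are made up, the helper names are hypothetical, and power iteration stands in for the eigendecomposition or SVD that a real PCA implementation would normally use.

```python
import math
from typing import List

Matrix = List[List[float]]


def center(data: Matrix) -> Matrix:
    """Subtract the per-feature mean from every row."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(cov: Matrix, iters: int = 200) -> List[float]:
    """Approximate the leading eigenvector of cov by power iteration."""
    v = [1.0] * len(cov)
    for _ in range(iters):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Made-up data: 4 instances, 3 features (column 1 is roughly 2 * column 0).
data = [[2.0, 4.1, 0.5], [1.0, 2.0, 0.4], [3.0, 6.2, 0.6], [4.0, 7.9, 0.5]]

# Feature selection: keep original columns 0 and 2 exactly as they are.
selected = [[row[0], row[2]] for row in data]

# Feature extraction: one new feature per instance, a weighted mix of all columns.
centered = center(data)
pc1 = first_principal_component(covariance(centered))
extracted = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
```

The values in `selected` are still the original measurements and keep their meaning, whereas each value in `extracted` is a linear combination with weights `pc1`, which is why extracted features may be harder to interpret.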

== Problem statement ==

The problem of pattern recognition can be stated as follows: Given an unknown function $g:\mathcal{X}\rightarrow\mathcal{Y}$ (the ground truth) that maps input instances $\boldsymbol{x}\in\mathcal{X}$ to output labels $y\in\mathcal{Y}$, along with training data $\mathbf{D}=\{(\boldsymbol{x}_{1},y_{1}),\dots,(\boldsymbol{x}_{n},y_{n})\}$ assumed to represent accurate examples of the mapping, produce a function $h:\mathcal{X}\rightarrow\mathcal{Y}$ that approximates as closely as possible the correct mapping $g$. (For example, if the problem is filtering spam, then $\boldsymbol{x}_{i}$ is some representation of an email and $y$ is either "spam" or "non-spam").

In order for this to be a well-defined problem, "approximates as closely as possible" needs to be defined rigorously. In decision theory, this is defined by specifying a loss function or cost function that assigns a specific value to "loss" resulting from producing an incorrect label. The goal then is to minimize the expected loss, with the expectation taken over the probability distribution of $\mathcal{X}$. In practice, neither the distribution of $\mathcal{X}$ nor the ground truth function $g:\mathcal{X}\rightarrow\mathcal{Y}$ are known exactly, but can be computed only empirically by collecting a large number of samples of $\mathcal{X}$ and hand-labeling them using the correct value of $\mathcal{Y}$ (a time-consuming process, which is typically the limiting factor in the amount of data of this sort that can be collected). The particular loss function depends on the type of label being predicted. For example, in the case of classification, the simple zero-one loss function is often sufficient. This corresponds simply to assigning a loss of 1 to any incorrect labeling and implies that the optimal classifier minimizes the error rate on independent test data (i.e. counting up the fraction of instances that the learned function $h:\mathcal{X}\rightarrow\mathcal{Y}$ labels wrongly, which is equivalent to maximizing the number of correctly classified instances). The goal of the learning procedure is then to minimize the error rate (maximize the correctness) on a "typical" test set.

For a probabilistic pattern recognizer, the problem is instead to estimate the probability of each possible output label given a particular input instance, i.e., to estimate a function of the form

$$p({\rm label}|\boldsymbol{x},\boldsymbol{\theta})=f\left(\boldsymbol{x};\boldsymbol{\theta}\right)$$

where the feature vector input is $\boldsymbol{x}$, and the function $f$ is typically parameterized by some parameters $\boldsymbol{\theta}$. In a discriminative approach to the problem, $f$ is estimated directly. In a generative approach, however, the inverse probability $p(\boldsymbol{x}|{\rm label})$ is instead estimated and combined with the prior probability $p({\rm label}|\boldsymbol{\theta})$ using Bayes' rule, as follows: