kb/data/en.wikipedia.org/wiki/Pattern_recognition-2.md

---
title: "Pattern recognition"
chunk: 3/4
source: "https://en.wikipedia.org/wiki/Pattern_recognition"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T03:56:51.565383+00:00"
instance: "kb-cron"
---


$$p({\rm label}\mid\boldsymbol{x},\boldsymbol{\theta}) = \frac{p(\boldsymbol{x}\mid{\rm label},\boldsymbol{\theta})\,p({\rm label}\mid\boldsymbol{\theta})}{\sum_{L\in\text{all labels}} p(\boldsymbol{x}\mid L)\,p(L\mid\boldsymbol{\theta})}.$$
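For a finite label set, the posterior is just each label's likelihood times its prior, divided by the sum of those products over all labels. The following is a minimal sketch of that calculation; the function name `posterior` and the spam/ham numbers are hypothetical and chosen only for illustration.

```python
def posterior(likelihoods, priors):
    """Return p(label | x) for every label.

    likelihoods[label] stands for p(x | label, theta) and
    priors[label] stands for p(label | theta); both are assumed given.
    """
    # Numerator of Bayes' rule for each label.
    joint = {label: likelihoods[label] * priors[label] for label in priors}
    # Denominator: sum of the same product over all labels.
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("the instance has zero probability under every label")
    return {label: value / evidence for label, value in joint.items()}


# Hypothetical values for a two-label problem.
likelihoods = {"spam": 0.8, "ham": 0.1}
priors = {"spam": 0.25, "ham": 0.75}
post = posterior(likelihoods, priors)
print(post)
```

Although "ham" has the larger prior in this example, the much larger likelihood of the instance under "spam" makes "spam" the more probable label after normalization.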


When the labels are continuously distributed (e.g., in regression analysis), the denominator involves integration rather than summation:


$$p({\rm label}\mid\boldsymbol{x},\boldsymbol{\theta}) = \frac{p(\boldsymbol{x}\mid{\rm label},\boldsymbol{\theta})\,p({\rm label}\mid\boldsymbol{\theta})}{\int_{L\in\text{all labels}} p(\boldsymbol{x}\mid L)\,p(L\mid\boldsymbol{\theta})\,\operatorname{d}L}.$$


The value of $\boldsymbol{\theta}$ is typically learned using maximum a posteriori (MAP) estimation. This finds the best value that simultaneously meets two conflicting objectives: to perform as well as possible on the training data (smallest error rate) and to find the simplest possible model. Essentially, this combines maximum likelihood estimation with a regularization procedure that favors simpler models over more complex models. In a Bayesian context, the regularization procedure can be viewed as placing a prior probability $p(\boldsymbol{\theta})$ on different values of $\boldsymbol{\theta}$. Mathematically:

$$\boldsymbol{\theta}^{*} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}\mid\mathbf{D})$$


where $\boldsymbol{\theta}^{*}$ is the value used for $\boldsymbol{\theta}$ in the subsequent evaluation procedure, and $p(\boldsymbol{\theta}\mid\mathbf{D})$, the posterior probability of $\boldsymbol{\theta}$, is given, up to a normalizing constant that does not depend on $\boldsymbol{\theta}$, by

$$p(\boldsymbol{\theta}\mid\mathbf{D}) \propto \left[\prod_{i=1}^{n} p(y_{i}\mid\boldsymbol{x}_{i},\boldsymbol{\theta})\right] p(\boldsymbol{\theta}).$$
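The effect of the prior on the estimate can be seen in a deliberately simplified case. The sketch below assumes binary labels that do not depend on the features, so $\boldsymbol{\theta}$ reduces to a single probability, and it assumes a Beta(2, 2) prior; the function names, the grid search, and the data are illustrative choices, not part of the method described above.

```python
import math


def log_posterior(theta, labels, a, b):
    """Unnormalized log posterior: log-likelihood plus log of a Beta(a, b) prior."""
    log_lik = sum(math.log(theta if y == 1 else 1.0 - theta) for y in labels)
    log_prior = (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)
    return log_lik + log_prior


def map_estimate(labels, a=2.0, b=2.0, steps=999):
    """Return the grid point in (0, 1) that maximizes the log posterior."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda t: log_posterior(t, labels, a, b))


labels = [1, 1, 1, 0]                    # hypothetical training labels
theta_map = map_estimate(labels)         # regularized by the prior
theta_mle = sum(labels) / len(labels)    # maximum likelihood: 3/4
print(theta_map, theta_mle)
```

Under these assumptions the maximum likelihood estimate is 3/4, while the prior pulls the MAP estimate toward 1/2; the exact maximizer is (a + k - 1) / (a + b + n - 2) = 2/3, which the grid search approximates.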


In the Bayesian approach to this problem, instead of choosing a single parameter vector $\boldsymbol{\theta}^{*}$, the probability of a given label for a new instance $\boldsymbol{x}$ is computed by integrating over all possible values of $\boldsymbol{\theta}$, weighted according to the posterior probability:

$$p({\rm label}\mid\boldsymbol{x}) = \int p({\rm label}\mid\boldsymbol{x},\boldsymbol{\theta})\,p(\boldsymbol{\theta}\mid\mathbf{D})\operatorname{d}\boldsymbol{\theta}.$$
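The integral can be approximated numerically when $\boldsymbol{\theta}$ is low-dimensional. The sketch below assumes the simplest possible case: binary labels that do not depend on the features, a single parameter $\theta$ equal to the probability of label 1, and a Beta(a, b) prior. The function name, the midpoint rule, and the data are illustrative assumptions.

```python
def predictive_probability(labels, a=2.0, b=2.0, steps=10000):
    """Approximate p(label = 1 | D) by integrating theta over its posterior."""
    k = sum(labels)              # number of observed 1-labels
    n = len(labels)
    width = 1.0 / steps
    # Midpoints of equal-width cells covering (0, 1).
    thetas = [(i + 0.5) * width for i in range(steps)]
    # Unnormalized posterior: Bernoulli likelihood times Beta(a, b) prior.
    weights = [t ** (a + k - 1) * (1.0 - t) ** (b + n - k - 1) for t in thetas]
    evidence = sum(weights) * width
    # p(label = 1 | theta) is theta itself in this simplified model.
    return sum(t * w for t, w in zip(thetas, weights)) * width / evidence


labels = [1, 1, 1, 0]            # hypothetical training labels
p_one = predictive_probability(labels)
print(p_one)
```

For this conjugate model the integral has the closed form (a + k) / (a + b + n), here 5/8, which differs from plugging in a single point estimate of $\theta$.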


=== Frequentist or Bayesian approach to pattern recognition ===
The first pattern classifier – the linear discriminant presented by Fisher – was developed in the frequentist tradition. The frequentist approach entails that the model parameters are considered unknown, but objective. The parameters are then computed (estimated) from the collected data. For the linear discriminant, these parameters are precisely the mean vectors and the covariance matrix. The probability of each class $p({\rm label}\mid\boldsymbol{\theta})$ is also estimated from the collected dataset. Note that the usage of 'Bayes' rule' in a pattern classifier does not make the classification approach Bayesian.
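As a rough illustration of frequentist estimation, the sketch below computes class probabilities, class means, and a pooled within-class variance directly from labeled data. It assumes a single scalar feature, so the mean vectors become plain numbers and the covariance matrix becomes one variance; the function name and the data are hypothetical.

```python
from collections import defaultdict


def fit_frequentist(samples):
    """Estimate class priors, class means and pooled variance from (x, label) pairs."""
    groups = defaultdict(list)
    for x, label in samples:
        groups[label].append(x)
    n = len(samples)
    # Class probabilities are relative frequencies in the dataset.
    priors = {label: len(xs) / n for label, xs in groups.items()}
    means = {label: sum(xs) / len(xs) for label, xs in groups.items()}
    # Pooled within-class variance, shared by all classes.
    squared_error = sum((x - means[label]) ** 2 for x, label in samples)
    pooled_var = squared_error / (n - len(groups))
    return priors, means, pooled_var


# Hypothetical one-dimensional training set.
samples = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (6.0, "b"), (8.0, "b")]
priors, means, pooled_var = fit_frequentist(samples)
print(priors, means, pooled_var)
```

Every quantity here is computed from the data alone; no prior belief about the parameters enters the estimate.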
Bayesian statistics has its origin in Greek philosophy where a distinction was already made between the 'a priori' and the 'a posteriori' knowledge. Later Kant defined his distinction between what is a priori known – before observation – and the empirical knowledge gained from observations. In a Bayesian pattern classifier, the class probabilities $p({\rm label}\mid\boldsymbol{\theta})$ can be chosen by the user and are then a priori. Moreover, experience quantified as a priori parameter values can be weighted with empirical observations, using, for example, the Beta (conjugate prior) and Dirichlet distributions. The Bayesian approach facilitates a seamless intermixing between expert knowledge in the form of subjective probabilities, and objective observations.
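The weighting of prior experience against observations can be sketched with a Beta prior on a single class probability. In the example below, the expert's belief is encoded as pseudo-counts and updated with observed counts; the function names and all numbers are hypothetical.

```python
def beta_update(prior_a, prior_b, positives, negatives):
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    return prior_a + positives, prior_b + negatives


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Expert belief: class probability near 0.3, worth about 10 observations.
prior_a, prior_b = 3.0, 7.0
# Hypothetical data: 8 of 10 observed instances belong to the class.
post_a, post_b = beta_update(prior_a, prior_b, positives=8, negatives=2)
prior_mean = beta_mean(prior_a, prior_b)
post_mean = beta_mean(post_a, post_b)
print(prior_mean, post_mean)
```

The posterior mean lies between the expert's value of 0.3 and the observed frequency of 0.8, and it moves closer to the observed frequency as more data are collected.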
Probabilistic pattern classifiers can be used according to a frequentist or a Bayesian approach.

== Uses ==

Within medical science, pattern recognition is the basis for computer-aided diagnosis (CAD) systems. CAD describes a procedure that supports the doctor's interpretations and findings. Other typical applications of pattern recognition techniques are automatic speech recognition, speaker identification, classification of text into several categories (e.g., spam or non-spam email messages), the automatic recognition of handwriting on postal envelopes, automatic recognition of images of human faces, or handwriting image extraction from medical forms. The last two examples form the subtopic image analysis of pattern recognition that deals with digital images as input to pattern recognition systems.
Optical character recognition is an example of the application of a pattern classifier. The method of signing one's name was captured with stylus and overlay starting in 1990. The strokes, speed, relative min, relative max, acceleration and pressure are used to uniquely identify and confirm identity. Banks were first offered this technology, but were content to collect from the FDIC for any bank fraud and did not want to inconvenience customers.
Pattern recognition has many real-world applications in image processing. Some examples include: