| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Category utility | 2/5 | https://en.wikipedia.org/wiki/Category_utility | reference | science, encyclopedia | 2026-05-05T15:13:02.093903+00:00 | kb-cron |
where $p(c)$ is the prior probability of an entity belonging to the positive category $c$ (in the absence of any feature information), $p(f_i|c)$ is the conditional probability of an entity having feature $f_i$ given that the entity belongs to category $c$, $p(f_i|\bar{c})$ is likewise the conditional probability of an entity having feature $f_i$ given that the entity belongs to category $\bar{c}$, and $p(f_i)$ is the prior probability of an entity possessing feature $f_i$ (in the absence of any category information).

The intuition behind the above expression is as follows: The term
$p(c)\sum_{i=1}^{n} p(f_i|c)\log p(f_i|c)$ represents the cost (in bits) of optimally encoding (or transmitting) feature information when it is known that the objects to be described belong to category $c$. Similarly, the term
$p(\bar{c})\sum_{i=1}^{n} p(f_i|\bar{c})\log p(f_i|\bar{c})$ represents the cost (in bits) of optimally encoding (or transmitting) feature information when it is known that the objects to be described belong to category $\bar{c}$. The sum of these two terms in the brackets is therefore the weighted average of these two costs. The final term,
$\sum_{i=1}^{n} p(f_i)\log p(f_i)$, represents the cost (in bits) of optimally encoding (or transmitting) feature information when no category information is available. The value of the category utility will, in the above formulation, be non-negative.
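The three terms described above can be checked numerically. The following Python sketch assumes that the expression combines them as the bracketed weighted sum minus the final term, that $p(f_i)$ is obtained from the conditional probabilities by the law of total probability, that logarithms are base 2 (so results are in bits), and that $0 \log 0$ is treated as 0. The function names `plogp` and `category_utility` are illustrative and do not come from the source.

```python
import math

def plogp(p):
    """Return p * log2(p), using the convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def category_utility(p_c, p_f_given_c, p_f_given_not_c):
    """Two-category category utility in bits (illustrative sketch).

    p_c             -- prior probability p(c) of the positive category
    p_f_given_c     -- list of p(f_i | c), one entry per feature
    p_f_given_not_c -- list of p(f_i | not c), one entry per feature
    """
    p_not_c = 1.0 - p_c
    # Assumption: p(f_i) follows from the law of total probability.
    p_f = [p_c * a + p_not_c * b
           for a, b in zip(p_f_given_c, p_f_given_not_c)]
    within_c = p_c * sum(plogp(a) for a in p_f_given_c)
    within_not_c = p_not_c * sum(plogp(b) for b in p_f_given_not_c)
    baseline = sum(plogp(f) for f in p_f)
    # Bracketed weighted sum minus the category-free term.
    return (within_c + within_not_c) - baseline

# Feature 1 separates the categories perfectly; feature 2 is uninformative.
print(category_utility(0.5, [1.0, 0.5], [0.0, 0.5]))  # 0.5
```

In this example only the first feature contributes, and the result is non-negative, consistent with the statement above. Passing identical conditional probabilities for both categories gives a value of 0.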
=== Category utility and mutual information ===

Gluck and Corter (1985) and Corter and Gluck (1992) mention that the category utility is equivalent to the mutual information. Here is a simple demonstration of the nature of this equivalence. Assume a set of entities each having the same $n$ features, i.e., feature set $F = \{f_i\},\ i = 1 \ldots n$, with each feature variable having cardinality $m$. That is, each feature has the capacity to adopt any of $m$ distinct values (which need not be ordered; all variables can be nominal); for the special case $m = 2$ these features would be considered binary, but more generally, for any $m$, the features are simply $m$-ary. For the purposes of this demonstration, without loss of generality, feature set
$F$ can be replaced with a single aggregate variable $F_a$ that has cardinality $m^n$, and adopts a unique value $v_i,\ i = 1 \ldots m^n$ corresponding to each feature combination in the Cartesian product $\otimes F$. (Ordinality does not matter, because the mutual information is not sensitive to ordinality.) In what follows, a term such as
$p(F_a = v_i)$ or simply $p(v_i)$ refers to the probability with which $F_a$ adopts the particular value $v_i$. (Using the aggregate feature variable $F_a$ replaces multiple summations, and simplifies the presentation to follow.) For this demonstration, also assume a single category variable
$C$, which has cardinality $p$. This is equivalent to a classification system in which there are $p$ non-intersecting categories. The special case $p = 2$ is the two-category case discussed above. From the definition of mutual information for discrete variables, the mutual information
$I(F_a; C)$ between the aggregate feature variable $F_a$ and the category variable $C$ is given by:
$$I(F_a; C) = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i, c_j) \log \frac{p(v_i, c_j)}{p(v_i)\,p(c_j)}$$
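The double sum in the definition of $I(F_a; C)$ can be evaluated directly from a joint probability table. The sketch below is a minimal Python illustration, assuming the joint distribution is supplied as a dictionary keyed by (aggregate value, category) pairs, that each aggregate value is represented as a tuple of feature values, and that logarithms are base 2. The helper name `mutual_information` and both example distributions are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def mutual_information(joint):
    """Return I(F_a; C) in bits for a joint table {(v, c): p(v, c)}."""
    p_v = defaultdict(float)  # marginal p(v_i)
    p_c = defaultdict(float)  # marginal p(c_j)
    for (v, c), p in joint.items():
        p_v[v] += p
        p_c[c] += p
    # Zero-probability cells contribute nothing to the sum.
    return sum(
        p * math.log2(p / (p_v[v] * p_c[c]))
        for (v, c), p in joint.items()
        if p > 0
    )

# Aggregate variable for n = 2 binary features (m = 2): m**n = 4 values.
values = list(product([0, 1], repeat=2))

# Illustrative joint distribution: the first feature determines the category.
determined = {(v, "c" if v[0] == 1 else "not_c"): 0.25 for v in values}

# Illustrative joint distribution: features and category are independent.
independent = {(v, c): 0.125 for v in values for c in ("c", "not_c")}

print(mutual_information(determined))   # 1.0
print(mutual_information(independent))  # 0.0
```

In the first case the category is fully determined by the features, so the mutual information equals the 1 bit needed to name one of two equiprobable categories; in the second case it is zero.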
In this definition, $p(v_i)$ is the prior probability of feature variable $F_a$ adopting value $v_i$, $p(c_j)$ is the marginal probability of category variable $C$ adopting value $c_j$, and $p(v_i, c_j)$ is the joint probability of variables $F_a$ and $C$ simultaneously adopting those respective values. In terms of the conditional probabilities, the mutual information can be re-written (or defined) as