

| title | chunk | source | category | tags | date_saved | instance |
| --- | --- | --- | --- | --- | --- | --- |
| Word2vec | 3/4 | https://en.wikipedia.org/wiki/Word2vec | reference | science, encyclopedia | 2026-05-05T10:14:12.458984+00:00 | kb-cron |

== History ==

During the 1980s, there were some early attempts at using neural networks to represent words and concepts as vectors. In 2010, Tomáš Mikolov (then at Brno University of Technology) and co-authors applied a simple recurrent neural network with a single hidden layer to language modelling.

Word2vec was created, patented, and published in 2013 by a team of researchers led by Mikolov at Google, across two papers. The original paper was rejected by reviewers for the ICLR 2013 conference. It also took months for the code to be approved for open-sourcing. Other researchers helped analyse and explain the algorithm.

Embedding vectors created using the Word2vec algorithm have some advantages compared to earlier algorithms such as those using n-grams and latent semantic analysis. GloVe was developed by a team at Stanford specifically as a competitor, and the original GloVe paper noted multiple improvements over word2vec. Mikolov argued that the comparison was unfair because GloVe was trained on more data, and that the fastText project showed that word2vec is superior when trained on the same data.

As of 2022, the straight Word2vec approach was described as "dated". Contextual models such as ELMo (based on bidirectional LSTMs) and the Transformer-based BERT, which add multiple neural-network layers on top of a word embedding model similar to Word2vec, have come to be regarded as the state of the art in natural language processing.

== Parameterization ==

Results of word2vec training can be sensitive to parameterization. The following are some important parameters in word2vec training.

=== Training algorithm ===

A Word2vec model can be trained with hierarchical softmax and/or negative sampling. To approximate the conditional log-likelihood that a model seeks to maximize, the hierarchical softmax method uses a Huffman tree to reduce computation. The negative sampling method, on the other hand, approaches the maximization problem by minimizing the log-likelihood of sampled negative instances. According to the authors, hierarchical softmax works better for infrequent words, while negative sampling works better for frequent words and with low-dimensional vectors. As the number of training epochs increases, hierarchical softmax stops being useful.
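
As an illustration of the negative sampling idea described above, the following is a minimal sketch of the per-pair loss in its commonly stated form: the score of the observed (center, context) pair is pushed up while the scores of a few sampled "negative" words are pushed down. The function names and the toy three-dimensional vectors are assumptions made for illustration and are not taken from the original implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(center, positive, negatives):
    """Loss for one observed (center, context) pair plus sampled negatives.

    Lower is better: the loss falls as the positive pair's score rises
    and as each negative pair's score falls.
    """
    loss = -math.log(sigmoid(dot(center, positive)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


# Hypothetical toy vectors, chosen only to make the example concrete.
center = [0.5, -0.2, 0.1]
good_context = [0.6, -0.1, 0.2]    # points in a similar direction to `center`
bad_context = [-0.6, 0.1, -0.2]    # points in the opposite direction
negatives = [[-0.4, 0.3, 0.0], [-0.1, 0.5, -0.3]]

loss_good = negative_sampling_loss(center, good_context, negatives)
loss_bad = negative_sampling_loss(center, bad_context, negatives)
```

Only the sampled negatives enter the sum, so the cost per training pair depends on the number of negatives rather than on the vocabulary size.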

=== Sub-sampling ===

High-frequency and low-frequency words often provide little information. Words with a frequency above a certain threshold, or below a certain threshold, may be subsampled or removed to speed up training.
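
A minimal sketch of how such filtering might look is given below. For frequent words it uses the discard probability 1 − sqrt(t / f(w)) stated in the 2013 word2vec paper, where f(w) is the word's relative frequency and t is a threshold; this formula is not given in the section above, and released implementations may differ in detail. The `min_count` cut-off for rare words and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def keep_probability(freq, threshold=1e-5):
    """Probability of keeping a word whose relative frequency is `freq`.

    Equivalent to discarding with probability 1 - sqrt(threshold / freq),
    clipped so that words rarer than the threshold are always kept.
    """
    return min(1.0, math.sqrt(threshold / freq))


def subsample(tokens, threshold=1e-5, min_count=1, seed=0):
    """Drop words rarer than `min_count` and randomly thin out frequent words."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:
        if counts[tok] < min_count:
            continue  # hypothetical low-frequency cut-off
        if rng.random() < keep_probability(counts[tok] / total, threshold):
            kept.append(tok)
    return kept


# Toy corpus; the large threshold is chosen only because the corpus is tiny.
tokens = ["the"] * 90 + ["cat"] * 5 + ["sat"] * 4 + ["zyzzyva"]
kept = subsample(tokens, threshold=0.05, min_count=2)
```

In this toy run the single-occurrence word is removed outright, the mid-frequency words are all kept, and only a fraction of the occurrences of "the" survive.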

=== Dimensionality ===

The quality of word embeddings increases with higher dimensionality, but after some point the marginal gain diminishes. Typically, the dimensionality of the vectors is set to be between 100 and 1,000.

=== Context window ===

The size of the context window determines how many words before and after a given word are included as context words of the given word. According to the authors' note, the recommended value is 10 for skip-gram and 5 for CBOW.
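
To make the role of the window size concrete, the sketch below enumerates the (center word, context word) pairs produced by a fixed symmetric window. The function name `context_pairs` is hypothetical, and a fixed window is a simplifying assumption; practical implementations may vary the effective window from position to position.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for a symmetric window of `window` words."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = context_pairs(sentence, window=1)
```

With `window=1` each interior word contributes two pairs and each edge word one, so the five-word sentence yields eight pairs; a larger window increases the number of pairs per word.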

== Extensions ==

There are a variety of extensions to word2vec.

=== doc2vec ===

doc2vec generates distributed representations of variable-length pieces of text, such as sentences, paragraphs, or entire documents. doc2vec has been implemented in the C, Python and Java/Scala tools (see below), with the Java and Python versions also supporting inference of document embeddings on new, unseen documents.

doc2vec estimates the distributed representations of documents much as word2vec estimates representations of words: doc2vec uses either of two model architectures, both of which are analogous to the architectures used in word2vec. The first, Distributed Memory Model of Paragraph Vectors (PV-DM), is identical to CBOW except that it also provides a unique document identifier as an additional piece of context. The second architecture, Distributed Bag of Words version of Paragraph Vector (PV-DBOW), is identical to the skip-gram model except that it attempts to predict the window of surrounding context words from the paragraph identifier instead of the current word.

doc2vec can also capture the semantic 'meanings' of additional pieces of 'context' around words; it can estimate semantic embeddings for speakers or speaker attributes, groups, and periods of time. For example, doc2vec has been used to estimate the political positions of political parties in various Congresses and Parliaments in the U.S. and U.K., respectively, and various governmental institutions.
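
The difference between the two architectures can be sketched by looking only at how training examples are formed, leaving out the neural network itself. In the sketch below, which is an illustration under stated assumptions and not the reference implementation, PV-DM pairs the document identifier and the context words with the current word as the prediction target, while PV-DBOW pairs the document identifier alone with each word. The function names and the identifier `DOC_0` are hypothetical.

```python
def pv_dm_examples(doc_id, tokens, window=2):
    """PV-DM: inputs are the document ID plus context words; target is the current word."""
    examples = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        context = tuple(tokens[j] for j in range(start, stop) if j != i)
        examples.append(((doc_id, context), target))
    return examples


def pv_dbow_examples(doc_id, tokens):
    """PV-DBOW: input is the document ID alone; targets are words from the document."""
    return [(doc_id, word) for word in tokens]


tokens = ["word", "vectors", "work"]
dm = pv_dm_examples("DOC_0", tokens, window=1)
dbow = pv_dbow_examples("DOC_0", tokens)
```

Here `dm[1]` is `(("DOC_0", ("word", "work")), "vectors")`, the CBOW-style example with the document identifier added to the context, whereas every entry of `dbow` uses only `"DOC_0"` as input.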

=== top2vec ===

Another extension of word2vec is top2vec, which leverages both document and word embeddings to estimate distributed representations of topics. top2vec takes document embeddings learned from a doc2vec model and reduces them to a lower dimension (typically using UMAP). The space of documents is then scanned using HDBSCAN, and clusters of similar documents are found. Next, the centroid of the documents identified in a cluster is taken to be that cluster's topic vector. Finally, top2vec searches the semantic space for word embeddings located near the topic vector to ascertain the 'meaning' of the topic. The word whose embedding is most similar to the topic vector might be assigned as the topic's title, whereas distant word embeddings may be considered unrelated.

As opposed to other topic models such as LDA, top2vec provides canonical 'distance' metrics between two topics, or between a topic and another embedding (word, document, or otherwise). Together with results from HDBSCAN, users can generate topic hierarchies, or groups of related topics and subtopics. Furthermore, a user can use the results of top2vec to infer the topics of out-of-sample documents: after inferring the embedding for a new document, one need only search the space of topics for the closest topic vector.
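
The last steps of this pipeline can be sketched without the UMAP and HDBSCAN stages, assuming that document embeddings and cluster labels are already available. The sketch below computes each cluster's centroid as its topic vector, ranks words by cosine similarity to a topic, and assigns a new document to its closest topic. All function names and the two-dimensional toy vectors are hypothetical; the label `-1` follows HDBSCAN's convention for noise points.

```python
import math


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def topic_vectors(doc_vectors, labels):
    """Map each cluster label to the centroid of its document vectors."""
    clusters = {}
    for vec, label in zip(doc_vectors, labels):
        if label == -1:  # skip documents that were not assigned to any cluster
            continue
        clusters.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in clusters.items()}


def nearest_words(topic, word_vectors, n=1):
    """Words whose embeddings are most similar to the topic vector."""
    ranked = sorted(word_vectors, key=lambda w: cosine(topic, word_vectors[w]), reverse=True)
    return ranked[:n]


def closest_topic(doc_vector, topics):
    """Label of the topic vector closest to a (possibly new) document embedding."""
    return max(topics, key=lambda label: cosine(doc_vector, topics[label]))


# Hypothetical toy embeddings and cluster labels.
doc_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [5.0, -5.0]]
labels = [0, 0, 1, 1, -1]
word_vectors = {"football": [1.0, 0.0], "election": [0.0, 1.0], "banana": [-1.0, -1.0]}

topics = topic_vectors(doc_vectors, labels)
title_0 = nearest_words(topics[0], word_vectors)
title_1 = nearest_words(topics[1], word_vectors)
new_doc_topic = closest_topic([0.8, 0.3], topics)
```

Because topics, words, and documents share one vector space, the same `cosine` function serves for naming a topic and for assigning an out-of-sample document to it.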

=== BioVectors ===

An extension of word vectors for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications has been proposed by Asgari and Mofrad. Named bio-vectors (BioVec) to refer to biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of machine learning in proteomics and genomics. The results suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns. A similar variant, dna2vec, has shown that there is a correlation between the Needleman–Wunsch similarity score and the cosine similarity of dna2vec word vectors.
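
Treating a biological sequence as text requires first splitting it into "words". The sketch below shows one simple way to do so, using overlapping n-grams of fixed length; the choice of overlapping windows, the default length of 3, and the function name `ngrams` are assumptions made for illustration, and the splitting scheme used in the published work may differ.

```python
def ngrams(sequence, n=3):
    """Split a sequence into overlapping n-grams, to be used as 'words'."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Hypothetical short amino-acid sequence, for illustration only.
protein_words = ngrams("MKVLAA", n=3)
```

The resulting list (`["MKV", "KVL", "VLA", "LAA"]` for this input) can then be fed to a word2vec-style model in place of a tokenized sentence.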