How to Use Words Co-Occurrence Statistics to Map Words to Vectors
In my earlier post, I had described a simplified view of how Word2Vec method uses neural learning to obtain vector representations for words given a large corpora. In this post I will describe another approach for generating high-dimensional representations of words via looking at co-occurrences of words. This method is claimed to be much faster as it avoids time consuming neural learning and yields comparable results.
Words Co-occurrence Statistics
Words co-occurrence statistics describes how words occur together that in turn captures the relationships between words. Words co-occurrence statistics is computed simply by counting how two or more words occur together in a given corpus. As an example of words co-occurrence, consider a corpus consisting of the following documents:
penny wise and pound foolish
a penny saved is a penny earned
Letting count(w_{next}|w_{current}) represent how many times word w_{next} follows the word w_{current}, we can summarize co-occurrence statistics for words “a” and “penny” as:
a | and | earned | foolish | is | penny | pound | saved | wise | |
a | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
penny | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
The above table shows that “a” is followed twice by “penny” while words “earned”, “saved”, and “wise” each follows “penny” once in our corpus. Thus, “earned” is one out of three times probable to appear after “penny.” The count shown above is called bigram frequency; it looks into only the next word from a current word. Given a corpus of N words, we need a table of size NxN to represent bigram frequencies of all possible word-pairs. Such a table is highly sparse as most frequencies are equal to zero. In practice, the co-occurrence counts are converted to probabilities. This results in row entries for each row adding up to one in the co-occurrence matrix.
The concept of looking into words co-occurrences can be extended in many ways. For example, we may count how many times a sequence of three words occurs together to generate trigram frequencies. We may even count how many times a pair of words occurs together in sentences irrespective of their positions in sentences. Such occurrences are called skip-bigram frequencies. Because of such variations in how co-occurrences are specified, these methods in general are known as n-gram methods. The term context window is often used to specify the co-occurrence relationship. For bigrams, the context window is asymmetrical one word long to the right of the current word in co-occurrence counting. For trigrams, it is asymmetrical and two words long. In words to vector conversion approach via co-occurrence, it turns out that a symmetrical context window looking at one preceding word and one following word for computing bigram frequencies gives better word vectors.
Hellinger Distance
Hellinger distance is a measure of similarity between two probability distributions. Given two discrete probability distributions P = (p_{1}, . . . , p_{k}) and Q = (q_{1}, . . . , q_{k}), the Hellinger distance H(P, Q) between the distributions is defined as:
\(H(P, Q) = \frac{1}{\sqrt{2}}\sqrt{{\displaystyle\sum_{i=1}^k}(\sqrt{p_i}-\sqrt{q_i})^2}\)
Hellinger distance is a metric satisfying triangle inequality. The reason for including \(\sqrt(2)\) in the definition of Hellinger distance is to ensure that the distance value is always between 0 and 1. When comparing a pair of discrete probability distributions the Hellinger distance is preferred because P and Q are vectors of unit length as per Hellinger scale.
Word Vectors Generation
The approach in a nutshell is to create an initial vector representation for each word using the co-occurrences of a subset of words from the corpus. This subset of words is called context words. The initial vectors produced in this manner have dimensionality of few thousands with an underlying assumption that the meaning of each word can be captured through its co-occurrence pattern with context words. Next, a dimensionality reduction via principal component analysis (PCA) is carried out to generate final word vectors of low dimensionality, typically 50 or 100 dimensions. The diagram below outlines the different stages of the word vector generation process.
Starting with a given corpus, the first step is to build a dictionary. It is preferable to select only those lower case words for inclusion in the dictionary that occur beyond certain number of times. Also to deal with numbers, it is a good idea to replace all numbers with a special character string representing all numbers. This step generally leads to a manageable dictionary size, \(N_D\), of few hundreds of thousands of words. The next step is to create a list of context words. These are the words only whose appearance in the context window is counted to generate co-occurrence matrix. In terms of pattern recognition analogy, one can view context words as raw features whose presence at different locations (dictionary words) is being detected. Looking at our small corpus example, let us consider “earned”, “foolish”, “penny”, and “saved” as context words. In that case, the bigram frequencies for words “a” and “penny” would be given by the following table:
earned | foolish | penny | saved | |
a | 0 | 0 | 1 | 0 |
penny | 1 | 0 | 0 | 1 |
A typical choice for context words is to choose top 5 to 10% most frequent words resulting in \(N_C\) words. The next step is to calculate \(N_DxN_C\) matrix of co-occurrence probabilities. Typically, a symmetrical window of size three centered at the current word is used to count co-occurrences of context words appearing within the window. The SQRT(2) operation consists of simply dividing every co-occurrence matrix entry by \(\sqrt(2)\) to ensure the range of Hellinger distance between 0-1. The final step is to apply principal component analysis (PCA) to minimize the reconstruction error measured in terms of Hellinger distance between the original and reconstructed vectors. Since the co-occurrence matrix is not a square matrix, PCA is applied via singular value decomposition (SVD) to extract top 50 or 100 eigenvalues and eigenvectors and transform the raw vectors of the co-occurrence matrix to generate final word vectors. See an example in the comments section.
How good are Word Embeddings?
You can check how good is the above approach for word embedding by finding the top 10 nearest neighbors of a word that you can input at the following site:
For example, I got the following list of nearest neighbors for “baseball.” Pretty neat!
The approach briefly described above is not the only approach for word embeddings via co-occurrence statistics. There are few other methods as well as a website where you can compare the performance of different word embedding methods as well as make a two-dimensional mapping of a set of words based on their similarity. One such plot is shown below.
An Interesting Application of Word Embeddings
The work on word embeddings by several researchers has shown that word embeddings well capture word semantics giving rise to searches where concepts can be modified via adding or subtracting word vectors to obtain more meaningful results. One interesting application along these lines is being developed by Stitch Fix Technology for online shopping wherein the vector representations of merchandise is modified by word vectors of keywords to generate recommendations for customers as illustrated by the following example.
We can expect many more applications of word embeddings in near future as researchers try to integrate word vectors with similar representations for images, videos, and other digital artifacts.
Hi, Krishan!
Your website design give me a comfortable experience, it looks like a paper!
But what I want to know is that what the “Dictionary” mean in word vector generation process, is it mean the stemmed text or something else?
And it will be great pleasure if could provide me the original paper!
Thank you~
Dictionary is all the words in the corpus after stemming. There is no paper on this topic by me. All sources used in the write-up are linked. So you can check there for further details.
Hi Krishan: I hope you can give me some pointers here..Learning word2vec/doc2vec resources it looks like the corpus has to be english words with no numbers and special symbols. I am actually trying to see if doc2vec can be used to predict related technical cases in a technical support case corpus. The corpus has a lot of error codes(numbers) and stack trace like “com.java.sdk…” etc etc All the examples I am seeing are taking out the numbers, punctuation, any symbols and coming up with just pure words( after getting rid of stop words)..I am seeing different results with no cleaning of text, cleaning and ending up with just english words etc.. I have just started but probably I am little confused at this stage..Is it mandatory that the text does not contain numbers? Does it make a difference in vector representation if we encode “com.java.sdk…” as one word ( does that make sense at all in VSM) or split into com, java, sdk as separate words..Any inputs are appreciated..Thanks
Hi Prasad,
Thanks for visiting my blog. The whole idea of word2vec is to capture meanings and contexts of words. Since there is no ambiguity in what a number means, there is no need for vector representation for numbers. That is why we filter them out. Insofar, as encoding com.java.sdk as one word or three words is concerned, it might depend upon what you are planning to do with your representation. Java as a stand alone word implies programming as well as coffee and a place in Indonesia. Thus context in this case will be important. On the other hand, com.java.sdk has no ambiguity and might work well if the corpus is technical.
Hope I have been able to provide some help/clarification. Good luck with your project!
Krishan
I am referring this statement in your article ie: “In practice, the co-occurrence counts are converted to probabilities. ” Can you please elaborate it with an example.
The probabilities are calculated by first taking the total of all elements of the co-occurrence matrix. Next you divide each co-occurrence matrix element by the total to get probabilities.
Hi Krishan,
Please excuse my ignorance. But you have the dictionary words in the rows, the context words in the column. Lets say it is M by N matrix and we call it Q. After doing the SVD and keeping, lets say, P non-zero singular values and their corresponding left and right Eigen vectors, you end up with a new matrix; lets call it Q2.
Now how do you use the Hellinger distance to find the closets neighbors of the word Baseball?
Is Baseball part of the dictionary words, and you take the row corresponding to Baseball and Q2 and find the nearest Rows to it using Hellinger?
Or is it in the context words and you do that above with the column corresponding to Baseball in Q2?
Thanks in advance for your reply. E
You compare the row representing Baseball with other rows after SVD to find top ten or twenty closet words. You do this by using the Hellinger distance except the square root by two part because that is already included in getting SVD.
hi,
first thank you a lot
then, could you give us an example to explain what really happen
like your previous post
thanks.
I will try to give a simple example. Let us say that we have determined the cooccurence matrix that looks like as below. In this matrix the counts have been converted to probabilities so that sum of probabilities along each row equal 1.
breeds computing cover food is meat named of
cat 0.04 0.00 0.00 0.13 0.53 0.02 0.18 0.10
dog 0.11 0.00 0.00 0.12 0.39 0.06 0.15 0.17
cloud 0.00 0.29 0.19 0.00 0.12 0.00 0.00 0.40
Next, we take the square root of every cooccurence matrix entry to use the Hellinger distance (See the post). Thus the above matrix reduces to:
breeds computing cover food is meat named of
cat 0.20 0.00 0.00 0.36 0.73 0.14 0.42 0.32
dog 0.33 0.00 0.00 0.35 0.62 0.24 0.39 0.41
cloud 0.00 0.54 0.44 0.00 0.35 0.00 0.00 0.63
We now apply SVD(PCA) on this. The SVD operation yields a diagonal matrix of singular values, a matrix of left singular vectors (U matrix) and a matrix of right singular vectors. These are shown below:
Diagonal matrix of singular values
[,1] [,2] [,3]
[1,] 1.5 0.00 0.00
[2,] 0.0 0.82 0.00
[3,] 0.0 0.00 0.16
U matrix of left singular vectors
[,1] [,2] [,3]
[1,] -0.63 -0.34 -0.701
[2,] -0.63 -0.30 0.712
[3,] -0.45 0.89 -0.023
Note that columns of U matrix are orthogonal to each other; you can verify it by taking dot product between any column pair.
The V matrix of right singular vectors
[,1] [,2] [,3]
[1,] -0.22 -0.20 0.615
[2,] -0.16 0.59 -0.081
[3,] -0.13 0.48 -0.065
[4,] -0.29 -0.27 -0.039
[5,] -0.66 -0.15 -0.472
[6,] -0.16 -0.15 0.483
[7,] -0.34 -0.32 -0.138
[8,] -0.49 0.41 0.366
The columns of this matrix are also orthogonal. At this point, we have mapped all words in a 3-dimensional space. Each row of U matrix is a 3-dimensional vector representation of word “cat”, “dog”, and “cloud.” In the same manner each row of V matrix is a 3-dimensional vector representation for respective context words. Typically in a practical setting, we will have 10,000 or higher dimensional space and we would want that space to be reduced to 50 or 100 dimensions to obtain word vectors. In the present example, lets say we want to reduce the dimensionality to 2 by discarding the smallest singular value. To obtain the reduce representation, all we have to do is to set the third singular value to 0 in the diagonal matrix and calculate the matrix product of U and modified diagonal matrix. This will give the following result after discarding the third column of all zeros:
Dim1 Dim2
cat -0.96 -0.27
dog -0.96 -0.25
cloud -0.68 0.73
In a similar fashion, we can perform multiplication of V matrix with modified diagonal matrix to get the following representation for context words:
breeds computing cover food is meat named of
Dim1 -0.34 -0.24 -0.20 -0.45 -1.01 -0.24 -0.51 -0.74
Dim2 -0.17 0.48 0.39 -0.22 -0.12 -0.12 -0.26 0.33
The above sequence of calculations illustrate the basic steps that are performed to generate vector representations based on cooccurences of words.
Hi. Thank you for this article. I have a few questions:
Q1: After you applied SVD on the cooccurence matrix the space in which the words were represented had three dimensions/ Why is that? On what factor does the number of resulting dimensions depend?
Q2: What happens if I want the representation of a word that is nor in the initial dictionary?
Q3: If the context is represented by the first 5-10% of words in the dictionary I think it is highly probable that many of them would be articles, prepositions and conjuctions which are not that relevant. Am I right and if so, how do you propose to address this problem?
Thank you again and have a beautiful day!
Below are answers to your questions.
Q1. We got three dimensions in the example because our co-occurrence matrix is of size 3 x 8. If it was p x q, where p is smaller than q, then it would be a p-dimensional representation
Q2. If a word is not in the initial dictionary, it is possible to map it in the representation in a manner similar to what is done in latent semantic indexing. You can find information on this on page 563 in “Foundations of Statistical Natural Language Processing” by Manning and Schutze
Q3. Context is represented by surrounding words. Words representing articles, prepositions and conjunctions can be (and usually are) filtered out using stop-word filtering.
Hope I have been of some help. Good luck!
hi! Great post!
Can you explain how to create symmetrical context windows for co-occurrence matrices?
Also, can you help to explain the relation between the normalized co-occurrence matrix and the covariance matrix?
Thanks. I am glad you liked it. A symmetrical context window will have equal number of even words on both sides of the focus word and you will count the joint occurrences of the words in the window with the word at the center of the window.
A covariance matrix captures the relationships between pairs of variables. For example, it captures whether one variable value rises/falls/unchanged with respect to another variable. A co-occurrence matrix on the other hand simply captures whether two words/events occur together or not.
Hey great read! Is there a code snippet for this so as to make the understanding solid?
Glad you liked it. I don’t have a code snippet but if you scroll down the comments, I have a numerical example that might help.