Abstract

This paper describes a new approach for estimating term weights in a document, and shows how the new weighting scheme can be used to improve the accuracy of a text classifier. The method uses term co-occurrence as a measure of dependency between word features. A random-walk model is applied on a graph encoding words and co-occurrence dependencies, resulting in scores that quantify how much a particular word feature contributes to a given context. Experiments performed on three standard classification datasets show that the new random-walk based approach outperforms the traditional term frequency approach to feature weighting.

Introduction

Term frequency has long been used as a major factor for estimating the probabilistic distribution of features in a document, and it has been employed in a broad spectrum of tasks including language modeling [18], feature selection [29, 24], and term weighting [13, 20]. The main drawback of the term frequency method is that it relies on a bag-of-words approach, which assumes feature independence and disregards any dependencies that may exist between words in the text. In other words, it defines a "random choice," where the weight of a term is proportional to the probability of choosing the term at random from the set of terms that constitute the text.

Such an approach might be effective for capturing the relevance of a term in a local context, but it fails to account for the global effect that the term's presence exerts on the entire text segment. We argue that the bag-of-words model may not be the best technique for capturing term importance. Instead, given that relations in the text can be preserved by maintaining a structural representation of the text, a method that takes into account the structural properties of the context could lead to a better term weighting scheme. Previous work has shown that a higher, albeit more costly, performance can be achieved by incorporating such dependencies [22].

In this paper we introduce a system that models the weighting problem as a "random walk" rather than a "random choice." We assume an imaginary reader (or "walker") who steps through the text on a term-by-term basis. In this setting, the importance of a term is determined by the probability that the random walker encounters the target term in the text during the walk. The new measure of term weighting integrates both the locality of a term and its relation to the surrounding context. We model this local contribution using a co-occurrence relation in which terms that co-occur in a certain context are likely to share some of their importance (or significance). Note that in this model the relation between a given term and its context is not linear: a given term relates to a context, and the context, in turn, relates to a collection of terms. To model this recursive relation, we use a graph-based ranking algorithm, namely the PageRank random-walk algorithm [2], and its TextRank adaptation to text processing [15].
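The "random choice" interpretation of term frequency can be made concrete with a short sketch (our own illustration; the sample tokens are invented): the weight of a term is simply the probability of drawing it uniformly at random from the tokens of the text.

```python
from collections import Counter

def random_choice_weights(tokens):
    """Weight of each term = probability of drawing it uniformly at
    random from the tokens of the text, i.e. its normalized frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# invented toy text: "sugar" occurs 3 of 7 times, so its weight is 3/7
tokens = "sugar operator sold sugar cargoes sold sugar".split()
weights = random_choice_weights(tokens)
```

Note that this weighting is entirely local: it cannot distinguish a term that connects many parts of the text from one that occurs the same number of times in isolation, which is precisely the limitation the random-walk model addresses.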

In this paper, we show how TextRank can be used to model the probabilistic distribution of word features in a document. Through experiments performed on a text classification task, we show that the random-walk scores outperform the traditional term frequencies typically used to model feature weights for this task. In the following, we first overview the basic principles behind random-walk algorithms, and briefly describe the TextRank application to text processing. We then show how these random-walk models can be adapted to term weighting, and demonstrate that the new weighting scheme can be used to significantly improve the accuracy of a text classification system, as compared to the traditional term frequency weighting scheme. Finally, we conclude with a discussion and directions for future work.

The basic idea implemented by a random-walk algorithm is that of "voting" or "recommendation." When one vertex links to another, it is essentially casting a vote for that other vertex. The higher the number of votes cast for a vertex, the higher the importance of the vertex. Moreover, the importance of the vertex casting a vote determines how important the vote itself is, and this information is also taken into account by the ranking algorithm. While several random-walk algorithms have been proposed in the past, we focus on one such algorithm, namely PageRank [2], as it was previously found successful in a number of applications, including Web link analysis [2], social networks [8], citation analysis, and, more recently, several text processing applications [15, 9].

Given a graph G = (V, E), let In(Va) be the set of vertices that point to vertex Va (predecessors), and Out(Va) be the set of vertices that Va points to (successors). The PageRank score associated with vertex Va is defined by a recursive function that integrates the scores of its predecessors:

    S(Va) = (1 − d) + d · Σ_{Vb ∈ In(Va)} S(Vb) / |Out(Vb)|        (1)

where d is a parameter set between 0 and 1.¹ The score of each vertex is recalculated at each iteration, based on the new weights accumulated by the neighboring vertices. The algorithm terminates when the convergence point is reached for all vertices, meaning that the error for each vertex falls below a predefined threshold. This vertex scoring scheme is based on a random-walk model, in which a walker takes random steps on the graph, with the walk modeled as a Markov process. Under certain conditions (the graph is aperiodic and irreducible), the model is guaranteed to converge to a stationary distribution of probabilities over the vertices of the graph [10]. Intuitively, the stationary probability associated with a vertex represents the probability of finding the walker at that vertex during the random walk, and thus represents the importance of the vertex within the graph.

Particularly relevant for our work is the application of random walks to text processing, as done in the TextRank system [15]. TextRank has been successfully applied to three natural language processing tasks: document summarization, word sense disambiguation, and keyword extraction, with results competitive with those of state-of-the-art systems. The strength of the model lies in its global representation of the context and its ability to model how the co-occurrence between features propagates across the context and affects other, distant features. Our approach follows steps similar to those of the TextRank keyword extraction application, which derives term weights using a graph representation that accounts for the co-occurrence dependencies between words in the text. We are, however, incorporating a larger number of lexical units, and use different window sizes, as we show in the following section.

¹ The typical value for d is 0.85 [2], and this is the value we use in our implementation.

3   Random-Walks for Term Weighting

Starting with a given document, we determine a ranking over the words in the document by using the following models.

3.1   Random-walk Models

In our work, we experimented with several variations of PageRank that incorporate additional information and variables into the traditional version shown in Equation 1. We summarize the best PageRank-based term ranking models as follows:

• rwo: the basic or original model, as described in Equation 1, in which we use an undirected graph with a constant damping factor, adhering strictly to the traditional PageRank formula.

• rwe.idf: an undirected graph approach that uses the weighted-edge version of PageRank with a variable damping factor. The edge weight is calculated by the following formula:

    E_{V1,V2} = tf.idf(V1) · tf.idf(V2)        (2)

where E_{V1,V2} is the edge connecting V1 to V2, and tf.idf represents the term frequency multiplied by the inverse document frequency. The damping factor is expressed as a function of the incoming edge's weight, calculated as follows:

    d(E_{V1,V2}) = E_{V1,V2} / Emax        (3)

where d(E_{V1,V2}) is the damping function and Emax is the highest edge weight in the graph. The resulting node ranking formula is:

    S(Va) = (1 − d)/|N| + C · Σ_{Vb ∈ In(Va)} d(E_{Vb,Va}) · S(Vb) / |Out(Vb)|        (4)

where N is the total number of nodes in the graph, d is the damping constant, and C is a scaling constant.² To address the cases where a node has no incoming edges, we set the vertex score in our experiments to Vmin = (1 − d)/N. The model biases the random walker toward nodes with stronger edges over nodes with weaker edges.

• rwe.oc: similar to the above approach, except that the damping factor for an edge is estimated in terms of the bigram co-occurrence frequency of the two nodes connected by the edge (Equation 5). For example, if the bigram "free software" occurs four times in a document, then the weight of the edge connecting "free" and "software" is four.

    E_{V1,V2} = tf(V1 V2)        (5)

² C is a scaling constant, set to 0.95.
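As an illustration, the iteration defined by Equation 1 can be sketched as follows (the toy graph and its node names are our own; the damping factor, convergence threshold, and initial node weight follow the values stated elsewhere in the paper):

```python
def pagerank(graph, d=0.85, threshold=0.0001, max_iter=100):
    """Iterate Equation (1): S(Va) = (1 - d) + d * sum over predecessors
    of S(Vb) / |Out(Vb)|.  For an undirected word graph, In(V) and Out(V)
    are both simply the neighbor set, so |Out(Vb)| is Vb's degree."""
    scores = {v: 0.25 for v in graph}  # initial node weight
    for _ in range(max_iter):
        new = {v: (1 - d) + d * sum(scores[u] / len(graph[u])
                                    for u in graph[v])
               for v in graph}
        converged = all(abs(new[v] - scores[v]) < threshold for v in graph)
        scores = new
        if converged:
            break
    return scores

# toy undirected graph: "sugar" is the best-connected node
g = {"sugar": {"sold", "operator", "white"},
     "sold": {"sugar", "operator"},
     "operator": {"sugar", "sold"},
     "white": {"sugar"}}
ranks = pagerank(g)
```

Because "sugar" receives contributions from three neighbors while "white" receives only one, the stationary scores rank "sugar" highest, mirroring the intuition that well-connected terms accumulate importance.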

3.2   Term Weighting

Given a document, the following steps are applied to derive a weight associated with the words in the text. First, the document is tokenized for punctuation, special symbols, and word abbreviations. Common words are also removed, using a list of approximately 500 frequently used words.³ Next, the resulting text is processed to extract both term frequency (tf) and random-walk (rw) weights for each term in the document. Note that we do not apply any syntactic filters, as was previously done in applications of TextRank; instead, we consider each word a potential feature. To determine tf, we simply count the frequency of each word in the document. To determine rw, all the terms are added as vertices in a graph representing the document. A co-occurrence scanner is then applied to the text to relate the terms that co-occur within a given window size. For a given term, all the terms that fall in its vicinity are considered dependent terms, and this dependency is represented by a set of edges connecting the term to all the other terms in the window. Experiments are performed for window sizes of 2, 4, 6, and 8. Once the graph is constructed and the edges are in place, the random-walk algorithm is applied.⁴ The result of the ranking is a list of all the input terms and their corresponding rw scores.

³ We use the list of common words distributed with the Smart system: ftp://ftp.cs.cornell.edu/pub/smart.
⁴ Unless otherwise stated, throughout this paper we refer to a random-walk implementation where the damping factor is set to 0.85 and the convergence threshold to 0.0001. Each graph node is assigned an initial weight of 0.25.

3.3   An Example

To understand why the rw weights might be a good replacement for the traditional tf weights, consider the example in Figure 2, which shows a sample document. Starting with this text, a graph is constructed as follows. If a term has not been previously seen, a node is added to the graph to represent it; a term can be represented by only one node in the graph. An undirected edge is drawn between two nodes if they co-occur within a certain window size. Figure 3 shows the graph constructed for this text, assuming a window size of 2, corresponding to two consecutive terms in the text (e.g. London is linked to based).

Figure 1. Convergence graphs for the random-walk models

"London-based sugar operator Kaines Ltd confirmed it sold two cargoes of white sugar to India out of an estimated overall sales total of four or five cargoes in which other brokers participated. The sugar, for April/May and April/June shipment, was sold at between 214 and 218 dlrs a tonne cif, it said."

Figure 2. Sample document

[Figure 3 shows the co-occurrence graph built over the content words of the sample document, e.g. London, based, sugar, operator, Kaines, confirmed, sold, cargoes, shipment, dlrs, tonne, cif.]

Figure 3. Sample graph

Term        rw      tf        Term           rw     tf
sugar       16.88   3         participated   3.87   1
sold        14.15   2         april          3.87   2
based        7.39   1         india          1.00   1
confirmed    6.90   1         estimated      1.00   1
Kaines       6.81   1         sales          1.00   1
operator     6.76   1         total          1.00   1
London       4.14   1         brokers        1.00   1
cargoes      4.01   2         may            1.00   1
shipment     4.01   1         june           1.00   1
dlrs         4.01   1         tonne          1.00   1
white        3.87   1         cif            1.00   1
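The graph-construction step described in Section 3.2 can be sketched as follows (a minimal illustration of ours; the tokenizer and the tiny stop-word set are simplifying assumptions, not the Smart list used in the paper):

```python
def cooccurrence_graph(tokens, window=2, stopwords=frozenset()):
    """One node per distinct term; an undirected edge links two terms
    whenever they co-occur inside a sliding window of the given size."""
    terms = [t.lower() for t in tokens if t.lower() not in stopwords]
    graph = {t: set() for t in terms}
    for i in range(len(terms)):
        # terms at positions i+1 .. i+window-1 fall inside the window
        for j in range(i + 1, min(i + window, len(terms))):
            if terms[i] != terms[j]:
                graph[terms[i]].add(terms[j])
                graph[terms[j]].add(terms[i])
    return graph

text = "London based sugar operator Kaines Ltd confirmed it sold two cargoes"
g = cooccurrence_graph(text.split(), window=2, stopwords={"it"})
```

With window size 2, only surviving consecutive terms are linked (e.g. london -- based), matching the construction of Figure 3; removing a stop word such as "it" makes its two neighbors adjacent, so confirmed becomes linked to sold.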

Table 1. tf & rw scores for a sample text

After the graph is constructed, the random-walk model is applied on the graph, resulting in a set of scores associated with the vertices (words). Table 1 shows the resulting tf and rw weights. By analyzing the rw weights, we can observe a non-linear correlation with the tf weights, with an emphasis given to terms surrounding important key terms such as sugar or cargoes. This spatial locality results in higher ranks for terms like operator compared to terms like london.⁵

⁵ The missing words, e.g. Ltd and it, which are not shown in the graph, are common words that were eliminated during pre-processing.

To evaluate the random-walk based approach to feature weighting, we integrate it into a text classification algorithm and evaluate its performance on several standard text classification datasets. Text classification is a problem typically formulated as a machine learning task, where a classifier learns to distinguish between the categories in a given set by using features automatically extracted from a collection of training documents. We use the tf and rw feature weights (and their alternatives, as described below) to create feature vectors for the Support Vector Machines (SVM) and Naïve Bayes classifiers. Following standard practice in term weighting for a Rocchio classifier, we use the tf.idf and rw.idf feature weights in our initial evaluation of the models (Tables 2, 3, 5, 6). In the final experiments (Tables 7 and 8) we report tf and tf.idf results separately. The results obtained using tf act as a baseline for all the evaluations.

Text Classifiers

We compare the results obtained with three frequently used text classifiers – Rocchio, Naïve Bayes, and SVM – selected based on their performance and the diversity of their learning methodologies.

Naïve Bayes. The basic idea in a Naïve Bayes text classifier is to estimate the probability of a category given a document using joint probabilities of words and documents. Naïve Bayes text classifiers assume word independence, but despite this simplification they were shown to perform surprisingly well [11, 23].

Rocchio. The Rocchio text classification method uses standard term-weighted vectors to represent documents, and builds a prototype vector for each category by summing up the vectors of the training documents in that category. Test documents are then assigned to the category with the closest prototype vector, based on cosine similarity. Classification experiments with different versions of the algorithm showed competitive results on standard benchmarks [11, 16].

SVM. SVM [27] is a state-of-the-art machine learning approach based on decision planes. The algorithm defines the best hyperplane that separates the sets of points associated with different class labels with a maximum margin. Unlabeled examples are then classified by deciding on which side of the hyperplane they reside. In our evaluations we use SVMTorch [5] with a linear kernel, since it proved to be as powerful as other kernels in text classification experiments [28]. This SVM implementation was also observed to be the fastest when compared to SVMlib and Weka's SMO [12].
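As an illustration of the Rocchio scheme described above, a minimal prototype-based classifier might look as follows (a sketch, not the evaluated implementation; the toy categories, terms, and weights are invented for the example):

```python
import math
from collections import defaultdict

def rocchio_train(docs):
    """Build one prototype vector per category by summing the
    (term -> weight) vectors of that category's training documents."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for label, vec in docs:
        for term, w in vec.items():
            prototypes[label][term] += w
    return prototypes

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_classify(prototypes, vec):
    """Assign a test document to the category with the closest prototype."""
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))

# invented two-category training set with term weights (tf, rw, or tf.idf)
train = [("crypto", {"key": 2.0, "cipher": 1.0}),
         ("space", {"orbit": 3.0, "launch": 1.0})]
protos = rocchio_train(train)
label = rocchio_classify(protos, {"cipher": 1.0, "key": 1.0})  # -> "crypto"
```

The same vectors could be filled with tf, rw, tf.idf, or rw.idf weights, which is exactly the substitution the evaluation varies.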

We use three standard datasets – WebKB, LingSpam, and 20Newsgroups – commonly used in text classification evaluations [26, 1, 23].

WebKB is a dataset collected from the computer science departments of various universities. It contains seven class labels: Project, Student, Department, Faculty, Staff, Course, and Other. The Other label was removed from the dataset for evaluation purposes. Most evaluations in the literature have been performed on only four of the categories (Project, Student, Faculty, and Course), since they are the largest. However, since we wanted to see how our system behaves when only a few training examples are available, we also considered the Staff and Department classes, which have only a few training documents each. We performed our evaluations on two versions of WebKB: one with four categories (WebKB4) and one with six categories (WebKB6).

20Newsgroups is a collection of 20,000 messages from 20 newsgroups, corresponding to different topics or subjects. Each newsgroup has about 1,000 messages, split into 400 test and 600 training documents.

LingSpam is a spam corpus [1], consisting of email messages organized in 10 sets to allow for 10-fold cross-validation. Each set has roughly 300 spam and legitimate messages. There are four versions of the corpus: bare, stop-word filtered, lemmatized, and both stop-word filtered and lemmatized. We use the bare collection with standard 10-fold cross-validation.
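The 10-fold protocol used for LingSpam can be sketched with a generic fold splitter (our own illustration; the actual corpus is distributed already partitioned into 10 sets, so this only shows the train/test rotation):

```python
def kfold_indices(n_docs, k=10):
    """Partition document indices 0..n_docs-1 into k disjoint test folds;
    for each fold, the remaining indices form the training set."""
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        test_set = set(test)
        train = [j for j in range(n_docs) if j not in test_set]
        splits.append((train, test))
    return splits

# 10 sets of roughly 300 messages each, as in LingSpam
splits = kfold_indices(3000, k=10)
```

Each document appears in exactly one test fold, so reported accuracy is an average over ten train/test rotations.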