Title: | Latent Semantic Analysis |
---|---|
Description: | The basic idea of latent semantic analysis (LSA) is that texts have a higher-order (= latent semantic) structure which, however, is obscured by word usage (e.g., through the use of synonyms or polysemy). By using conceptual indices derived statistically via a truncated singular value decomposition (a two-mode factor analysis) of a given document-term matrix, this variability problem can be overcome. |
Authors: | Fridolin Wild |
Maintainer: | Fridolin Wild <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.73.3 |
Built: | 2025-02-13 04:57:57 UTC |
Source: | https://github.com/cran/lsa |
This character string contains a regular expression for use in gsub (deployed in textvector) that identifies all alphanumeric characters, including language-specific special characters not covered by [:alnum:] (currently only those found in German and Polish). You can use this expression by loading it with data(alnumx).
data(alnumx)
Vector of type character.
Fridolin Wild [email protected]
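A minimal sketch of how the expression might be used to clean a string, assuming (as in textvector) that alnumx can be embedded in a negated bracket expression to blank out every character it does not cover:

data(alnumx)
txt = "Grüße, world! 42"
# replace everything that is not an (extended) alphanumeric character
gsub(paste("[^", alnumx, "]", sep=""), " ", txt)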
Returns a latent semantic space (created by lsa()) in textmatrix format: rows are terms, columns are documents.
as.textmatrix( LSAspace )
LSAspace |
a latent semantic space generated by lsa(). |
To allow comparisons between terms and documents, the internal
format of the latent semantic space needs to be converted to
a classical document-term matrix (just like the ones generated by
textmatrix()
that are of class ‘textmatrix’).
Remark: There are other ways to compare documents and terms using the partial matrices from an LSA space directly. See (Berry, 1995) for more information.
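For instance, term-to-term comparisons can be carried out on the scaled term matrix alone; a minimal sketch, assuming a space created as in the Examples below and the documented component names tk and sk:

# term vectors scaled by the singular values span the term space
tks = myLSAspace$tk %*% diag(myLSAspace$sk, nrow=length(myLSAspace$sk))
# cosine between two term vectors, without expanding the full matrix
cosine(tks["dog",], tks["cat",])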
textmatrix |
a textmatrix representation of the latent semantic space. |
Fridolin Wild [email protected]
Berry, M., Dumais, S., and O'Brien, G (1995) Using Linear Algebra for Intelligent Information Retrieval. In: SIAM Review, Vol. 37(4), pp.573–595.
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/"))
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/"))
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("dog", "mouse", "dog"), file=paste(td, "D4", sep="/"))

# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)

# create the latent semantic space (without dimension reduction)
myLSAspace = lsa(myMatrix, dims=dimcalc_raw())

# display it as a textmatrix again
round(as.textmatrix(myLSAspace), 2) # should give the original

# create the latent semantic space (with dimension reduction)
myLSAspace = lsa(myMatrix, dims=dimcalc_share())

# display it as a textmatrix again
myNewMatrix = as.textmatrix(myLSAspace)
myNewMatrix # should look different!

# compare two terms with the cosine measure
cosine(myNewMatrix["dog",], myNewMatrix["cat",])

# compare two documents with pearson
cor(myNewMatrix[,1], myNewMatrix[,2], method="pearson")

# clean up
unlink(td, recursive=TRUE)
Returns those terms above a threshold close to the input term, sorted in descending order of their closeness. Alternatively, all terms and their closeness values can be returned, sorted in descending order.
associate(textmatrix, term, measure = "cosine", threshold = 0.7)
textmatrix |
A document-term matrix. |
term |
The stimulus 'word'. |
measure |
The closeness measure to choose (Pearson, Spearman, or cosine). |
threshold |
Terms with a closeness above this threshold will be returned. |
Internally, a complete term-to-term similarity table is calculated, denoting the closeness (calculated with the specified measure) in its cells. All terms with a closeness above the specified threshold are returned, sorted in descending order of their closeness values. Select a threshold of 0 to get all terms.
termlist |
A named vector of closeness values (terms as labels, sorted in descending order). |
Fridolin Wild [email protected]
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/"))
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/"))
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("dog", "mouse", "dog"), file=paste(td, "D4", sep="/"))

# create matrices
myMatrix = textmatrix(td, minWordLength=1)
myLSAspace = lsa(myMatrix, dims=dimcalc_share())
myNewMatrix = as.textmatrix(myLSAspace)

# calc associations for mouse
associate(myNewMatrix, "mouse")

# clean up
unlink(td, recursive=TRUE)
These data sets contain example corpora for essay scoring. A training textmatrix contains files to construct a latent semantic space apt for grading the student essays provided in the essay textmatrix. A separate data set records the original scores with which a human assessor graded the student essays. The corpora (and human scores) can be loaded by calling data(corpus_training), data(corpus_essays), or data(corpus_scores). The objects must already exist before being handed over to, e.g., lsa().
data(corpus_training)
data(corpus_essays)
data(corpus_scores)
Corpora: textmatrix; Scores: table.
Fridolin Wild [email protected]
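A sketch of one possible grading pipeline with these data sets; this is illustrative only and assumes that the essay matrix shares the training vocabulary (so it can be folded in) and that the scores can be coerced to a numeric vector aligned with the essay columns:

data(corpus_training)
data(corpus_essays)
data(corpus_scores)

# span a space over the training corpus and map the essays into it
space = lsa(corpus_training, dims=dimcalc_share())
essays = fold_in(corpus_essays, space)

# one simple heuristic: score each essay by its average closeness
# to all essays, then compare with the human scores
machine = colMeans(cosine(essays))
human = as.numeric(unlist(corpus_scores)) # assumed to align with essay columns
cor(machine, human, method="spearman")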
Calculates the cosine measure between two vectors or between all column vectors of a matrix.
cosine(x, y = NULL)
x |
A vector or a matrix (e.g., a document-term matrix). |
y |
Optional: a vector with dimensions compatible to x. |
cosine()
calculates a similarity matrix between all column
vectors of a matrix x
. This matrix might be a document-term
matrix, so columns would be expected to be documents and
rows to be terms.
When executed on two vectors x
and y
,
cosine()
calculates the cosine similarity between them.
Returns a similarity matrix of cosine values, comparing all
column vectors against each other. Executed on two vectors, their
cosine similarity value is returned.
The cosine measure is nearly identical to the Pearson correlation coefficient (besides a constant factor), cf. cor(method="pearson").
For an investigation of the differences in the context of text mining, see (Leydesdorff, 2005).
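One way to see the connection: the Pearson correlation of two vectors equals the cosine of the same vectors after mean-centering. A quick check:

x = c(1, 3, 2, 5, 4)
y = c(2, 1, 4, 3, 5)
cor(x, y, method="pearson")
cosine(x - mean(x), y - mean(y)) # same value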
Fridolin Wild [email protected]
Leydesdorff, L. (2005) Similarity Measures, Author Cocitation Analysis, and Information Theory. In: JASIST 56(7), pp.769–772.
## the cosine measure between two vectors
vec1 = c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
vec2 = c( 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0 )
cosine(vec1, vec2)

## the cosine measure for all document vectors of a matrix
vec3 = c( 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0 )
matrix = cbind(vec1, vec2, vec3)
cosine(matrix)
Methods for choosing a ‘good’ number of singular values for the dimensionality reduction in LSA.
dimcalc_share(share=0.5)
dimcalc_ndocs(ndocs)
dimcalc_kaiser()
dimcalc_raw()
dimcalc_fraction(frac=(1/50))
share |
Optional: a fraction of the sum of the selected singular values to the sum of all singular values (default: 0.5). Only needed by dimcalc_share(). |
frac |
Optional: a fraction of the number of the singular values to be used (default: 1/50th). |
ndocs |
Optional: the number of documents (only needed for dimcalc_ndocs()). |
In an LSA process, the diagonal matrix of the singular value decomposition is usually reduced to a specific number of dimensions (also ‘factors’ or ‘singular values’).
The functions dimcalc_share(), dimcalc_ndocs(), dimcalc_kaiser() and also the redundant function dimcalc_raw() offer methods to calculate a useful number of singular values (based on the distribution and values of the given sequence of singular values).

All of them are tightly coupled to the core LSA functions: they generate a function to be executed by the calling (higher-level) function lsa(). The generated function takes only one parameter, namely s, which is expected to be the sequence of singular values. In lsa(), the returned code is executed and the mandatory singular values are provided as a parameter.

The dimensionality calculation methods can, however, still be called directly by adding a second, separate parameter set, e.g.:

dimcalc_share(share=0.2)(mysingularvalues)
The method dimcalc_share() finds the first position in the descending sequence of singular values s where their cumulative sum (divided by the sum of all values) meets or exceeds the specified share.

The method dimcalc_ndocs() calculates the first position in the descending sequence of singular values where their cumulative sum meets or exceeds the number of documents.

The method dimcalc_kaiser() calculates the number of singular values according to the Kaiser criterion, i.e., from the descending order of values, all values with s[n] > 1 are taken. The number of dimensions is returned accordingly.

The method dimcalc_fraction() returns the specified share of the number of singular values. By default, 1/50th of the available values is used and the determined number of singular values is returned.

The method dimcalc_raw() returns the maximum number of singular values (i.e., the length of s). It is included only for completeness.
Returns a function that takes the singular values as a parameter to return the recommended number of dimensions. The expected parameter of this function is
s |
A sequence of singular values (as produced by the SVD). Only needed when calling the dimensionality calculation routines directly. |
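Because lsa() simply calls the returned function on the vector of singular values, a custom heuristic can be plugged in the same way; a minimal sketch (dimcalc_max is a hypothetical name, not part of the package):

# cap the number of dimensions at a fixed maximum (illustrative)
dimcalc_max = function(k=10) {
  function(s) min(k, length(s))
}
# usable wherever the built-in calculators are:
# lsa(myMatrix, dims=dimcalc_max(10))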
Fridolin Wild [email protected]
Wild, F., Stahl, C., Stermsek, G., Neumann, G., Penya, Y. (2005) Parameters Driving Effectiveness of Automated Essay Scoring with LSA. In: Proceedings of the 9th CAA, pp.485-494, Loughborough
## create some data
vec1 = c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
vec2 = c( 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0 )
vec3 = c( 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0 )
matrix = cbind(vec1, vec2, vec3)
s = svd(matrix)$d

# standard share of 0.5
dimcalc_share()(s)

# specific share of 0.9
dimcalc_share(share=0.9)(s)

# meeting the number of documents (here: 3)
n = ncol(matrix)
dimcalc_ndocs(n)(s)
Additional documents can be mapped into a pre-existing latent semantic space without influencing the factor distribution of the space. This is applied when additional documents must not influence the previously calculated latent semantic factor structure.
fold_in( docvecs, LSAspace )
LSAspace |
a latent semantic space generated by lsa(). |
docvecs |
a textmatrix. |
To keep additional documents from influencing the factor distribution
calculated previously from a particular text basis, they can be folded-in
after the singular value decomposition performed in lsa()
.
Background Information:
For folding-in, a pseudo-document vector m_hat of the new documents is calculated as shown in equations (1) and (2) (cf. Berry et al., 1995):

(1) d_hat = v^T T_k S_k^-1

(2) m_hat = T_k S_k d_hat^T

The document vector v in equation (1) is identical to an additional column of an input textmatrix with the term frequencies of the essay to be folded in. T_k and S_k are the truncated matrices from the SVD applied through lsa() on a given text collection to construct the latent semantic space. The resulting vector m_hat from equation (2) is identical to an additional column in the textmatrix representation of the latent semantic space (as produced by as.textmatrix()). Be careful when using weighting schemes: you may want to use the global weights of the training textmatrix also for your new data that you fold in!
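A sketch of such a consistent weighting, reusing matrix1 and matrix2 from the Examples below; it assumes that gw_idf() yields per-term global weights that can be applied to any matrix over the same vocabulary:

# global weights derived from the training data only
idf = gw_idf(matrix1)
space1 = lsa(lw_logtf(matrix1) * idf, dims=dimcalc_share())

# apply the *training* weights to the new documents before folding in
fold_in(lw_logtf(matrix2) * idf, space1)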
textmatrix |
a textmatrix representation of the additional documents in the latent semantic space. |
Fridolin Wild [email protected]
textmatrix
, lsa
, as.textmatrix
# create a first textmatrix with some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )
matrix1 = textmatrix(td, minWordLength=1)
unlink(td, recursive=TRUE)

# create a second textmatrix with some more files
td = tempfile()
dir.create(td)
write( c("cat", "mouse", "mouse"), file=paste(td, "A1", sep="/") )
write( c("nothing", "mouse", "monster"), file=paste(td, "A2", sep="/") )
write( c("cat", "monster", "monster"), file=paste(td, "A3", sep="/") )
matrix2 = textmatrix(td, vocabulary=rownames(matrix1), minWordLength=1)
unlink(td, recursive=TRUE)

# create an LSA space from matrix1
space1 = lsa(matrix1, dims=dimcalc_share())
as.textmatrix(space1)

# fold matrix2 into the space generated by matrix1
fold_in( matrix2, space1)
Calculates a latent semantic space from a given document-term matrix.
lsa( x, dims=dimcalc_share() )
x |
a document-term matrix (recommended to be of class textmatrix), containing documents in columns, terms in rows, and occurrence frequencies in the cells. |
dims |
either the number of dimensions or a configuring function. |
LSA combines the classical vector space model — well known in textmining — with a Singular Value Decomposition (SVD), a two-mode factor analysis. Thereby, bag-of-words representations of texts can be mapped into a modified vector space that is assumed to reflect semantic structure.
With lsa()
a new latent semantic space can
be constructed over a given document-term matrix. To ease
comparisons of terms and documents with common
correlation measures, the space can be converted into
a textmatrix of the same format as x
by calling as.textmatrix()
.
To add more documents or queries to this latent semantic
space in order to keep them from influencing the original
factor distribution (i.e., the latent semantic structure calculated
from a primary text corpus), they can be ‘folded-in’ later on
(with the function fold_in()
).
Background information (see also Deerwester et al., 1990):

A document-term matrix M is constructed with textmatrix() from a given text base of n documents containing m terms. This matrix M of size m × n is then decomposed via a singular value decomposition into the term vector matrix T (constituting left singular vectors), the document vector matrix D (constituting right singular vectors), both being orthonormal, and the diagonal matrix S (constituting singular values):

M = T S D^T

These matrices are then reduced to the given number of dimensions k to result in the truncated matrices T_k, S_k and D_k: the latent semantic space. If these truncated matrices were multiplied, they would give a new matrix M_k (of the same format as M, i.e., rows are the same terms, columns are the same documents), which is the least-squares best fit approximation of M with k singular values:

M_k = T_k S_k D_k^T

In the case of folding-in, i.e., multiplying new documents into a given latent semantic space, the matrices T_k and S_k remain unchanged and an additional D_k is created (without replacing the old one). All three are multiplied together to return a (new and appendable) document-term matrix M_k in the term order of M.
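This identity can be checked directly on the components of a returned space (a sketch, reusing myMatrix from the Examples below and the documented component names tk, sk, dk):

space = lsa(myMatrix, dims=dimcalc_share())
# multiplying the truncated matrices reproduces as.textmatrix(space)
Mk = space$tk %*% diag(space$sk, nrow=length(space$sk)) %*% t(space$dk)
all.equal(Mk, as.textmatrix(space), check.attributes=FALSE)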
LSAspace |
a list with components (tk, sk, dk): the truncated term vector matrix tk, the reduced vector of singular values sk, and the truncated document vector matrix dk. |
Fridolin Wild [email protected]
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990) Indexing by Latent Semantic Analysis. In: Journal of the American Society for Information Science 41(6), pp. 391–407.
Landauer, T., Foltz, P., and Laham, D. (1998) Introduction to Latent Semantic Analysis. In: Discourse Processes 25, pp. 259–284.
as.textmatrix
, fold_in
, textmatrix
, gw_idf
, dimcalc_share
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("ham", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "pet", "pet"), file=paste(td, "D3", sep="/") )

# LSA
data(stopwords_en)
myMatrix = textmatrix(td, stopwords=stopwords_en)
myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
myLSAspace = lsa(myMatrix, dims=dimcalc_share())
as.textmatrix(myLSAspace)

# clean up
unlink(td, recursive=TRUE)
Display a one screen short version of a textmatrix.
## S3 method for class 'textmatrix'
print( x, bag_lines, bag_cols, ... )
x |
A textmatrix. |
bag_lines |
The number of lines per bag. |
bag_cols |
The number of columns per bag. |
... |
Arguments to be passed on. |
Document-term matrices are often very large and cannot be displayed completely on one screen. Therefore, the textmatrix print method displays only clippings (‘bags’) from this matrix.
Clippings are taken vertically and horizontally from the beginning, middle, and end of the matrix: bag_lines lines and bag_cols columns are printed to the screen.
To keep document titles from blowing up the display, the legend is printed below, referencing the symbols used in the table.
Fridolin Wild [email protected]
# fake a matrix
m = matrix(ncol=800, nrow=400)
m[1:length(m)] = 1:length(m)
colnames(m) = paste("D", 1:ncol(m), sep="")
rownames(m) = paste("W", 1:nrow(m), sep="")
class(m) = "textmatrix"

# show a short form of the matrix
print(m, bag_cols=5)
Create a query in the format of a given textmatrix.
query ( qtext, termlist, stemming=FALSE, language="german" )
termlist |
the termlist of the background latent-semantic space. |
language |
specifies a language for stemming / stop-word-removal. |
stemming |
boolean, specifies whether all terms will be reduced to their wordstems. |
qtext |
the query string, words are separated by blanks. |
Create queries, i.e., an additional term vector to be used for query-to-document comparisons, in the format of a given textmatrix.
query |
returns the query vector (based on the given vocabulary) as a matrix. |
Fridolin Wild [email protected]
# prepare some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )

# demonstrate generation of a query
dtm = textmatrix(td)
query("monster", rownames(dtm))
query("monster dog", rownames(dtm))

# clean up
unlink(td, TRUE)
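Queries are typically compared against the documents of a latent semantic space. A sketch of this workflow, reusing dtm from the example above; it assumes the query can be folded in like any other document vector over the same vocabulary:

# build a space and fold the query into it
space = lsa(dtm, dims=dimcalc_share())
q = fold_in(query("monster dog", rownames(dtm)), space)

# cosine closeness between the query and each document
docs = as.textmatrix(space)
sapply(colnames(docs), function(d) cosine(as.vector(q), docs[,d]))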
Creates a subset of the documents of a corpus to help reduce a corpus in size through random sampling.
sample.textmatrix(textmatrix, samplesize, index.return=FALSE)
textmatrix |
A document-term matrix. |
samplesize |
The desired number of documents in the sample. |
index.return |
if set to true, the positions of the subset in the original column vectors will be returned as well. |
Often a corpus is so big that it cannot be processed in memory. One technique to reduce the size is to select a subset of the documents randomly, assuming that through the random selection the nature of the term sets and distributions will not be changed.
filelist |
a list of filenames of the documents in the corpus. |
ix |
if index.return is set to TRUE, the positions of the sampled documents in the original column vectors are returned as well. |
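A sketch of retrieving the positions along with the sample, reusing myMatrix from the Examples below (the component name ix follows the value description above):

# also return the positions of the sampled documents
res = sample.textmatrix(myMatrix, 3, index.return=TRUE)
res$ix # positions in the original matrix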
Fridolin Wild [email protected]
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/"))
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/"))
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("dog", "mouse", "dog"), file=paste(td, "D4", sep="/"))

# create matrices
myMatrix = textmatrix(td, minWordLength=1)
sample(myMatrix, 3)

# clean up
unlink(td, recursive=TRUE)
This list contains entities (specialchars$entities) and their replacement characters (specialchars$replacement), as used by textvector to clean up HTML code: for example, this is used to replace the HTML entity &auml; with the characters ae. You can use this data set with data(specialchars).
data(specialchars)
A list of language-specific HTML entities and their replacements.
Fridolin Wild [email protected]
These data sets contain lists of very common words that should be ignored when building up a document-term matrix. The stop word lists can be loaded by calling data(stopwords_en), data(stopwords_de), data(stopwords_nl), data(stopwords_ar), etc. The objects stopwords_de, stopwords_en, stopwords_nl, stopwords_ar, etc. must already exist before being handed over to textmatrix().
The French stop word list has been compiled by Haykel Demnati by integrating the lists from rank.nl (www.rank.nl/stopwors/french.html), the one from the CLEF team at the University of Neuchatel (http://members.unine.ch/jacques.savoy/clef/frenchST.txt), and the one prepared by Jean Véronis (http://sites.univ-provence.fr/veronis/data/antidico.txt).

The Polish stop word list has been contributed by Grazyna Paliwoda-Pekosz, Cracow University of Economics, and is taken from the Polish Wikipedia.

The Arabic stop word list has been contributed by Marwa Naili, Tunisia. The list is based on the stop word lists by Shereen Khoja and by Siham Boulaknadel.
data(stopwords_de)
stopwords_de
data(stopwords_en)
stopwords_en
data(stopwords_nl)
stopwords_nl
data(stopwords_fr)
stopwords_fr
data(stopwords_ar)
stopwords_ar
A vector containing 424 English, 370 German, 260 Dutch, 890 French, or 434 Arabic stop words (e.g., 'he', 'she', 'a').
Fridolin Wild [email protected], Marco Kalz [email protected] (for Dutch), Haykel Demnati [email protected] (for French), Marwa Naili [email protected] (for Arabic)
Returns a summary with some statistical information about a given textmatrix.
## S3 method for class 'textmatrix'
summary( object, ... )
object |
A textmatrix. |
... |
Arguments to be passed on. |
Returns some statistical information about the textmatrix object: number of terms, number of documents, maximum length of a term, number of non-zero values, and number of terms containing strange characters.
matrix |
Returns a matrix. |
Fridolin Wild [email protected]
# fake a matrix
m = matrix(ncol=800, nrow=400)
m[1:length(m)] = 1:length(m)
colnames(m) = paste("D", 1:ncol(m), sep="")
rownames(m) = paste("W", 1:nrow(m), sep="")
class(m) = "textmatrix"

# show a summary of the matrix
summary(m)
Creates a document-term matrix from all textfiles in a given directory.
textmatrix( mydir, stemming=FALSE, language="english",
    minWordLength=2, maxWordLength=FALSE,
    minDocFreq=1, maxDocFreq=FALSE,
    minGlobFreq=FALSE, maxGlobFreq=FALSE,
    stopwords=NULL, vocabulary=NULL, phrases=NULL,
    removeXML=FALSE, removeNumbers=FALSE )

textvector( file, stemming=FALSE, language="english",
    minWordLength=2, maxWordLength=FALSE,
    minDocFreq=1, maxDocFreq=FALSE,
    stopwords=NULL, vocabulary=NULL, phrases=NULL,
    removeXML=FALSE, removeNumbers=FALSE )
file |
filename (may include path). |
mydir |
the directory path (e.g., |
stemming |
boolean indicating whether to reduce all terms to their wordstem. |
language |
specifies language for the stemming / stop-word-removal. |
minWordLength |
words with less than minWordLength characters will be ignored. |
maxWordLength |
words with more than maxWordLength characters will be ignored; per default set to FALSE (no upper limit). |
minDocFreq |
words of a document appearing less than minDocFreq within that document will be ignored. |
maxDocFreq |
words of a document appearing more often than maxDocFreq within that document will be ignored; per default set to FALSE (no upper limit). |
minGlobFreq |
words which appear in less than minGlobFreq documents will be ignored. |
maxGlobFreq |
words which appear in more than maxGlobFreq documents will be ignored. |
stopwords |
a stopword list that contains terms that will be ignored. |
vocabulary |
a character vector containing the words: only words in this term list will be used for building the matrix (‘controlled vocabulary’). |
removeXML |
if set to TRUE, XML/HTML markup is stripped from the documents. |
removeNumbers |
if set to TRUE, numbers are removed. |
phrases |
not implemented, yet. |
All documents in the specified directory are read and a matrix is composed. The matrix contains in every cell the exact number of appearances (i.e., the term frequency) of every word for all documents. If specified, simple text preprocessing mechanisms are applied (stemming, stopword filtering, wordlength cutoffs).
Stemming thereby uses Porter's snowball stemmer (from package SnowballC
).
There are two stopword lists included (for English and for German), which are loaded on demand into the variables stopwords_de and stopwords_en. They can be activated by calling data(stopwords_de) or data(stopwords_en). Attention: the stopword lists have to be loaded already when textmatrix() is called.
textvector()
is a support function that creates a list of
term-in-document occurrences.
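For illustration, the counts of a single file could be inspected directly with it (a sketch; td and the file D1 as in the Examples below):

# term-in-document occurrences of a single file
textvector(paste(td, "D1", sep="/"), minWordLength=1)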
For every generated matrix, a dedicated environment is added as an attribute, holding the triples that are stored by setTriple() and can be retrieved with getTriple().
If the language is set to "arabic", special characters for the Buckwalter transliteration will be kept.
textmatrix |
the document-term matrix (incl. row and column names). |
Fridolin Wild [email protected]
wordStem
, stopwords_de
, stopwords_en
, setTriple
, getTriple
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )

# read them, create a document-term matrix
textmatrix(td)

# read them, drop german stopwords
data(stopwords_de)
textmatrix(td, stopwords=stopwords_de)

# read them based on a controlled vocabulary
voc = c("dog", "mouse")
textmatrix(td, vocabulary=voc, minWordLength=1)

# clean up
unlink(td, recursive=TRUE)
Allows storing, managing, and retrieving SPO triples (subject, predicate, object) bound to the document columns of a document-term matrix.
getTriple( M, subject, predicate )
setTriple( M, subject, predicate, object )
delTriple( M, subject, predicate, object )
getSubjectId( M, subject )
M |
the document term matrix (see textmatrix()). |
subject |
column number or column name (e.g., 'doc1'). |
predicate |
predicate of the triple sentence (e.g., 'has_category'). |
object |
value of the triple sentence (e.g., '15'). |
SPO-Triples are simple facts of the uniform structure (subject, predicate, object). A subject is typically a document in the given document-term matrix M, i.e. its document title (as in the column names) or its column position. A key-value pair (the predicate and the object) can be bound to this subject.
This can be used, for example, to store classification information about the documents of the text base used.
The triple data is stored in the environment of M
constructed by textmatrix()
.
Whenever a matrix has to be used which has not been generated by this function, its class should be set to 'textmatrix' and an environment has to be added manually via:
class(mymatrix) = "textmatrix"
environment(mymatrix) = new.env()
Alternatively, as.matrix()
can be used to convert a
matrix to a textmatrix. To spare memory, the manual method
might be of advantage.
In getTriple()
, the arguments subject and predicate
are optional.
textmatrix |
the document-term matrix (including row and column names). |
Fridolin Wild [email protected]
x = matrix(2, 2, 3)                     # we fake a document term matrix
rownames(x) = c("dog", "mouse")         # fake term names
colnames(x) = c("doc1", "doc2", "doc3") # fake doc titles
class(x) = "textmatrix"                 # usually done by textmatrix()
environment(x) = new.env()              # usually done by textmatrix()

setTriple(x, "doc1", "has_category", "15")
setTriple(x, "doc2", "has_category", "7")
setTriple(x, "doc1", "has_grade", "5")
setTriple(x, "doc1", "has_category", "11")

getTriple(x, "doc1")
getTriple(x, "doc1")[[2]]
getTriple(x, "doc1", "has_category") # -> [1] "15" "11"

delTriple(x, "doc1", "has_category", "15")
getTriple(x, "doc1", "has_category") # -> [1] "11"
Calculates a weighted document-term matrix according to the chosen local and/or global weighting scheme.
lw_tf(m)
lw_logtf(m)
lw_bintf(m)
gw_normalisation(m)
gw_idf(m)
gw_gfidf(m)
entropy(m)
gw_entropy(m)
m |
a document-term matrix. |
A local and a global weighting scheme are combined and applied to a given textmatrix m via

dw = lw(m) * gw(m)

where m is the given document-term matrix, lw is one of the local weight functions lw_tf(), lw_logtf(), lw_bintf(), and gw is one of the global weight functions gw_normalisation(), gw_idf(), gw_gfidf(), entropy(), gw_entropy().
This set of weighting schemes includes the local weightings (lw) raw, log, binary and the global weightings (gw) normalisation, two versions of the inverse document frequency (idf), and entropy in both the original Shannon as well as in a slightly modified, more common version:
lw_tf() returns a completely unmodified matrix (placebo function).

lw_logtf() returns the logarithmised matrix: log(m[i,j] + 1) is applied to every cell.

lw_bintf() returns binary values of the matrix: every cell is assigned 1 iff the term frequency is not equal to 0.

gw_normalisation() returns a normalised matrix: every cell equals 1 divided by the square root of the document vector length.

gw_idf() returns the inverse document frequency: every cell is 1 plus the logarithm of the number of documents divided by the number of documents in which the term appears.

gw_gfidf() returns the global frequency multiplied with idf: every cell equals the sum of the frequencies of one term divided by the number of documents in which the term shows up.

entropy() returns the entropy (as defined by Shannon).

gw_entropy() returns one plus entropy.
Be careful when folding new data into an existing LSA space: you may want to weight an additional textmatrix based on the same vocabulary with the global weights of the training data (not the new data)!
Returns the weighted textmatrix of the same size and format as the input matrix.
Fridolin Wild [email protected]
Dumais, S. (1992) Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval. Technical Report, Bellcore.
Nakov, P., Popova, A., and Mateev, P. (2001) Weight functions impact on LSA performance. In: Proceedings of the Recent Advances in Natural language processing, Bulgaria, pp.187-193.
Shannon, C. (1948) A Mathematical Theory of Communication. In: The Bell System Technical Journal 27(July), pp.379–423.
## use the logarithmised term frequency as local weight and
## the inverse document frequency as global weight.
vec1 = c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
vec2 = c( 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0 )
vec3 = c( 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0 )
matrix = cbind(vec1, vec2, vec3)
weighted = lw_logtf(matrix) * gw_idf(matrix)