Sequential Text-Term Selection in Vector Space Models
【Publication Time】2021.01.02
【Lead Author】Feifei Wang
【Corresponding Author】Jingyuan Liu
【Journal】JOURNAL OF BUSINESS & ECONOMIC STATISTICS
【Abstract】
Text mining has recently attracted a great deal of attention as text documents
accumulate in all fields. In this article, we focus on using textual
information to explain continuous variables within a linear regression
framework. To handle unstructured text, one common practice is to structure the
text documents via vector space models. However, whether words or phrases
should serve as the basic analysis terms in vector space models remains highly
debated. In addition, vector space models often produce an extremely large term
set and suffer from the curse of dimensionality, which makes term selection
both important and necessary. To this end, we propose a novel term screening
method for vector space models under a linear regression setup. We first split
the entire term space into subspaces according to term length and then conduct
term screening sequentially. We prove the screening consistency of the method
and assess its empirical performance with simulations based on a dataset of
online consumer reviews for cellphones. We then analyze the associated real
data. The results show that the sequential term selection technique can
effectively detect the relevant terms within a few steps.
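The idea of splitting the term space by term length and screening one subspace
at a time can be illustrated with a minimal sketch. The abstract does not give
the screening statistic or say how later steps use earlier selections, so the
sketch assumes plain absolute marginal correlation with the response and
screens each length independently. The function names, the toy reviews, and the
ratings are all hypothetical, and this is not the authors' procedure.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the n-word terms (n-grams) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_columns(docs, n):
    """Map each length-n term to its per-document count vector."""
    counts = [Counter(ngrams(doc.lower().split(), n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    return {term: [c[term] for c in counts] for term in vocab}


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def sequential_screen(docs, y, max_len=2, keep=1):
    """Screen terms subspace by subspace, ordered by term length.

    Assumption: each subspace is ranked by marginal correlation with y,
    a stand-in for the screening statistic used in the article.
    """
    selected = {}
    for n in range(1, max_len + 1):  # length 1 first, then length 2, ...
        cols = term_columns(docs, n)
        ranked = sorted(cols, key=lambda t: abs_corr(cols[t], y), reverse=True)
        selected[n] = ranked[:keep]
    return selected


# Hypothetical toy reviews with made-up ratings as the continuous response.
docs = [
    "great phone",
    "great battery",
    "great screen",
    "not great phone",
    "not great battery",
    "poor screen",
]
ratings = [5, 5, 5, 1, 1, 1]

print(sequential_screen(docs, ratings))
```

On this toy data the one-word step keeps "not" and the two-word step keeps
"not great", which shows why phrases can carry signal that single words such as
"great" blur.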
【Keywords】
Screening consistency, Term selection, Text mining, Vector space models