In the era of information and digital technologies, massive amounts of data are being
generated at almost every level of application in almost every discipline. Quite often
the data collected from complex phenomena represent the integrated result of several
interrelated variables, yet these variables are themselves imprecisely defined.
Ambiguities in the variables, noise in the measurements, variations introduced during
parsing and indexing, and even the absence of some information render the data
considerably uncertain. Without parallel gains in techniques for effectively organizing
or sorting such data, the gains in the amount of information would simply be a deluge.
Extracting interesting information from raw data, generally known as data mining,
therefore becomes an indispensable task.
The principal objective in data mining is to determine which variables are related to
which others, and how they are related. In many situations the digitized information is
gathered and stored as a data matrix.
It is often the case, or so assumed, that the endogenous variables depend on the
exogenous variables through a linear relationship. Retrieving useful information can
therefore often be characterized as finding a suitable matrix factorization.
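
As a minimal sketch of this idea, suppose the data matrix Y is produced by a few hidden factors through a linear model, Y = F A + noise. The sizes, the noise level, and the names `F`, `A`, and `truncated_svd` below are assumptions chosen for illustration, not part of the talk. The truncated singular value decomposition is used here as one possible factorization: it gives the best rank-k approximation in the Frobenius norm and, in this synthetic setting, also removes much of the noise.

```python
import numpy as np

def truncated_svd(Y, k):
    """Return the best rank-k approximation of Y in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Keep only the k largest singular triplets.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
n_obs, n_vars, n_factors = 200, 12, 3   # hypothetical sizes

F = rng.standard_normal((n_obs, n_factors))    # hidden factor scores (assumed)
A = rng.standard_normal((n_factors, n_vars))   # factor loadings (assumed)
clean = F @ A                                  # exact linear relationship
Y = clean + 0.05 * rng.standard_normal(clean.shape)  # observed, noisy data

Y_k = truncated_svd(Y, n_factors)

# Distance to the noise-free matrix before and after truncation.
err_noisy = np.linalg.norm(Y - clean)
err_denoised = np.linalg.norm(Y_k - clean)
print(f"error of raw data:      {err_noisy:.3f}")
print(f"error after truncation: {err_denoised:.3f}")
```

In practice the number of factors is not known in advance; a common heuristic is to look for a gap in the singular values of Y.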
In this talk, we offer a synoptic view of how linear algebra techniques can help carry
out the task of data mining. Examples from factor analysis, cluster analysis, and latent
semantic indexing are used to demonstrate how matrix factorization helps uncover hidden
connections and do so efficiently. Low-rank matrix approximation plays a fundamental
role in cleaning and compressing the data. Other types of constraints, such as
nonnegativity, will also be briefly discussed. Finally, link analysis, viewed as a
dynamical system, is a necessary tool by which we classify and rank the trust and
significance of retrieved data.
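
To illustrate how link analysis can be viewed as a dynamical system, here is a minimal sketch of a PageRank-style power iteration. The abstract does not name a specific algorithm, so the choice of this iteration, the function name `rank_pages`, the damping value, and the tiny link graph are all assumptions made for the example.

```python
import numpy as np

def rank_pages(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for a PageRank-style score vector.

    adjacency[i][j] = 1 means item i links to item j.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    # Row-stochastic transition matrix; items without outlinks jump uniformly.
    P = np.where(out_degree[:, None] > 0,
                 A / np.maximum(out_degree, 1)[:, None],
                 1.0 / n)
    r = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the discrete dynamical system r <- d * r P + (1 - d) / n.
        r_next = damping * (r @ P) + (1 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Hypothetical four-item link graph: item 2 is linked to by items 0, 1, and 3.
links = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
scores = rank_pages(links)
print(scores)
```

The iteration converges to the stationary vector of the damped transition matrix, so the ranking is the fixed point of the system rather than the result of a single matrix factorization.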