Graph of a set of 1000 documents, arranged by similarity.

How can you organize and extract information from huge data sets of digital text documents? How do you develop automatic recommendations based on preference history? Could Google ever personalize your search results to reflect your previous interests? Essentially, this is an applied math problem, and it uses a relatively new technique called diffusion geometry. Here’s an article by Ronald Coifman and Mauro Maggioni, two leaders in the field.

Basically, we want to understand the geometry of the data. We would like the data points to lie on a manifold — a smooth surface in high-dimensional space, representing some relatively simple rule (just as a sphere represents the rule “all points are the same distance from the origin”). Then we want to learn functions on the data from a few samples, with the goal of predicting the values of those functions at new points.

The first step is to define a “similarity” function on the data. For example, if you have a thousand articles, a data point might be a vector of the frequency of each word in the article (note that these vectors are huge!), and the similarity between two documents might be the correlation between their word vectors when it exceeds 0.95, and zero otherwise.
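A minimal NumPy sketch of this step. The documents, vocabulary, and helper names here are made up for illustration; the toy documents are so short that their correlations stay below 0.95, so this demo uses a lower cutoff of 0.9 to show a surviving link.

```python
import numpy as np

def word_vectors(docs, vocab):
    """Count how often each vocabulary word appears in each document."""
    return np.array([[doc.split().count(w) for w in vocab] for doc in docs],
                    dtype=float)

def similarity(X, threshold=0.95):
    """Pairwise correlation of word vectors, zeroed below the threshold."""
    C = np.corrcoef(X)        # (n_docs, n_docs) correlation matrix
    C[C < threshold] = 0.0    # keep only strong similarities
    np.fill_diagonal(C, 0.0)  # no self-similarity
    return C

docs = ["the cat sat on the mat the cat purred",
        "the cat sat on the mat the cat slept",
        "stocks fell as markets closed"]
vocab = sorted(set(" ".join(docs).split()))
X = word_vectors(docs, vocab)
S = similarity(X, threshold=0.9)  # toy docs need a looser cutoff than 0.95
```

The two cat articles end up linked (their correlation is about 0.90), while the finance article is connected to nothing — exactly the sparse, thresholded graph the paragraph describes.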

Then, if we normalize each row of the similarity matrix to sum to 1, we get a Markov chain; it describes a random walk, the probability of “jumping” from one document to a similar one. We can imagine a drunken ant crawling at random from one node to another, hitting similar, nearby nodes most often but occasionally venturing to farther ones.
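The normalization and the drunken-ant walk can be sketched as follows; the 4-document similarity matrix is a hypothetical toy example, not data from the article.

```python
import numpy as np

# Toy symmetric similarity matrix (hypothetical values): documents 0-2 are
# mutually similar; document 3 hangs off document 2 by a weak 0.2 link.
S = np.array([[0.0, 0.9, 0.8, 0.0],
              [0.9, 0.0, 0.7, 0.0],
              [0.8, 0.7, 0.0, 0.2],
              [0.0, 0.0, 0.2, 0.0]])

# Normalize each row to sum to 1: P[i, j] is the probability that the
# random walk steps from document i to document j.
P = S / S.sum(axis=1, keepdims=True)

# Simulate the "drunken ant": one long random walk over the graph.
rng = np.random.default_rng(0)
node = 0
visits = np.zeros(4, dtype=int)
for _ in range(10_000):
    node = rng.choice(4, p=P[node])  # take one random step
    visits[node] += 1
```

After many steps the ant has spent most of its time in the tight 0–1–2 cluster and only rarely visited the weakly connected document 3, which is the intuition behind diffusion distances.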

The eigenvectors of this Markov chain can be used as new coordinates for the data set; we can map the data into a low-dimensional space so that the distances between points approximate the “diffusion distances” on the original data — roughly, how easily the random walk travels from one node to another. This gives the desired representation of the data on a low-dimensional surface.
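Here is one way the embedding step might look, again on a hypothetical 4-document similarity matrix. A standard trick (not spelled out in the post) is to diagonalize the symmetric matrix D^{-1/2} S D^{-1/2}, which shares its eigenvalues with the Markov matrix P = D^{-1} S.

```python
import numpy as np

# Toy symmetric similarity matrix (hypothetical values) for 4 documents.
S = np.array([[0.0, 0.9, 0.8, 0.0],
              [0.9, 0.0, 0.7, 0.0],
              [0.8, 0.7, 0.0, 0.2],
              [0.0, 0.0, 0.2, 0.0]])

d = S.sum(axis=1)                  # node "degrees"
M = S / np.sqrt(np.outer(d, d))    # symmetric conjugate of the Markov matrix
lam, V = np.linalg.eigh(M)         # eigenvalues in ascending order
order = np.argsort(-lam)           # re-sort descending
lam, V = lam[order], V[:, order]
psi = V / np.sqrt(d)[:, None]      # right eigenvectors of P = D^{-1} S

t = 2                              # number of diffusion steps
# Skip the trivial constant eigenvector; the next ones, scaled by lam**t,
# give each document its new low-dimensional coordinates.
coords = (lam[1:3] ** t) * psi[:, 1:3]
```

Euclidean distance between rows of `coords` then stands in for diffusion distance: the tightly linked documents 0 and 1 land close together, while the weakly attached document 3 lands far away.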

What this means is that the data can create its own rules and categories. Instead of labeling articles “business” or “sports,” we’ll see a cluster in the data that forms its own label. It’s a kind of spontaneous organization.
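A tiny sketch of that spontaneous organization, on a hypothetical 6-document similarity matrix with two dense groups joined by one weak link: the sign pattern of the second eigenvector splits the documents into two clusters without anyone supplying labels.

```python
import numpy as np

# Hypothetical similarity matrix: docs 0-2 share one topic, docs 3-5
# another, with a single weak 0.1 link between the groups.
S = np.array([[0.0, 0.9, 0.8, 0.1, 0.0, 0.0],
              [0.9, 0.0, 0.7, 0.0, 0.0, 0.0],
              [0.8, 0.7, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 0.9, 0.8],
              [0.0, 0.0, 0.0, 0.9, 0.0, 0.7],
              [0.0, 0.0, 0.0, 0.8, 0.7, 0.0]])

d = S.sum(axis=1)
M = S / np.sqrt(np.outer(d, d))    # symmetric form of the Markov matrix
lam, V = np.linalg.eigh(M)         # ascending eigenvalues
psi2 = V[:, -2] / np.sqrt(d)       # second eigenvector of the Markov chain
labels = (psi2 > 0).astype(int)    # its sign separates the two clusters
```

No one told the algorithm “these are two topics”; the two categories emerge from the geometry of the similarity graph itself.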

Here’s a nifty tutorial from Duke; for more mathematical detail, check out this article.
