Microsoft latest contribution to open source, Space Partition Tree And Graph

(SPTAG), is an implementation of the approximate nearest neighbor search (NNS) algorithm that is used in Microsoft Bing search engine

.

In sheer mathematical terms, SPTAG is able to efficiently find those vectors in a given set that minimize the distance from a query vector. In reality, SPTAG does an approximate NN search, meaning it takes a guess at which vectors are the nearest neighbors and does not guarantee to return the actual nearest neighbors. This, in exchange, improves the algorithms requirements in terms of memory and time consumption.

Hands down, this means you can use SPTAG to search for similar data, whereby vectors are a numerical representation of your data points that can be organized according to their relative distance. Now, if you chose wisely a mapping from your data points to a vector space, that distance can be assumed to be representative of the similarity between the corresponding data points.

Vectorizing your data has an interesting property in comparison with traditional keyword search, in that it enables comparing heterogeneous data points such as text, page snippets, images, and other media. All of them are mapped to vectors the algorithm treats without concern for their original format. This has a somewhat striking effect, as Microsoft explains:

For example, if a user types in "How tall is the tower in Paris?" Bing can return a natural language result telling the user the Eiffel Tower is 1,063 feet, even though the word "Eiffel" never appeared in the search query and the word "tall" never appears in the result.

SPTAG is not just a lab experiment and its strength has been largely stressed in Bing, which handles more than 150 billions data points that are vectorized and fed to SPTAG. According to Bing program manager Jeffrey Zhu, SPTAG is potentially able to provide the most related results searching through 100 billion vectors in less than 5 milliseconds.

Of course, it should not go unnoticed that mapping your data points to vectors is a key step towards using vector search effectively, and Microsoft algorithm will not be of any help as to that. Moreover, Microsoft is understandably not providing any insights about how Bing maps Web data to vectors, which is where part of the magics behind search lies. At this respect, Microsoft only says Bing uses deep neural networks to train the mapping and considers user-clicks on search results to better understand their meaning and their relevance to the original user query. A known algorithm that is used to map a text corpus to a vector space is Word2vec

, with implementations available in a number of languages.

SPTAG is not the only approximate NNS algorithm implementation out there. In particular, Facebook Research released their implementation of a similar algorithm

, called Faiss

, that also touts great performance on billion vector datasets.

【阅读原文...】