[Other] Google Patents Context Vectors to Improve Search

Posted by 沧海桑田 on 2016-10-5 17:11:26
      For example, a horse to a rancher is an animal. A horse to a carpenter is an implement of work. A horse to a gymnast is an implement on which to perform certain exercises.
  One of the limitations of information on the Web is that it is organized differently at each site. As a newly granted Google patent notes, there is no official catalog of information available on the internet, and each site has its own organizational system. Search engines exist to index information, but, as this new patent describes, they have issues that make finding information challenging.
  Limitations on Conventional Keyword-Based Search Engines

  The patent, granted to Google in September of 2016, discusses a way to organize and index information on the Web more effectively. The patent describes limitations of search engines that are based upon indexing content using keywords, such as:
  
       
  • A search engine working with conventional keyword searching will return all instances of the keyword being searched for, regardless of how that word is used on a site. This can be a lot of results.
  • Conventional search engines may return only the home page of a site that contains the keyword. Finding where the keyword is used on the site could be difficult.
  • Often a conventional search engine will return a list of URLs in response to a keyword search that may be difficult to modify or search further in a meaningful manner.
  • Information obtained through a search can become dated quickly. Such information may need to be checked up on.
  The patent tells us about those limitations and also points out some of the limitations of directories that could also be used to help find information. It then goes on to provide a possible solution to this problem, with a “data extraction tool” capable of providing many of the benefits of both search engines and directories, without the drawbacks that this patent points out.
  Is This the Google Search Engine with RankBrain Inside?

  A search engine based on a data extraction tool like the one described in the patent would be an improvement over most search engines. Is this Google’s search engine with RankBrain applied to it? It’s possible that it is, though the patent doesn’t use the word RankBrain.
  The Bloomberg introduction to RankBrain, Google Turning Its Lucrative Web Search Over to AI Machines, provides information about the algorithm used in RankBrain, and it tells us:
  RankBrain uses artificial intelligence to embed vast amounts of written language into mathematical entities — called vectors — that the computer can understand.
  This new patent refers to what it calls Context Vectors to index content about words found on the Web. To put it clearly, the patent tells us:
  In view of the foregoing, in accordance with the invention as embodied and broadly described herein, a method and apparatus are disclosed in one embodiment of the present invention for determining contexts of information analyzed. Contexts may be determined for words, expressions, and other combinations of words in bodies of knowledge such as encyclopedias. Analysis of use provides a division of the universe of communication or information into domains and selects words or expressions unique to those domains of subject matter as an aid in classifying information. A vocabulary list is created with a macro-context (context vector) for each, dependent upon the number of occurrences of unique terms from a domain, over each of the domains. This system may be used to find information or classify information by subsequent inputs of text, in calculation of macro-contexts, with ultimate determination of lists of micro-contexts including terms closely aligned with the subject matter.
  When a searcher submits a query to a search engine, we are told that the search engine may draw on “other queries from the same user, the query associated with other information or query results from the same user, or other inputs related to that user” to give it more context.
  The patent is:
   User-context-based search engine
  Inventors: David C. Taylor
  Application Date: 14.09.2012
  Grant Number: 09449105
  Grant Date: 20.09.2016
  Abstract:
  A method and apparatus for determining contexts of information analyzed. Contexts may be determined for words, expressions, and other combinations of words in bodies of knowledge such as encyclopedias. Analysis of use provides a division of the universe of communication or information into domains and selects words or expressions unique to those domains of subject matter as an aid in classifying information. A vocabulary list is created with a macro-context (context vector) for each, dependent upon the number of occurrences of unique terms from a domain, over each of the domains. This system may be used to find information or classify information by subsequent inputs of text, in calculation of macro-contexts, with ultimate determination of lists of micro-contexts including terms closely aligned with the subject matter.
  When RankBrain was first announced, I found a patent, co-invented by one of the members of the team working on it, that described how Google might provide substitutions for some query terms, based upon an understanding of the context of those terms and the other words used in a query. I wrote about that patent in the post Investigating Google RankBrain and Query Term Substitutions. I think reading both that patent and the one this post is about can be helpful in understanding some of the ideas behind a process such as RankBrain.
  This patent provides a lot of insight into the importance of context, and into how helpful context can be to a system that attempts to extract data from a source and index that data in a way that makes it easier to locate. I liked this passage in particular:
  Interestingly, some words in the English language, and other languages pertain to many different areas of subject matter. Thus, one may think of the universe of communication as containing numerous domains of subject matter. For example, the various domains in FIG. 2 refer to centers of meaning or subject matter areas. These domains are represented as somewhat indistinct clouds, in that they may accumulate a vocabulary of communication elements about them that pertain to them or that relate to them. Nevertheless, some of those same communication elements may also have application elsewhere. For example, a horse to a rancher is an animal. A horse to a carpenter is an implement of work. A horse to a gymnast is an implement on which to perform certain exercises. Thus, the communication element that we call “horse” belongs to, or pertains to, multiple domains.
  A search engine that can identify the domains or contexts that a word might fit within may be able to better index such words, as described in this patent:
  In an apparatus and method in accordance with the invention, a search engine process is developed that provides a deterministic method for establishing context for the communication elements submitted in a query. Thus, it is possible for a search engine now to determine to which domain or domains a communication element is “attracted.” Since few things are absolute, domains may actually overlap or be very close such that they may share certain communication elements. That is, communication elements do not “belong” to any domain, they are attracted to or have an affinity for various domains, and may have differing degrees of affinity for differing domains. One may think of this affinity as perhaps a goodness of fit or a best alignment or quality alignment with the subject matter of a particular domain.
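  To make the idea of “differing degrees of affinity” a little more concrete, here is a minimal Python sketch. The domain names, the counts, and the affinities helper are all hypothetical and do not come from the patent; the sketch only assumes that a term carries one count per domain and that dividing each count by the total gives a rough degree of affinity.

```python
def affinities(domain_counts):
    """Turn raw per-domain counts into affinity shares that sum to 1."""
    total = sum(domain_counts.values())
    if total == 0:
        # A term never seen near any domain vocabulary has no affinity.
        return {domain: 0.0 for domain in domain_counts}
    return {domain: count / total for domain, count in domain_counts.items()}


# Hypothetical counts of how often "horse" shows up alongside terms
# unique to each domain (illustrative numbers only, not patent data).
horse_counts = {"ranching": 6, "carpentry": 3, "gymnastics": 1}

horse_affinity = affinities(horse_counts)
strongest_domain = max(horse_affinity, key=horse_affinity.get)
```

  In this toy setup “horse” is attracted to all three domains at once, with ranching the strongest, which is the sense in which a term does not “belong” to a single domain.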
  Contextually Rewarding Search Results

  The patent tells us that a search engine that works well is one that provides a searcher with information in response to a query that is “comparatively close related”: first, information that is exactly what has been sought, and then information that is close to what has been sought and is still useful. It then tells us that what would be “contextually unrewarding” is information that shares a word with the query but uses it in a completely different and useless context.
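  One way to picture the difference between rewarding and unrewarding results is to compare a query’s context vector with each document’s context vector. The Python sketch below uses cosine similarity for that comparison; this is my own stand-in, not a measure the patent names, and the domains, vectors, and document names are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_context(query_vector, documents):
    """Sort (name, vector) pairs from most to least contextually similar."""
    scored = [(cosine(query_vector, vec), name) for name, vec in documents]
    return [name for score, name in sorted(scored, reverse=True)]


# Hypothetical domains, in order: ranching, carpentry, gymnastics.
query = [5, 0, 1]  # a "horse" query that leans toward ranching
documents = [
    ("sawhorse-plans", [0, 7, 0]),
    ("horse-breeds", [8, 0, 1]),
    ("pommel-horse-drills", [1, 0, 9]),
]

ranking = rank_by_context(query, documents)
```

  Here the ranching page comes first, the gymnastics page shares a little context with the query, and the carpentry page shares only the word, which is the “contextually unrewarding” case.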
  Words might be related to a wide range of particular fields or subject matter domains. The patent describes how these might be used:
  Typically, a domain list of about 40 to 50 terms have been found to be effective. Some domain lists have been operated successfully in an apparatus and method in accordance with the invention with as few as 10 terms. Some domain lists may contain a few hundreds of individual terms. For example, some domains may justify about 300 terms. Although the method is deterministic, rather than statistical, it is helpful to have about 40 to 50 terms in the domain list in order to improve the efficiency of the calculations and determinations of the method.
  The domain lists have utility in quickly identifying the particular domain to which their members pertain. This results from the lack of commonality of the terms and the lack of ambiguity as to domains to which they may have utility. By the same token, a list as small as the domain lists are necessarily limited when considering the overall vocabulary of communication elements available in any language. Thus, the terms in domain lists do not necessarily arise with the frequency that is most useful for rapid searching. That is, a word that is unique to a particular subject matter domain, but infrequently used, may not arise in very many queries submitted to a search engine.
  A process for creating a vocabulary list of a substantial universe or a substantial portion of a universe of communication elements may be performed by identifying a body or corpus of information organized by topical entries. Thereafter, the text of each of those entries identified may be subjected to a counting process in which occurrences of terms from the domain list occur within each of the topical entries. Ultimately, a calculation of a macro context may be made for each of the topical entries. This calculation is based on the domain lists, and the domains represented thereby.
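  The counting process in that passage can be sketched in a few lines of Python. Everything named here (DOMAIN_LISTS, macro_context, build_vocabulary, the toy corpus) is hypothetical; the sketch assumes exact word matching with no stemming, and real domain lists would be far longer than three terms.

```python
import re

# Hypothetical domain lists; the patent suggests roughly 40 to 50 terms each.
DOMAIN_LISTS = {
    "ranching": {"cattle", "pasture", "saddle"},
    "carpentry": {"lumber", "saw", "chisel"},
}


def macro_context(text, domain_lists):
    """Count, per domain, how many words in text come from that domain's list."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        domain: sum(1 for word in words if word in terms)
        for domain, terms in domain_lists.items()
    }


def build_vocabulary(corpus, domain_lists):
    """Map each topical entry title to the macro-context of its text."""
    return {title: macro_context(text, domain_lists) for title, text in corpus.items()}


# A toy corpus standing in for an encyclopedia organized by topical entries.
corpus = {
    "horse": "Ranchers saddle a horse to move cattle across the pasture.",
    "sawhorse": "A sawhorse holds lumber steady while you saw it.",
}

vocabulary = build_vocabulary(corpus, DOMAIN_LISTS)
```

  Each topical entry ends up with one count per domain, which is the shape of the macro-context (context vector) the passage describes.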
  This is where this patent enters into the world of the Semantic Web. The places where different subject matter domains may be identified for different words could be in knowledge bases or online encyclopedias. Such collections of what is referred to as public knowledge might be called a “corpus”. This kind of corpus of information could be used to create a context vector used to index different meanings of words.
  When a different meaning is found, it might then be counted from that information corpus. The patent tells us that terms found in such a place could be “individual words, terms, expressions, phrases, and so forth.”
  The patent attempts to put this into context for us with this statement:
  One may think of a topical entry as a vocabulary term. That is, every topical entry is a vocabulary word, expression, place, person, etc. that will be added to the overall vocabulary. That is, for example, the universe may be divided into about 100 to 120 domains for convenient navigation. Likewise, the domain lists may themselves contain from about 10 to about 300 select terms each. By contrast, the topical entries that may be included in the build of a vocabulary list may include the number of terms one would find in a dictionary such as 300 to 800,000. Less may be used, and conceivably more. Nevertheless, unabridged dictionaries and encyclopedias typically have on this order of numbers of entries.
  Contexts as Vectors

  When RankBrain first came out, there was a post published that looked at some information that might make it a little more understandable; it included some information about Geoffrey Hinton’s Thought Vectors, and there’s more about those in this post from Jennifer Slegg: RankBrain: Everything We Know About Google’s AI Algorithm.
  There is a closely related Google Open Source Blog post on Word Vectors, titled Learning the meaning behind words, written by Tomas Mikolov, Ilya Sutskever, and Quoc Le. Ilya Sutskever was a student of Geoffrey Hinton. Tomas Mikolov worked on a number of papers about word vectors while with the Google Brain team, including Efficient Estimation of Word Representations in Vector Space.
  The patent spends a fair amount of time describing what it considers context vectors to be: the different domains into which a word might fall, and the number of occurrences or weights for that word within those domains. It’s worth drilling down into the patent and reading about how a search engine might label terms with context vectors.
  When a searcher enters a query into a search engine, the query may be classified within contexts to help select information in response to that query.
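  As a rough sketch of how a query might be classified within contexts, the Python below adds up the context vectors of the query’s terms and picks the domain with the largest total. The vocabulary, the three domains, and the classify_query function are hypothetical, and summing term vectors is an assumption on my part rather than the patent’s stated calculation.

```python
# Hypothetical vocabulary: each term maps to counts over three made-up
# domains, in the order (ranching, carpentry, gymnastics).
DOMAINS = ("ranching", "carpentry", "gymnastics")
VOCABULARY = {
    "horse": (6, 3, 1),
    "saddle": (5, 0, 0),
    "vault": (0, 0, 4),
    "pommel": (0, 0, 6),
}


def classify_query(query, vocabulary=VOCABULARY, domains=DOMAINS):
    """Sum the context vectors of known query terms and pick the top domain."""
    totals = [0] * len(domains)
    for term in query.lower().split():
        # Terms missing from the vocabulary contribute nothing.
        vector = vocabulary.get(term, (0,) * len(domains))
        totals = [t + v for t, v in zip(totals, vector)]
    if not any(totals):
        return None, totals
    best = domains[totals.index(max(totals))]
    return best, totals


gymnastics_query = classify_query("pommel horse")
ranching_query = classify_query("horse saddle")
```

  The word “horse” alone leans toward ranching in this toy vocabulary, but the other term in the query is what settles the context.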
  Using a Browser Helper Object

  The patent describes how it might identify different domains that might be associated with specific terms. It tells us that this might be done:
  By compiling a list of domain-specific questions, it is possible to (1) specify differences between very similar domains with great precision, and (2) create a rapid way to prototype a domain that does not require many hours of an expert’s time, and can be expanded by relatively inexperienced people.
  The patent also describes the use of a BHO (Browser Helper Object) in this manner:
  Another slightly more complex implementation is something like a Browser Helper Object (BHO) that runs on the user’s machine and watches/categorizes all surfing activity. With this system, even non-participating sites can contribute to the picture of the user, and any clicking the user does to ad sites served by certified clicks will pick up a much more comprehensive picture.
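  The “watches/categorizes all surfing activity” idea can be pictured as a running tally of domain counts across the pages a user visits. The Python sketch below is not a Browser Helper Object and uses no browser API; the SurfingProfile class, the domain lists, and the page texts are all hypothetical, and pages are simply passed in as strings.

```python
from collections import Counter

# Hypothetical domain lists used to categorize each visited page.
DOMAIN_LISTS = {
    "ranching": {"cattle", "pasture", "saddle"},
    "gymnastics": {"vault", "pommel", "dismount"},
}


class SurfingProfile:
    """Accumulates per-domain term counts across the pages a user visits."""

    def __init__(self, domain_lists):
        self.domain_lists = domain_lists
        self.counts = Counter()

    def observe(self, page_text):
        """Categorize one page and fold its counts into the running profile."""
        words = page_text.lower().split()
        for domain, terms in self.domain_lists.items():
            self.counts[domain] += sum(1 for word in words if word in terms)

    def dominant_domain(self):
        """Return the domain seen most so far, or None before any match."""
        if not self.counts or max(self.counts.values()) == 0:
            return None
        return self.counts.most_common(1)[0][0]


profile = SurfingProfile(DOMAIN_LISTS)
profile.observe("pommel horse routine with a clean dismount")
profile.observe("vault practice schedule")
profile.observe("choosing a saddle for your horse")
```

  After these three visits the profile leans toward gymnastics, so a later bare query for “horse” from this user could be read in that context.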
  The patent provides more details on how this context-vector-based system might work, and how data might be extracted from web pages. It is highly recommended reading if you want to get a better sense of how a context-based system might be used to index the web and to make specific information on the Web easier to find, improving upon most conventional keyword-based search engines.
侯雪燕 posted on 2016-10-5 18:15:29
If you don't court trouble, you won't get hurt (No Zuo No Die).

贺鹏 posted on 2016-10-5 19:54:04
Don't envy me, don't worship me, and certainly don't become my fan; 贺鹏 is not someone returning home, only a passer-by.

稻草人123456 posted on 2016-10-5 22:53:06
沧海桑田 is really popular!

hwwu8934 posted on 2016-10-20 08:50:12
The scene is too beautiful; I don't dare look.

贾乔 posted on 2016-11-12 10:00:36
I'm the first reply; everyone below, get in line and follow.
