1. Jon Kleinberg: Authoritative Sources in a Hyperlinked Environment, JACM, 1999
Links: Link Analysis Algorithms: The HITS Algorithm
In this paper, the author presents a new search mechanism that was later named the HITS algorithm. HITS mainly addresses the problem that when a user queries a broad topic, far too many results are returned and most of them are not relevant enough. Compared with purely text-based search, HITS is a search algorithm based on both textual matching and link analysis.
The paper first introduces three kinds of queries: specific queries, broad-topic queries, and similar-page queries. The issue with specific queries is that they are so narrow that only a small number of pages satisfy them, which makes them hard for a search engine to answer. Broad-topic queries have the opposite problem: there are far too many matching results, which lowers the efficiency and quality of the search, and textual relevance often does not match actual popularity or quality. A search engine based only on textual matching may therefore give the user many useless results. The paper proposes two new notions of web pages: hubs and authorities. An authority is a page that contains the keywords the user is searching for and also has high relevance and a strong reputation; for example, when searching "Honda", Honda's official website should be the top authority. A hub is a page that contains links pointing to multiple relevant authorities; a search engine's results page, such as one from Google, can be regarded as a hub. A good hub is a page that points to many good authorities, and a good authority is a page that is pointed to by many good hubs. The HITS algorithm is built on this mutually reinforcing relationship.
The first step of HITS is to construct a focused subgraph of the web on which to run the algorithm; this set is much smaller than the original web-wide collection. HITS extracts the subgraph as follows: (1) first use a traditional textual search to obtain a fixed number of results as the root set, which is small and already contains many strong authorities; (2) then expand the root set into a base set by adding both the pages that the root pages point to and a bounded number of pages that point to each root page; (3) finally, drop intrinsic links (links between pages on the same domain) and keep only transverse links (links between different domains). This method yields a much smaller base collection, which makes the heavy link-analysis computation of HITS feasible.
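A minimal sketch of this base-set construction (my own illustration in Python, not code from the paper): the toy link graph and the hand-picked root set below are invented, and in reality they would come from a crawler's link index and a text search engine; d stands in for the paper's bound on in-links per root page.

```python
# Toy link graph: page -> pages it links to (invented data for illustration).
LINKS = {
    "a.com/1": {"b.com/x", "a.com/2"},
    "b.com/x": {"c.com/y"},
    "c.com/y": set(),
    "d.com/z": {"b.com/x"},
    "a.com/2": set(),
}

def domain(url):
    return url.split("/")[0]

def build_base_set(root, d=50):
    """Expand a textual-search root set into the base subgraph for HITS."""
    base = set(root)
    for page in root:
        base.update(LINKS.get(page, set()))                       # pages the root page points to
        in_links = [p for p, outs in LINKS.items() if page in outs]
        base.update(in_links[:d])                                 # up to d pages pointing to the root page
    # keep only transverse links (across domains); intrinsic links
    # within the same domain are dropped
    edges = {(p, q) for p in base for q in LINKS.get(p, set())
             if q in base and domain(p) != domain(q)}
    return base, edges

# Root set = top textual-search results (here simply chosen by hand).
print(build_base_set({"a.com/1", "b.com/x"}))
```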
The second step is to compute the authority weights and hub weights iteratively. A page's authority weight is the sum of the hub weights of all pages that point to it, and a page's hub weight is the sum of the authority weights of all pages it points to. So a hub that points to many good authorities receives a high hub weight, and symmetrically for authorities. After each update the hub and authority weights are normalized. Repeating this for every page in the subgraph, the weights converge to a fixed point, so every page ends up with an evaluation value, and the pages are finally returned to the user ranked by authority weight. In effect, this procedure pulls the subgraph toward a bipartite structure, with hubs on one side and authorities on the other, so HITS separates hubs from authorities; the paper also proves that the iteration converges. HITS can likewise be applied to the third kind of query, the similar-page search, where it returns better results than purely textual search.
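The weight iteration itself can be sketched as follows (my own numpy illustration, not the paper's implementation); the adjacency matrix is assumed to describe only the base set built above, and the number of iterations is an arbitrary choice that is usually more than enough for convergence.

```python
import numpy as np

def hits(adjacency, iters=50):
    """Iterate the hub/authority updates on the base-set adjacency matrix.

    adjacency[i, j] = 1 means page i links to page j.
    """
    n = adjacency.shape[0]
    auth = np.ones(n)
    hub = np.ones(n)
    for _ in range(iters):
        auth = adjacency.T @ hub      # authority weight = sum of hub weights of pages pointing to it
        hub = adjacency @ auth        # hub weight = sum of authority weights of pages it points to
        auth /= np.linalg.norm(auth)  # normalize so the weights converge
        hub /= np.linalg.norm(hub)
    return hub, auth

# Toy graph: pages 0 and 1 both link to pages 2 and 3.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
hub, auth = hits(A)
print(np.argsort(-auth))              # pages ranked by authority weight
```

In this toy graph, pages 0 and 1 come out as the hubs and pages 2 and 3 as the authorities, which is exactly the bipartite separation described above.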
Moreover, since HITS essentially classifies the nodes of a directed graph into hubs and authorities, it can be applied to any other domain that can be modeled as a directed graph; the author mentions several such areas, including social networks and publication citation networks.
This paper taught me about link-based search, which draws on the subjective endorsements encoded in links, in contrast to objective textual matching. It also gave me a better understanding of the web: everything on the web can be considered information, and search can be based not only on the actual text but also on the relationships between pages.
2. Scott Deerwester et al.: Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, No. 6, 1990
Links: Latent Semantic Analysis note, Singular Value Decomposition
This paper presents a new mechanism called Latent Semantic Analysis (LSA) for studying the relationships between documents and applying them to information-retrieval queries. Unlike traditional analysis of the text itself, LSA studies the latent relationships among index terms across all documents. This method returns more reliable results when the user's query has little overlap with the original index terms, and it also reduces noise.
The paper first states the drawbacks of traditional retrieval methods. The major issue is that synonymy and polysemy exist: the surface meaning the computer sees is sometimes different from what a human actually means, so latent semantic relationships become more reliable and important for helping the computer understand human text. Current search engines do not index every word in the real world, and they handle synonymy and polysemy poorly over a large collection of documents; the paper shows an example demonstrating that the current retrieval mechanism produces many mismatches. The aim of LSA is to understand the latent topic the user is searching for and to exclude cases where the words match but the meaning does not.
LSA uses two-mode factor analysis as the representation model for both documents and queries. Two-mode factor analysis meets three requirements: (1) its representational power is adjustable, (2) it gives an explicit representation of both terms and documents, and (3) it is computable for large databases. The latent analysis itself is based on the singular value decomposition (SVD). SVD decomposes a (generally non-square) matrix into the product of three sub-matrices, with the singular values of the M×N matrix on the diagonal of the middle sub-matrix. The singular values capture the dominant structure of the original matrix, and LSA works by modifying and analyzing them in the original representational matrix.
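As a small concrete illustration of this factorization (the counts below are invented, not the paper's example), numpy's SVD returns exactly these three factors:

```python
import numpy as np

# Term-document matrix X (rows = terms, columns = documents); invented counts.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])

T, s, Dt = np.linalg.svd(X, full_matrices=False)   # X = T @ diag(s) @ Dt
print(s)                                           # singular values, largest first
print(np.allclose(T @ np.diag(s) @ Dt, X))         # True: the three factors reproduce X
```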
The procedure of LSA is as follows. (1) Construct the term-document matrix from the document collection: the terms are the frequently used terms, and each entry records how often a term appears in each document. Some of the terms in the collection are synonyms or closely related, such as "computer" and "system", or "human" and "user". These pairs follow the assumption of LSA: if a pair of terms tends to appear, or not appear, together across a collection of documents, the two terms are considered latently related; "computer" and "system" follow this pattern in 6 of the 9 documents in the paper's example. We can also quantify the similarity between two terms by taking the dot product of their row vectors. (2) Apply SVD to the term-document matrix, discard the smallest singular values in the middle sub-matrix (keeping only the k largest), and shrink the other two sub-matrices accordingly. Removing the small singular values removes noise, i.e., interference in the original matrix. Multiplying the reduced sub-matrices back together reconstructs a new term-document matrix in which latent relationships such as "human" and "user" are strengthened: in the column vector for document C2, even though C2 does not contain "human", it still receives a weight of 0.40 for "human" because it contains "user". At this point we have the latent semantic matrix and can apply the user's query to it. (3) Take the user's query and convert the query string into the same representation as the term-document matrix over the index terms, by multiplying the vector of term frequencies in the query by the first two sub-matrices obtained from the SVD of the original term-document matrix. (4) Finally, compute the angle between the converted query vector and each column vector (each document) of the new term-document matrix, and return the N documents with the smallest angles as the search results.
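Below is a compact sketch of these four steps, using a toy term-document matrix with invented counts and one common folding convention, q_k = qᵀ T_k S_k⁻¹, for step (3); it is only an illustration of the procedure, not the paper's code or data.

```python
import numpy as np

def lsa_rank(X, query_counts, k=2, top_n=3):
    """Rank documents for a query in the k-dimensional latent space.

    X is terms x documents; query_counts is a term-frequency vector
    over the same term vocabulary.
    """
    # (2) SVD of the term-document matrix, keeping only the k largest singular values
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T     # rows of Dk = documents in k dimensions
    # (3) fold the query into the same k-dimensional space: q_k = q^T T_k S_k^-1
    qk = query_counts @ Tk @ np.linalg.inv(Sk)
    # (4) rank documents by the cosine of the angle between qk and each document vector
    cos = (Dk @ qk) / (np.linalg.norm(Dk, axis=1) * np.linalg.norm(qk) + 1e-12)
    return np.argsort(-cos)[:top_n]

# (1) Toy term-document matrix (terms: human, user, computer, system; 4 documents).
X = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)
q = np.array([1, 0, 0, 0], dtype=float)   # a query containing only "human"
print(lsa_rank(X, q, k=2))                # indices of the top-ranked documents
```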
From this paper I learned about latent semantic analysis, which can reveal the potential relationships between documents. It helps a search engine return better, more user-friendly results, and it gave me a deeper understanding of text analysis.