
Authors in a Markov matrix Part 2 (3) Experimental results: Which author do people find most inspiring?


In this article, I explain our implementation and what the adjacency matrices look like.

Implementation

We implemented the following programs:

  • Link_Vector_Extractor: Generate the author vector from the data.
  • Graph_Extractor: Generate the adjacency matrix from the data and the author vector.
  • Page_Rank: Compute page rank.
  • Remapper: Re-map the author vector according to the page rank result.
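As a rough illustration of the last two steps, here is a minimal sketch in Python. It is not our actual code: the function names `page_rank` and `remap`, the damping factor 0.85, the convergence tolerance, and the uniform handling of pages without outgoing links are all assumptions made for this example.

```python
def page_rank(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on a 0/1 adjacency matrix (list of lists).

    adjacency[i][j] == 1 means author page i links to author page j.
    damping, tol and max_iter are illustrative assumptions.
    """
    n = len(adjacency)
    out_degree = [sum(row) for row in adjacency]
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_degree[i] == 0:
                # Assumption: a page without outgoing links spreads its
                # rank uniformly so that the total rank stays 1.
                share = damping * rank[i] / n
                for j in range(n):
                    new_rank[j] += share
            else:
                share = damping * rank[i] / out_degree[i]
                for j in range(n):
                    if adjacency[i][j]:
                        new_rank[j] += share
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


def remap(author_vector, rank):
    """Reorder the author vector by descending rank (hypothetical helper)."""
    order = sorted(range(len(author_vector)),
                   key=lambda i: rank[i], reverse=True)
    return [(author_vector[i], rank[i]) for i in order]


if __name__ == "__main__":
    authors = ["A", "B", "C"]
    # A -> C, B -> C, C -> A
    adjacency = [[0, 0, 1],
                 [0, 0, 1],
                 [1, 0, 0]]
    for name, value in remap(authors, page_rank(adjacency)):
        print(name, round(value, 4))
```

Only the order of the authors matters for our purpose, so the sketch returns the authors sorted by rank together with the raw values.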

We ran our experiments on a computer with an Intel(R) Core(TM) DUO CPU P8400 (2 cores) running 64-bit Linux 3.2.0.32 (Kubuntu 12.04). We used the following software: Python 2.7.3, Beautiful Soup 4.0.2, MATLAB R2006a, and Octave 3.2.4.

Adjacency matrix

Figures 2, 3, and 4 show the adjacency matrices. In these figures, each blue point represents a connection.
Figure 2: Adjacency matrices. Top to bottom: German authors in de.wikipedia.org, en.wikipedia.org, ja.wikipedia.org.

Figure 3: Adjacency matrices. Top to bottom: English authors in de.wikipedia.org, en.wikipedia.org, ja.wikipedia.org.
Figure 4: Adjacency matrices. Top to bottom: Japanese authors in de.wikipedia.org, en.wikipedia.org, ja.wikipedia.org.
In Figure 2, the matrix for German authors in en.wikipedia.org shows a regular pattern. We haven't checked exactly what causes it, but we think it is the same problem as the template bias, which will be discussed later (Note 1). The German authors in en.wikipedia.org have another peculiarity: the number of links is higher than in the other Wikipedia data, as shown in Table 2. The number of links we count is the number of links connected to entries of the author vector, not all the links on the page, since only these are connections in the adjacency matrix. For example, we didn't count links to pages in other languages, self links, or links to authors that are not in the author vector. Likewise, if a page links to a person who is not an author, we didn't count that link.
Table 2: Matrix size, number of non-zero elements, and the average number of links between authors. Wiki "en" means en.wikipedia.org.
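The counting rule above can be sketched as follows. This is an illustration only, not our Graph_Extractor: the names `build_adjacency` and `average_links`, the dictionary input format, and the sample page titles are assumptions. I also assume here that the average number of links is the number of non-zero elements divided by the matrix size.

```python
def build_adjacency(author_vector, page_links):
    """Build a 0/1 adjacency matrix over the author vector.

    page_links maps a page title to the list of link targets found on
    that page (hypothetical input format). Only links whose target is
    in the author vector are kept; self links are skipped, and repeated
    links count once.
    """
    index = {name: i for i, name in enumerate(author_vector)}
    n = len(author_vector)
    adjacency = [[0] * n for _ in range(n)]
    for source, targets in page_links.items():
        if source not in index:
            continue  # page is not an author page
        i = index[source]
        for target in targets:
            j = index.get(target)
            if j is None or j == i:
                continue  # non-author target or self link
            adjacency[i][j] = 1
    return adjacency


def average_links(adjacency):
    """Non-zero elements divided by matrix size (assumed definition)."""
    n = len(adjacency)
    nonzero = sum(sum(row) for row in adjacency)
    return nonzero / n if n else 0.0


if __name__ == "__main__":
    authors = ["Goethe", "Schiller", "Kafka"]
    links = {
        "Goethe": ["Schiller", "Goethe", "Weimar", "de:Goethe"],
        "Schiller": ["Goethe", "Goethe"],
        "Kafka": ["Prague"],
    }
    matrix = build_adjacency(authors, links)
    print(matrix)
    print(average_links(matrix))
```

In this toy input, the links to "Weimar", "Prague", and "de:Goethe" are dropped because they are not in the author vector, and the self link on the Goethe page is ignored.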

We removed rank sink nodes according to the PageRank algorithm [1]. We removed the links related to the rank sink node pages, and therefore we also removed the so-called dangling links (Section 2.7 in [2]), since the links referring to a single-node rank sink are dangling links. Additionally, we removed pages that have only outgoing links.

The original PageRank paper [2] does not mention nodes that have only outgoing links, although such nodes don't make sense in the PageRank concept. However, it is easy to imagine why the paper didn't mention them. Since the original paper is about the web, it is hard to determine that a node is not linked from any page: to determine this, all pages would have to be examined, which is practically impossible on the web. It is therefore natural that the original paper didn't consider these outgoing-link-only nodes. Our author vector, however, has a limited size and we know all valid pages, so we can remove these nodes.

The link normalization value in the PageRank calculation depends on whether these links are considered or not, so removing them may affect the absolute value of the PageRank. However, we are not interested in the absolute value. We are interested in the relative value, that is, which author's influence is larger. The PageRank paper also mentions that this normalization does not have a large effect, since its authors are also interested in the relative influence.

This operation reduces the matrix size. The resulting sizes are shown in Table 3.
Table 3: Adjacency matrices: original size, reduced size, and rank.
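The node removal described above can be sketched as follows. This is a simplified illustration under stated assumptions, not our actual program: the name `prune_nodes` is hypothetical, only single-node sinks (pages without outgoing links) and pages without incoming links are handled, and the removal is repeated until no such node remains, which is an assumption of this sketch.

```python
def prune_nodes(adjacency, labels):
    """Drop nodes that have no outgoing or no incoming links.

    adjacency is a 0/1 matrix (list of lists), labels the matching
    author names. Removing a node can turn a neighbour into a new
    sink or source, so the pass is repeated until nothing changes
    (assumption of this sketch).
    """
    keep = list(range(len(adjacency)))
    while True:
        kept = []
        for i in keep:
            # Degrees are counted only among the nodes still kept.
            out_deg = sum(adjacency[i][j] for j in keep if j != i)
            in_deg = sum(adjacency[j][i] for j in keep if j != i)
            if out_deg > 0 and in_deg > 0:
                kept.append(i)
        if len(kept) == len(keep):
            break
        keep = kept
    reduced = [[adjacency[i][j] for j in keep] for i in keep]
    return reduced, [labels[i] for i in keep]


if __name__ == "__main__":
    labels = ["A", "B", "C", "D"]
    # A -> B, B -> A, B -> C (C has no outgoing links),
    # D -> A (D has no incoming links)
    adjacency = [[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 0, 0, 0],
                 [1, 0, 0, 0]]
    reduced, remaining = prune_nodes(adjacency, labels)
    print(remaining)
    print(reduced)
```

In the toy graph, C is removed as a sink and D as an outgoing-link-only node, which leaves the 2 by 2 matrix for A and B.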

Finally, in the next article, I will show who the most important author is (in a Wikipedia sense).

Note 1: When I wrote this article on my blog, my friend Joerg M. pointed out this possibility (2012-12-29). Thanks.
