
Authors in a Markov matrix Part 2 (1) Experimental results: Which author do people find most inspiring?


This is Part 2 of the article: experimental results. In the previous article, I discussed the question, ``Which author do people find most inspiring?'' From here on, I would like to discuss an answer.

Analyzing relationships between authors

Author graph generation method

We apply eigenanalysis to Japanese, English, and German authors to find out which author people find most inspiring in the literature, in the sense of author-network topology. First we need to generate an author graph that represents the relationships between authors. Of course we could build such a graph by hand, i.e., by researching many documents about authors. However, the number of famous Japanese authors is probably more than 1000. This is just our Sunday hobby project, and we don't have enough time for that.

Fortunately, nowadays we can use crowd knowledge. The natural solution seems to be to use the information in Wikipedia. We can generate an adjacency matrix from Wikipedia's link structure, then apply eigenanalysis to analyze the relationships between authors.
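As a minimal sketch of this construction step, assuming the link data has already been extracted from a Wikipedia dump (the author names and link lists below are invented placeholders, not our real data):

```python
import numpy as np

# Hypothetical link table: author page -> set of author pages it links to.
# In the real experiment this would be extracted from a Wikipedia dump.
links = {
    "Soseki": ["Ogai", "Ryunosuke"],
    "Ogai": ["Soseki"],
    "Ryunosuke": ["Soseki", "Ogai"],
}

authors = sorted(links)                          # fix an ordering of the nodes
index = {name: i for i, name in enumerate(authors)}

# Adjacency matrix: A[i, j] = 1 if author j's page links to author i's page.
A = np.zeros((len(authors), len(authors)))
for src, targets in links.items():
    for dst in targets:
        A[index[dst], index[src]] = 1.0

print(A)
```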

Assumption of this experiment

We assume that the link structure of author pages in Wikipedia represents the relationships between authors. This is a debatable assumption. It takes us back to the question from Part 1 of this article: ``What is the relationship between authors?'' We define the relationships between authors to be those given by the link structure of Wikipedia. Our intuition behind this assumption is: when a Wikipedia writer made a link between two author pages, the writer presumably thought there was some relationship between those authors. If this assumption is not accepted, the following experiment has no meaning, so we sometimes say ``in the sense of the Wikipedia link structure, ...'' in this article. So far, we believe this is a good method for finding the relationships between authors, and we don't have a better idea for tackling the problem. When a better method is found, we can revisit this assumption.

Based on this assumption, we construct an adjacency matrix from the link structure of Wikipedia and analyze it by eigenanalysis.
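The eigenanalysis step could be sketched as follows, assuming a tiny toy adjacency matrix rather than the real author graph. Normalizing each column to sum to 1 turns the adjacency matrix into a Markov matrix, and power iteration then finds the eigenvector of its dominant eigenvalue (which is 1 for a Markov matrix):

```python
import numpy as np

# Toy adjacency matrix: A[i, j] = 1 if page j links to page i.
A = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

# Normalize each column so it sums to 1; M is then a Markov matrix.
M = A / A.sum(axis=0, keepdims=True)

# Power iteration: repeatedly applying M drives the iterate toward the
# dominant eigenvector, whose entries we read as "importance" scores.
v = np.full(A.shape[0], 1.0 / A.shape[0])
for _ in range(100):
    v = M @ v
    v /= v.sum()          # keep the iterate normalized as a distribution

print(v)
```

Note this assumes every column has at least one link; pages with no outgoing links would need special handling (as in the PageRank formulation) to avoid dividing by zero.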

The advantage and disadvantage of this method are:

Advantage:


  1. Data size: we can use a relatively large amount of digital data
  2. Correctness: Wikipedia pages are public and have undergone some review
  3. Quality: we can expect the link structure to carry some meaning, since the pages are written by humans

Disadvantage:


  1. Error possibility: there could be errors in the link structure
  2. Wikipedia writer bias: some writers may introduce bias depending on their preferences
  3. Wikipedia edit-guideline bias: Wikipedia's editing guidelines may cause some kind of bias

The most attractive advantage for us is the availability of large data. If we tried to construct an adjacency matrix of Japanese authors by hand, we would need to read a huge amount of literature and extract the relationships; or, if we were fortunate, we might find a book describing author relationships, but we would still need to convert the data into a digitally processable form.

Disadvantage 1 cannot be avoided with any data source, though Wikipedia may have more errors than an academically reviewed source. Disadvantage 2 is part of the nature of Wikipedia; we cannot avoid this kind of bias. However, another property of Wikipedia is that more than one person writes a page, so we hope this bias is not severe. Disadvantage 3 needs some explanation: what is edit-guideline bias? The Wikipedia editing guidelines recommend adding certain specific links to a page, and this may cause bias; we will see such an example in the results section. Admittedly, defining bias is itself a difficult problem: even if we think a page is biased, others may see no bias there. We need to be careful about adjusting for bias, because such adjustment re-interprets the Wikipedia data; in a sense it is a filter imposed by the observer, and we might introduce our own bias while changing Wikipedia's link structure under the name of removing bias. Still, this is our hobby project, and as long as we document what operations we performed on the link-structure data, we think it is fine. Whenever we altered the link structure, we will mention it.

Here we can see that all the disadvantages are a kind of link-connection error. Such errors are hard to detect if we have only one data source. Of course, what the correct connections are is beyond the scope of this article; we have defined the connections to be those in Wikipedia. But do we really have only one data source? No, we have several: Wikipedia provides the same kind of data in its other language editions. For example, Japanese authors are also listed in the English Wikipedia. Of course, the Japanese-author data is richer in the Japanese Wikipedia than in other language editions, and many Japanese author pages in the English Wikipedia might simply be translated from the Japanese Wikipedia. This suggests that the data sets are not independent: if an English page were an exact translation of the corresponding Japanese page, it would contain the same link errors, and we could not detect them. However, as far as we can see, while these data sets are not totally independent, they are also not all exact translations; we observe only some dependency. We should be mindful of this dependency, but we think we can still use these data for validation to some extent.
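A cross-language check of the kind described above could be sketched like this (the two link tables are invented placeholders; real tables would be extracted from the Japanese and English Wikipedia dumps):

```python
# Hypothetical link tables for the same authors in two language editions.
links_ja = {"Soseki": {"Ogai", "Ryunosuke"}, "Ogai": {"Soseki"}}
links_en = {"Soseki": {"Ogai"}, "Ogai": {"Soseki", "Ryunosuke"}}

def edge_set(links):
    """Flatten a link table into a set of (source, target) edges."""
    return {(src, dst) for src, targets in links.items() for dst in targets}

ja, en = edge_set(links_ja), edge_set(links_en)

# Edges found in only one edition are candidates for link errors
# (or genuine differences between the editions).
only_ja = ja - en
only_en = en - ja
agreement = len(ja & en) / len(ja | en)   # Jaccard similarity of the editions

print(sorted(only_ja), sorted(only_en), agreement)
```

Of course, because the editions are partly translations of each other, agreement here does not prove correctness; it only flags the disagreements worth a manual look.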
