2013-01-03

Authors in a Markov matrix Part 2 (11): Appendix


Appendix A: Unicode and Python 2.7.x


This time I developed Python programs, using Python 2.7.3. Handling Unicode was needed to process the web pages, not only the Japanese and German pages but also the English ones, because some of the English authors' names contain accented characters. In the early development stage, I was bothered by UnicodeDecodeError and UnicodeEncodeError exceptions. Here I will explain what they are, why they are raised, and how to handle them.

How does Unicode encode characters?


As far as I understand, Unicode uses two maps to encode characters. I hadn't known this until I worked on this research. My understanding was that there are many kinds of Unicode, like UTF-8, UTF-16, and UTF-32. But this was a misunderstanding. Unicode is the coding system that defines how characters are encoded, and UTF-8 defines how to store the Unicode data. UTF-8 is one of the mapping methods, or transformation formats; UTF-8 itself is not Unicode (the universal character set). This is cumbersome.

  • Unicode: a map from numbers to characters
  • UTF-X:   a map from Unicode code points to a specific byte representation

Unicode itself defines a map from numbers to character descriptions. This is a bijection. For example, 0x0061 'a'; LATIN SMALL LETTER A is an entry of the map. In this example, the number 0x0061 is called the ``code point,'' and the description of this number is `` 'a'; LATIN SMALL LETTER A.'' Using a map from the description to a font, we can see the letter `a'. The rendered shape of the description is called a glyph. Since Unicode's map is a bijection, we can also say that `a' maps to the code point 0x0061.

This mapping between descriptions and code points is Unicode: a character is represented by a number. However, the code points themselves are usually not what gets stored. Here ``usually'' means that a Unicode-encoded text usually doesn't contain the raw code points. In most cases, the code points are converted to a UTF-X (UCS (Universal Character Set) Transformation Format X). There are several UTFs, for example, UTF-8 and UTF-16 with endian information. In most cases, the Unicode-encoded text is converted to one of the UTFs, and this UTF binary is what is saved to your disk.

This conversion is so common that I misunderstood that there are many Unicodes, which I thought odd, since ``uni'' means ``one.'' The concept of Unicode as I understood it was that you can use all the characters, no matter which language you are writing in; you can even mix characters from any languages. If there were several kinds of Unicode, this coding system wouldn't make any sense. But that understanding was wrong: Unicode itself is one map; it is only when you store or transmit it that there are several formats, the UTF-X formats. Why is this so complicated? Keeping all the characters in the world needs some space. A naive encoding would mean that if you switch from ASCII to Unicode, your file size suddenly becomes four times larger. This is a dilemma: you want to have a big character set, but you don't want your files to become larger. To resolve this dilemma, the second mapping, UTF-X, was introduced.

I learned all this from [3].
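
To make the two maps concrete, here is a small Python 2.7 sketch of my own (not part of the original programs): unichr() and ord() walk the code point map, while encode() applies a UTF-X transformation format.

# Map 1 (Unicode): code point <-> character
print unichr(0x0061)                  # -> a
print hex(ord(u'a'))                  # -> 0x61
# Map 2 (UTF-X): code points -> bytes; the same character can
# take a different number of bytes in different formats.
print repr(u'a'.encode('utf-8'))      # -> 'a' (1 byte)
print repr(u'\xe4'.encode('utf-8'))   # a-umlaut -> '\xc3\xa4' (2 bytes)
print repr(u'a'.encode('utf-32'))     # 4 bytes per character plus a BOM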

Python 2.7.x's Unicode representation


Python 2.7.x has two built-in data types for representing strings: the unicode type and the str type. Both data types can hold printable strings; however, str is more suitable for ASCII characters, even though it can hold any 8-bit binary data. Each data type has an encoding or decoding method to convert between them [3]. However, this conversion sometimes causes a problem when you print unicode data. Figure 8 shows the relationship between the str type, the unicode type, and the encode/decode methods.

Figure 8: Relationship between Unicode type and 8-bit str type in Python 2.7.x.

The unicode type of Python 2.7.x has an encode() method and the str type has a decode() method. We can convert each type to the other via these methods. However, an encoding cannot be applied to every byte sequence, since some byte sequences are not valid in some encodings. For example, the `ascii' codec fails when a byte's 8th bit is set. When we specify the error handling method `strict' for an encoding, the encoding or decoding method may raise an exception: UnicodeEncodeError can be raised by the encode method, and UnicodeDecodeError by the decode method. This is a bit cumbersome, so let me show you some examples.

First we define a unicode type string.
uc = u'Wächter'
print type(uc)
-> <type 'unicode'>

Let's encode this to a str type with the `utf-8' codec.
s = uc.encode('utf-8', 'ignore')
print type(s)
-> <type 'str'>

Python 2.7's print statement accepts the str type by default, but not the unicode type; therefore, unicode data given to the print statement will first be encoded.
print uc
-> UnicodeEncodeError: 'ascii' codec can't encode character
   u'\xe4' in position 1: ordinal not in range(128)
uc contains a character that is invalid in the ASCII codec, therefore an exception is raised. Please note that the error is an encoding error.

However, the encode method has an option to ignore encoding errors.
print uc.encode('utf-8', 'ignore')
-> Wächter
If your terminal accepts the UTF-8 encoding, you can see the Unicode character.

There is a more complicated case, in which an encoded str type value is implicitly decoded back, as in the following:
print u'{0}'.format(uc.encode('utf-8', 'ignore'))
-> UnicodeDecodeError: 'ascii'
   codec can't decode byte 0xc3 in
   position 1: ordinal not in range(128)
Here, the format method receives str type data, but this is the unicode type's format method, so it accepts only unicode data. uc.encode generates str type data, which doesn't fit the unicode format method; therefore, a decode is implicitly performed to produce unicode data before the format method runs. This implicit decode uses the `ascii' codec and cannot handle the non-ASCII bytes, so the UnicodeDecodeError exception is raised. I was puzzled why this is not an encode error, since it seems that only the encode method is called. But actually there is a hidden, additional decode call in this code. To avoid this, we can encode after the format method is called, as follows:
print u'{0}'.format(uc).encode('utf-8', 'ignore')
-> Wächter
This is a subtle issue; however, we cannot ignore it when writing code that handles Unicode.
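
A general defensive pattern, shown below as a minimal sketch of my own (the helper names to_unicode and uprint are illustrative, not from this project's code), is to keep all string processing in the unicode type and to encode explicitly only at the output boundary:

def to_unicode(s, encoding='utf-8'):
    # Decode str to unicode; pass unicode through unchanged.
    if isinstance(s, str):
        return s.decode(encoding, 'replace')
    return s

def uprint(s, encoding='utf-8'):
    # Encode explicitly before printing, so print's implicit
    # `ascii' encoding never happens.
    print to_unicode(s).encode(encoding, 'ignore')

uprint(u'W\xe4chter')                  # unicode input (same string as uc)
uprint(u'W\xe4chter'.encode('utf-8'))  # str input is decoded first

With this pattern, the implicit `ascii' decode and encode calls described above never get a chance to run.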


Appendix B: Contribution to Wikipedia


We found some mistakes in Wikipedia's author lists as a side effect of this experiment, and we contributed updates back to Wikipedia.

We needed to generate an adjacency matrix and perform an eigenanalysis on it. This analysis requires the independence of the eigenvectors. However, it is almost impossible to get such a well-behaved matrix in our problem setting, because a few problems are hard to avoid: a page that doesn't link to any other author, a page that has no incoming link from other pages, and duplicated links on the root page. The PageRank algorithm expects this kind of singularity in the adjacency matrix and gave us a solution to this issue. In our problem setting, we can easily detect the last issue, the link duplication on the root page.
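
As an illustration of how the singularity is handled, here is a minimal sketch of the damped PageRank power iteration (my own sketch with illustrative names, not the original program). A dangling page, one with no outgoing links, gets a uniform column, and the damping term removes the remaining singularity:

import numpy as np

def pagerank(A, d=0.85, iters=100):
    # A[i, j] = 1.0 if page j links to page i, else 0.0.
    n = A.shape[0]
    col_sum = A.sum(axis=0)
    # Column-normalize; a dangling page (all-zero column) becomes
    # a uniform column of 1/n.
    M = np.where(col_sum > 0, A / np.maximum(col_sum, 1.0), 1.0 / n)
    r = np.ones(n) / n
    for _ in range(iters):
        # With probability d follow a link, otherwise jump to a
        # random page; the random jump removes the singularity.
        r = d * np.dot(M, r) + (1.0 - d) / n
    return r

A = np.array([[0, 1, 0],
              [1, 0, 0],
              [1, 1, 0]], dtype=float)  # page 2 has no outgoing links
print pagerank(A)

The damping factor d = 0.85 is the value commonly used in the PageRank literature.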

I am happy to contribute to Wikipedia.


References

[2] Brené Brown, The power of vulnerability,
http://www.ted.com/talks/brene_brown_on_vulnerability.html

[3] Python documentation 2.7.3, Unicode HOWTO, http://docs.python.org/2/howto/unicode.html


Authors in a Markov matrix Part 2 (10) Experimental results: Which author do people find most inspiring?


Conclusion


To find out which author people find most inspiring, we used the link structure of Wikipedia. First we extracted the link structure and created the adjacency matrix; then we applied an eigenanalysis method, also called PageRank, to answer the question. We showed the results for German, English, and Japanese authors. We also compared the same category (authors) between different data sources, i.e., Wikipedias in different languages, and we can see interesting similarities and also differences. Personally, one of the authors was surprised that Winston Churchill and Isaac Newton have high ranking scores; he didn't know that Winston Churchill is a winner of the Nobel Prize in Literature.


Computational literature


Recently, I have been using a mathematical, or information-scientific, approach to understand literature and languages. This approach has a huge limitation, but on the other hand, it gives me some measurable values. Brené Brown said in her TED talk [2], ``Maybe stories are just data with a soul.'' Maybe so. And I think the soul can cast a vague shadow on the data. I agree that we cannot reconstruct the soul now. However, even though reading a book is just an act of reading a symbol sequence -- reading a data sequence -- I know my soul can be moved by that act. I want to see a footprint of the soul in the data. This article is one such trial. I don't know what to call this approach, therefore, tentatively, I call it ``computational literature,'' until I find a better name.

Future work


We summarize the future work:


  • Is there any bias introduced by the Wikipedia writers? (Ditger v A.)
  • How can we avoid the category problem? How can we automate the data collection?
  • Apply other graph analysis methods. We only applied the eigenanalysis (PageRank) in this article.
  • We saw that the adjacency matrix has some notable properties (e.g., it is not full rank). We could look more deeply into the graph structure using tools from graph theory.
  • It would be interesting to apply this method to authors in other languages.
  • This method is not limited to authors. We can apply it to other areas, for example, actors, musicians, politicians, mathematicians, and so on.


This was a relatively large project for Sunday research; it took almost half a year. But it was fun.


Acknowledgments


I thank all the friends who gave me a lot of useful comments at lunch time. Thanks to Andy K. for checking part of my English in part 1. I thank Rebecca M., who first asked me the question; this project would not exist if she hadn't asked it.

2013-01-02

Authors in a Markov matrix Part 2 (9) Experimental results: Which author do people find most inspiring?

This time is a follow-up discussion of the results.

No link found problem


We have the impression that a fair number of Japanese author links have no reference page in German Wikipedia. We didn't check the exact numbers, but while we debugged the program, we looked into several pages. A typical no-link-reference case is, for instance, a page about 良寛 (Ryōkan) linking to Ryokan, or Sōseki linking to Soseki, and so on. These special characters are often omitted, which causes the link reference not to be found.
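
One possible workaround, which we did not implement in this project, would be to normalize the diacritics away before matching link targets. A minimal Python 2.7 sketch:

import unicodedata

def strip_diacritics(u):
    # Decompose accented characters (NFKD), then drop the
    # combining marks, so that u'S\u014dseki' becomes u'Soseki'.
    nfkd = unicodedata.normalize('NFKD', u)
    return u''.join(c for c in nfkd if not unicodedata.combining(c))

print strip_diacritics(u'S\u014dseki')   # -> Soseki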

Cross reference between Wikipedias


It was relatively easy to make a cross-reference list between the English and German Wikipedia results, since these Wikipedias share how author names are written, i.e., in the Latin character set. However, Japanese Wikipedia uses Japanese characters for the authors' names. For example, Lewis Carroll is ルイス・キャロル in Japanese Wikipedia. Japanese Wikipedia has the information in Latin characters as well, but the Wiki page keys are all in Japanese. To make a cross-reference table, we need a map from Japanese-written names to Latin-written names. We could not find an easy way to do that this time; therefore, there is no cross reference between the English and Japanese results, or between the German and Japanese results. This is also future work.

Correlating with other data


We had some discussions with our friends about these results, and they had some interesting questions. In particular, they were interested in correlating the results with other data:

  • Correlating Nobel Prize winners with the PageRank results
  • Is there any correlation between Wikipedia's writers and the PageRank results? For example, if a few specific Wikipedia writers actively write the articles, do these writers bias the PageRank results?

Johann Wolfgang von Goethe is 10th in Japanese Wiki


The rank of Johann Wolfgang von Goethe is 10th in Japanese Wikipedia. This was unexpectedly low for us. However, the total number of Japanese Wikipedia pages that can form a valid graph is only 31. This number is too low, and a slightly different link structure may change the result. By the way, the top-ranked German writer in Japanese Wikipedia is Gerhart Hauptmann.

This has been a long article, but now we are close to the end. Next time, I will present the conclusion of this theme.

Authors in a Markov matrix Part 2 (8) Experimental results: Which author do people find most inspiring?


Wikipedia's Category problem

The category problem here is: we expect a specific category to contain certain authors, but the actual Wikipedia category doesn't have the authors we expected. This causes some data to be missing. There are three interesting cases, presented in the following subsections. We didn't do any additional processing for this problem. For example, Shakespeare does not appear as an English author in the Japanese Wikipedia; since we did nothing about this, there is no Shakespeare in the English author rank table for Japanese Wikipedia in our results.

We tried to obtain the data as automatically as possible, since this is just our Sunday hobby research project, and we didn't spend much time fine-tuning for these problems. But the results are not intuitive (e.g., Shakespeare is not an English author in Japanese Wikipedia), so how to automatically fill this gap between Wikipedia's categorization and our intuition is future work.

No Shakespeare in the Japanese Wikipedia result

The rank of Shakespeare is the best in both German Wikipedia and English Wikipedia. However, the Japanese Wikipedia result doesn't have Shakespeare. Actually, Japanese Wikipedia has a category called ``Shakespeare,'' and it is at the same level as the English authors category. That level contains the following categories in Japanese Wikipedia (as of 2012-11-19), and they are not classified as English authors. Figure 7 shows this page.

Figure 7: The category of English authors page in ja.wikipedia.org as of 2012-11-19.

  • English authors (which has an item: The list of English authors)
  • H. G. Wells
  • William Shakespeare
  • George Bernard Shaw
  • Lord Byron
  • William Blake
  • Oscar Wilde
These authors and the category ``English authors'' are at the same level in the category hierarchy; therefore, Wells, Shakespeare, Shaw, Byron, Blake, and Wilde don't exist in the list of English authors. This is a property of Japanese Wikipedia only; the other language Wikipedias don't have this problem. The problem was that we assumed the list of English authors would include Shakespeare and those other authors. We thought this assumption was reasonable when we started this research.

No Shiki Masaoka in the Japanese Wikipedia result

Shiki Masaoka doesn't exist in the Japanese Wikipedia result. Shiki is under the Japanese 歌人・俳人 (Kajin/Haijin, i.e., tanka and haiku poets) category and not in the Japanese authors category. Therefore, Japanese Kajin and Haijin are not listed in this research. We found this when we compared the first results between the different Wikipedias. This is a good example of how effective the comparison between Wikipedias is.

Not available in other Wikipedia problem

Some Wikipedias categorize authors depending on which language they wrote in rather than which country they lived in. For instance, German Wikipedia has ``the list of British authors,'' but English Wikipedia only has ``the list of English writers.'' The latter lists the authors who wrote their books in English, so it also includes authors from America, Australia, and other English-speaking countries. As a result, the comparison between different language Wikipedias is not well defined.

There is another factor that makes the comparison difficult: the size of the list of authors depends strongly on the language of the Wikipedia. For instance, the list of German authors in German Wikipedia has 5975 entries, while the list of German authors in Japanese Wikipedia has only 136 entries.

Tables 7 and 8 show the comparison of the PageRank results between German Wikipedia and English Wikipedia. There are n.a. (not available in the other Wikipedia) entries in both tables; these entries show the problem. Table 8 has 16 n.a. entries out of 40. This means these authors are listed in German Wikipedia as British writers, but they are not listed in English Wikipedia as English writers.

We would like to continue with some other interesting issues in the next article.