2012-02-09

gcc 4.5.x or higher: undefined reference link problem when a shared library refers to another shared library.


This goes rather deep into the details of gcc, but some developers might be interested in it.

Recently I switched to gcc/g++ 4.6.x and then ran into a linking problem with my C++ programs. Symbols were suddenly missing; even some system symbols (dlopen, ostream operators...) were reported as undefined. For example,
libutil.so: undefined reference to `dlopen'
libutil.so: undefined reference to `dlclose'
libutil.so: undefined reference to `dlerror'
libutil.so: undefined reference to `dlsym'
I tried several things: checking libdl.so, manually adding linker options, and so on, but nothing helped. Finally I found the http://wiki.debian.org/ToolChain/DSOLinking page.

gcc 4.5.x (or higher) passes the --as-needed linker option by default. This can give you missing symbols when your program links a shared library that in turn implicitly links another shared library. For package creation, this new default removes unnecessary dependencies, so it makes sense. However, it makes problems like this one hard to track down.

To solve this problem, add the --no-as-needed linker flag, e.g., pass '-Wl,--no-as-needed' in the compiler options.
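
To illustrate the situation, here is a hedged minimal example; the file names, function, and build commands are hypothetical, not taken from my actual project. A shared library libutil.so calls dlopen internally, and an executable that only uses libutil can fail to link under the new default:

    // util.cpp -- a hypothetical shared library that calls dlopen internally.
    // Build:  g++ -fPIC -shared util.cpp -o libutil.so
    #include <dlfcn.h>

    void* load_plugin(const char* path) {
        return dlopen(path, RTLD_NOW);   // this symbol lives in libdl
    }

    // main.cpp -- an executable that only uses libutil directly.
    // With the gcc 4.5.x (or higher) default, a link like
    //     g++ main.cpp -L. -lutil -ldl
    // can still fail with "libutil.so: undefined reference to `dlopen'".
    // The workaround that fixed it for me was to turn the new default off:
    //     g++ main.cpp -L. -lutil -Wl,--no-as-needed -ldl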

2012-02-05

Integer division. Is int(-1/2) 0 or -1?


My friend Christian told me a story about whether int(-1/2) is 0 or -1. Christian noticed the problem when he explained a binary search program: he computed the lower bound for an entry that is not in the array.

  mid = (left+right)/2

When left = -1 and right = 0, mid is 0 in C, C++, and Java, but -1 in Python and Ruby. I was also surprised that such simple looking code depends on the language.

The difference comes from the rounding method. C, C++, Java, and Emacs Lisp round toward zero (truncate), so the result is 0.

Python and Ruby round down (floor), so the result is -1. When we think about the modulo operation, both groups keep the identity

   x = (x/y)*y + x%y

but the sign of x%y differs: in Python and Ruby it follows the divisor, while in C, C++, and Java it follows the dividend. So the modulo operation also depends on the language when negative values are involved.
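
Here is a small check of the rounding difference (my own illustration, not Christian's original code). The C++ results are what C, C++, and Java give; the comments note what floor division gives in Python and Ruby for the same expressions:

    // division_sign.cpp -- integer division and modulo with a negative dividend.
    // Build and run:  g++ division_sign.cpp && ./a.out
    #include <iostream>

    int main() {
        int left = -1, right = 0;
        int mid = (left + right) / 2;                 // C99/C++11 require truncation toward zero
        std::cout << "mid    = " << mid      << "\n"; //  0 ; floor division (Python -1 // 2, Ruby -1 / 2) gives -1
        std::cout << "-1 / 2 = " << (-1 / 2) << "\n"; //  0 ; Python/Ruby: -1
        std::cout << "-1 % 2 = " << (-1 % 2) << "\n"; // -1 ; Python/Ruby:  1
        // The identity x = (x/y)*y + x%y holds in both groups; only the signs of the parts differ.
        std::cout << "(-1/2)*2 + (-1%2) = " << ((-1 / 2) * 2 + (-1 % 2)) << "\n"; // -1
        return 0;
    }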

One more note: when I looked up binary search on the web, I found that this mid computation has an overflow problem, so it should be

  mid = low + (high - low)/2;
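
As an aside, here is a hedged sketch of a lower-bound style binary search using that overflow-safe midpoint; the function name and the bounds convention are my own choices for illustration:

    // lower_bound_index -- index of the first element >= key, or a.size() if none.
    #include <vector>
    #include <cstddef>

    std::size_t lower_bound_index(const std::vector<int>& a, int key) {
        std::size_t low = 0, high = a.size();
        while (low < high) {
            std::size_t mid = low + (high - low) / 2;  // safe even when low + high would overflow
            if (a[mid] < key)
                low = mid + 1;   // the answer is to the right of mid
            else
                high = mid;      // mid may be the answer
        }
        return low;
    }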

References

http://en.wikipedia.org/wiki/Modulo_operation
http://en.wikipedia.org/wiki/Rounding
http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html
http://xlinux.nist.gov/dads//HTML/binarySearch.html
http://www.codecodex.com/wiki/Binary_search


Acknowledgements

Thanks to Christian R.

2012-02-02

Future work: Computational literacy


I don't know what to call this kind of approach to natural languages; I might say computational literacy, or something like that. As far as I know, there is some research with a similar approach. For instance, some spam filters use an entropy based approach, and some people use statistical methods to find the author of a document. In a science fiction novel (The Foundation Series), Asimov wrote a scene in which a politician talks a lot, but an information analysis shows that there is no information at all in the talk.

We can extend the presented method in a more systematic way. For example, we can analyze famous, widely available books, e.g., the Bible, some of Shakespeare's works, IKEA's catalogs, and so on. Also, the translation of the Bible has been altered through history, and I would like to see the history of the information in it. If you know anything about research along these lines, please put it in the comments.

Appendix 1: person + tree = ?

The Kanji that combines a person (人) with a tree (木) means ``rest (休).'' When I was an elementary school student, the explanation of this character was ``a person is resting under a tree.'' Many Kanji are composed of several basic Kanji. A single tree (木) is a tree, two trees (林) mean woods, and three trees (森) mean forest. There is no four-tree character, but if it existed, it would probably mean jungle.

Appendix 2: Information theory, entropy, and compression algorithm

Some readers may not be familiar with information theory, entropy, or compression algorithms. In this article, I could not explain why we can estimate the entropy of a document by compressing it. If someone wants to know more, the following Wikipedia article would be a good start: http://en.wikipedia.org/wiki/Information_theory

Acknowledgments

This article is based on many party discussions, in which many of my friends participated. It started when I lived in Saarbruecken and continued until the Gruenkohl party. The ideas, for instance applying the compression method to the Bible and looking at the history of its compressed size, were born from the discussions with my friends. I thank all my friends who participated in these discussions.

Can we measure the complexity of natural language by an entropy based compression method? (6)


Conclusion

When we write an article in different languages, the length of the document differs even though the contents are the same. But if we compress these files with an entropy based compression algorithm, they become almost the same size, even if we write it in German, which has a complex grammatical structure, or in Japanese, with a completely different character system. From this observation, I have a hypothesis: ``The complexity of natural languages is more or less the same.'' Of course, this article tested only one document and only three different languages, so this cannot be any kind of proof of the hypothesis. But still, I find the result interesting.

We need more experiments, but I already have some ideas for applications if this hypothesis holds.

  1. Comparison of news articles: Assume a news source is in English and it is translated into Japanese. If we compress both articles and the compressed sizes differ by more than 50%, I would suspect the quality of the translation. Sometimes I have seen articles with quite different content. Another interesting application is comparing Wikipedia pages between languages.
  2. Comparison of article authors: How much does the entropy differ between articles written by the same person? Maybe the entropy is similar. We cannot directly compare two different documents, but how about the compression ratio? For example, we can compare the compression ratio of Francis Bacon's documents with that of William Shakespeare's documents. However, good authors might be able to imitate other people, so this comparison might be difficult.

These are all just my hypotheses, but I am pretty sure some people know more about this. For example, computer game developers always worry about how to fit all the data onto a device with limited memory, so the data must be compressed. Now big game companies have become international and they translate a lot of content. I believe these people have some knowledge about the entropy of their contents. If someone knows something, please leave a comment.

My friend Daniel L. also pointed out that phone signal compression may depend on the language. I once read an article saying that NTT Docomo uses an interesting basis for frequency decomposition and lossy compression, but I do not recall whether it is especially suited to Japanese sounds. Such language dependent algorithms (or language dependent parameters) may be an interesting topic.

Can we measure the complexity of natural language by an entropy based compression method? (5)


Entropy of a document

When I talked with Joerg, I recalled my time as a bachelor student. At that time, I could not write a paper in English directly, so I first wrote a manuscript in Japanese and then translated it into English. The sizes of the two TeX files differed; however, when I compressed these files, I realized the compressed file sizes were similar. I found it interesting, but I did not think further about it. It was around 1996, so I think I used the ``compress'' program.

At the Gruenkohl Party, I recalled this story again. I also realized that I have translated a few articles into three different languages, for example, Haruki Murakami's Catalunya Prize speech of 2011-6-11. Figure 1 shows the compression results for the same content in different languages and different encoding schemes.

Figure 1. The compressed sizes of documents with the same content in three languages. Although the original document size depends on the encoding method, the compressed sizes become similar.

The raw text size differs depending on the language and the encoding. However, an entropy based compression tool shows that the amount of information depends on neither the language nor the encoding scheme. We use bzip2 version 1.0.5 as the entropy based compression tool. You can download each original document from the original documents link, so you can also check the results. If you are interested, you can also translate the text into your own language and bzip2 it. If you get a result for another language, please let me know.
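
If you want to script the measurement instead of running the bzip2 command by hand, the following sketch does the same kind of comparison in memory with libbz2 (the library behind bzip2). It is my own illustration, assuming the libbz2 development headers are installed; build with g++ bzsize.cpp -lbz2:

    // bzsize.cpp -- print the raw and bzip2-compressed size of each input file.
    #include <bzlib.h>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <vector>

    // bzip2-compressed size of data, or -1 on error.
    static long bz2_size(std::vector<char>& data) {
        unsigned int destLen = static_cast<unsigned int>(data.size() + data.size() / 100 + 600);
        std::vector<char> dest(destLen);
        int rc = BZ2_bzBuffToBuffCompress(dest.data(), &destLen, data.data(),
                                          static_cast<unsigned int>(data.size()),
                                          9 /* block size, the bzip2 default */, 0, 0);
        return rc == BZ_OK ? static_cast<long>(destLen) : -1;
    }

    int main(int argc, char** argv) {
        for (int i = 1; i < argc; ++i) {
            std::ifstream in(argv[i], std::ios::binary);
            std::vector<char> data((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
            std::cout << argv[i] << ": " << data.size() << " bytes raw, "
                      << bz2_size(data) << " bytes compressed\n";
        }
        return 0;
    }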

I chose this document since it is not my work alone. I asked native speaker friends to help with the translation: I explained the Japanese contents to my friends, and they chose the words and the structure. If I did this alone, my non-native vocabulary or grammatical structure might bias the translation, but these translations have fewer such problems than my own translations.

Let's look at a bit more detail. Figure 2 shows how the encoding scheme affects the file size even when the contents and the language are the same. UTF-8 needs three bytes for one Kanji, but EUC needs only two, so the UTF-8 encoded file is significantly larger. But if we compress the differently encoded Japanese files, the sizes become almost the same. This is expected: the contents are exactly the same and only the mapping differs, so the entropy of the files is the same. However, it is an interesting result to me that when the text is translated into English and German, the compressed file sizes also become similar.
Figure 2. The compressed sizes of the Japanese document in two different encoding schemes: EUC and UTF-8. EUC encodes one character in two bytes, but UTF-8 encodes one character in three bytes. Yet the bzip2 compressed sizes become similar.

2012-02-01

Can we measure the complexity of natural language by an entropy based compression method? (4)


Size of books depends on languages

My friend (and my teacher) Joerg once asked me how large Japanese translated books are compared with books in other languages. Japanese uses Kanji (Chinese characters). Because one Kanji can encode what takes several Latin characters, he inferred that Japanese translated books are smaller or thinner than the original ones. For example, the word ``mountain'' is a single character, ``山,'' in Japanese. On the other hand, Kanji usually needs higher resolution than Latin characters.

I answered that the books seem thinner than the original ones. I have several of Shakespeare's books, and I assume these translations are as accurate as possible. Some friends who visited my place were impressed by how small Japanese books are. But there are other factors: for instance, a Japanese book might be made of thinner paper, the characters might be relatively smaller, and so on. This is an interesting point, but we must consider many parameters.

Can we measure the complexity of natural language by an entropy based compression method? (3)


Complexity of language

I recalled two ideas when we were talking about the differences between languages:

  1. Complexity of language, 
  2. Size of books depends on languages.

My friend (and my teacher) Alexander has a hypothesis: the complexity of all natural languages is more or less the same. By complexity of a language he means its total complexity: the size of the vocabulary, the grammatical structure, the writing system (complexity of characters), the pronunciation, everything. He told us that any language has some difficult aspects, but at the same time also some simple aspects. If we could average all the aspects of each language and compare them, the complexity of natural languages might be almost the same.

I have the same impression about language complexity as Alexander. Each language I have learned has some difficult parts and also some simple parts. I also think the complexity of natural language depends on the ability of the human brain. Because any child can learn any language (at least speaking and listening), I expect the complexity of natural languages to be more or less the same.

Can we measure the complexity of natural language by an entropy based compression method? (2)


Gruenkohl Party

On January 20th, 2012, we had a Gruenkohl party at Daniel's place. The people who gathered were from Holland, Germany, the US, Canada, and Japan. At such an international party, we often talk about our own languages and compare their properties.

For example, one person told us, based on his experience in a Chinese course, how complex the Chinese pronunciation system is and that it seemed almost impossible to learn. German's noun gender and article system is also a popular topic.

A friend pointed out to me that Japanese has a special counting system. When we count objects, how to count depends on what we count. For example, how to count people and how to count paper are different. I explained that we always use counter words, just as English says two pieces of paper or three pairs of jeans, only Japanese uses this counting system all the time.

I often hear that many languages are very difficult to learn. However, I suspect they may not be as complex as they sound. People tend to pick out the most difficult aspect of a language, but that aspect is not always as complex as it sounds. For example, Japanese uses 3000 characters, but most of them are combinations of around 100 basic characters. Many people have succeeded in learning Japanese, Chinese, German, and so on.

By the way, my favorite example of Chinese characters is the combination ``person + tree.'' What does ``person + tree'' mean? A lumberjack is the most popular answer from my friends, but I have never heard the correct answer so far. I will put the answer in the appendix of this article.

Can we measure the complexity of natural language by an entropy based compression method? (1)


Many of my friends come from other countries. We often talk about our mother tongues, and the discussion turns to which language is difficult or what kind of unique properties each language has. German has a complex grammar system, Japanese has complex characters and a unique counting system, and English has a huge vocabulary. I wonder: ``What is the complexity of natural languages?'' and ``Can we measure it?''

Together with my friends, I translated one Japanese text into English and German. Then we applied an entropy based compression method to the texts to see how much information each translated text has. This might tell us which language is complex in the sense of entropy. Namely, I try to measure: ``If the contents are the same, how much does the information entropy differ depending on the language?''

I will write a few articles on this topic.