Saturday, April 23, 2011

Is software the real bottleneck?

"The real bottleneck is software. Creating software can be done only the old-fashioned way. A human—sitting quietly in a chair with a pencil, paper, and laptop—is going to have to write the codes, line for line, that make these imaginary worlds come to life. One can mass-produce hardware and increase its power by piling on more and more chips, but you cannot mass-produce the brain."
Michio Kaku, "Physics of the Future"

It's true that all software is currently created the old-fashioned way, i.e. written by humans. However, the whole idea of artificial intelligence and the singularity is closely connected with changing that.

The goal is to allow machines (or rather software powering them) to improve themselves by analyzing and changing the initial code written by people.

It is about machine learning and machine adaptation (in both cases, read "software" for "machine").

As software code becomes longer and more complex, with software layers piling on top of one another, as hardware gets more powerful and allows more advanced algorithms to be implemented, and as the number of connections in various kinds of networks grows, people will find it increasingly hard to keep pace with all these changes.

They will have to allow software to modify itself. And the bottleneck mentioned by Michio Kaku should disappear at the dawn of singularity...

Thursday, April 21, 2011

Deviations from normal distribution depending on sample size

It's obvious when you visualize it :)

Depending on the number of observations in the sample (N), the observed distribution of a randomly generated sample (black) may look quite different from the normal distribution it was drawn from (red).
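
The same effect can be checked numerically without a plot. Below is a minimal Python sketch (not the source linked in this post) that measures how far a sample's empirical distribution is from the standard normal curve, using the largest gap between the two cumulative distributions. The function names, the sample sizes, the number of repeats and the seed are all my assumptions for illustration.

```python
import random
from statistics import NormalDist


def ks_distance(sample, dist=NormalDist(0.0, 1.0)):
    """Largest gap between the sample's empirical CDF and the reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


def average_gap(n, repeats=200, seed=42):
    """Mean gap over `repeats` independent normal samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += ks_distance(sample)
    return total / repeats


if __name__ == "__main__":
    # Hypothetical sample sizes; the gap should shrink as N grows.
    for n in (10, 100, 1000):
        print(n, round(average_gap(n), 3))
```

The average gap should get smaller as N grows, which is the numerical counterpart of the black histogram moving closer to the red curve.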

Source available here.

One more comparison for binomial distribution:

I would have forgotten if not for Google! ;)

It's that day of the year. And I would have forgotten if not for Google ;)

Another sign of how much information is gathered and processed by the company.

Sunday, April 17, 2011

Analyzing search keyphrase similarities with different methods

Have you ever wondered what kind of similar keyphrases people use when searching for something, and how they compare?
I have :)

Using a subset of keyphrases used by visitors of (some 550 items), I wrote a simple R program that tries to assign potentially similar phrases to each of the keyphrases used.

I employed three methods for this task:
  1. Levenshtein distance - see source code here
  2. length of longest common substring (LCS) - source code of my implementation here
  3. phrases starting with base phrase - very crude implementation available here
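
For readers who can't open the linked sources, here is a rough Python sketch of the three methods listed above. It is not my original R code: the function names, the example phrases and the thresholds (`max_distance`, `min_common`) are assumptions picked only for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def longest_common_substring_length(a, b):
    """Length of the longest run of characters that appears in both strings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            run = prev[j - 1] + 1 if ca == cb else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def starts_with_base(base, candidate):
    """True when the candidate is a longer phrase that begins with the base phrase."""
    return candidate != base and candidate.startswith(base)


def similar_phrases(base, phrases, max_distance=3, min_common=5):
    """Candidates per method; both thresholds are hypothetical defaults."""
    others = [p for p in phrases if p != base]
    return {
        "levenshtein": [p for p in others if levenshtein(base, p) <= max_distance],
        "lcs": [p for p in others
                if longest_common_substring_length(base, p) >= min_common],
        "prefix": [p for p in others if starts_with_base(base, p)],
    }


if __name__ == "__main__":
    # Made-up phrases, not taken from the real keyphrase data.
    phrases = ["r plot", "r plots", "r plot legend", "ggplot legend"]
    for method, matches in similar_phrases("r plot", phrases).items():
        print(method, matches)
```

Each method returns its own list of candidates for a base phrase, so the lists can be compared side by side.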
Check the small selection of the results of this exercise:

You can download all the results here.

As you will notice, each of the methods gives different results. No single set of results is definitively better than the others. Combining them in some way seems the most interesting solution.

Correcting spelling mistakes before analyzing dependencies seems important for improving results. Some other methods can be added as well.