(A fable on huge amounts of data and why we don’t need models)
There was a pig who wanted to be a scientist. He was not interested in models. When asked how he planned on making sense of the world, the pig would say in a deep, mysterious voice, “I don’t do models: the world is my model”, and then, with a twinkle in his eyes, look smugly at his interlocutor.
By his phrase, “I don’t do models: the world is my model”, he meant that the world’s data was enough for him, the pig scientist. The more data he had, the pig declared, the more accurately he would be able to predict what might happen in the world.
Around that time, some dogs opened a pub called “Doogle”, which was visited by all the animals in the jungle. The wine was delicious and the traffic at the pub was unprecedented. The dogs became rich and famous; they also obtained a lot of data from the visiting animals. They bought even more pubs and collected even more data about their customers.
Now, they wanted to analyze this data to attract even more customers to Doogle. The pig saw this as a big opportunity and gathered other like-minded pigs. The drove of pigs helped Doogle apply pigstatistical methods (ham-correlation formulation, etc.) to predict various things, including: the kinds of animals attracted to each kind of beverage; the drinking patterns of different animals; the kinds of tables liked by different classes of animals; arrival times; the number of glasses Doogle would need in the near future, etc., etc., etc. The pigs’ predictions turned out to be astonishingly accurate.
The services of our pigs were acquired by other entities, including FaceSlap, Barker, and Snorter, among others. Our heroic pigs helped their clients outshine the competition. In fact, the pigs’ method of collecting huge amounts of data and then applying pigstatistics to it came to be known as “Pig Data” in their honor.
In the meantime, somewhere in the jungle, a group of owl scientists, who throughout history had been making models and theories and performing experiments based on them, were now being told that it was all meaningless; that their approach was worthless. The owls didn’t pay any attention, even though everyone else was euphoric. However, if the truth be told, some owls did lose heart and became so demoralized that they gradually transformed into pigs and immersed themselves deep in the world of pig-data!
From time to time, Doogle, FaceSlap and others would make some modifications, such as changing the color of the wine-glass and seeing how quickly customers reached for the glass based on the color. By analyzing the customers’ reactions, the pigs could then determine which color resulted in the fastest response time. So this was the era of pig-data. The pigs had won the battle. Pig data was everywhere.
But the fact is that our hero-pig, whom we met at the beginning of our fable, was still not happy. He felt that things were only getting started. He wanted to replace the owls completely. What’s more, he wanted to predict EVERYTHING. He wanted psychohistory, as the ‘good doctor’ of old had dreamt. Yes sir, predicting everything was his goal!
He decided to start his quest by studying falling bodies. As was his norm, he collected data about all instances of all objects falling down all over the place. He now had huge amounts of data, and he applied pigstatistics to it. He discovered that more things fell in the morning and during the daytime, when animals were awake, and fewer things fell during the night, when animals were sleeping!
He shared his findings in front of the whole jungle, looking directly at the owls, who were also present. The chief owl, called Owlileonewtein, countered that while such information could be useful, it did not explain much. Why did bodies fall? At what rate did they fall? What were the relevant factors, etc?
On hearing this, the pig positively beamed with joy because he had come prepared. He announced proudly that he had found a correlation between the weight of a body and the speed at which it fell. His stats told him that while heavy things fell at a great speed, light things such as animal hair, bird feathers, etc., fell much more slowly. “So therefore”, he thundered, “I have discovered the law of falling bodies. Heavy goes fast; light goes slow.” All the animals clapped in joy. The law of falling bodies had been discovered!
Upon hearing all this, Owlileonewtein, the chief owl, said forcefully, “But this is not correct. If we ignore friction and air resistance, I can tell you that all bodies, regardless of their heaviness, fall at the same rate. Indeed, consider a frictionless plane…”
But as soon as he said this, the pig snorted, “Frictionless plane? My dear animals, has anyone ever heard of such an oxymoron?” All animals laughed.
Owlileonewtein protested: “No, based on my model, we can do suitable experiments to test it…”
On hearing this, the pig suddenly got very serious and menacing. He lifted his paw and pointed it at Owlileonewtein: “You, sir, are a relic of the past. Your way of doing things is over. Haven’t you heard what my fellow pig scientist, Peter Norpig, head of pig intelligence at Doogle, has said? ‘All models are wrong, and we can learn models from data.’ So enough of your models and enough of your model-based experiments. We need neither! All we need is pig-data!” And with this, the pig, in his furious excitement, stood up on his hind legs and shouted, stretching the word ‘pig’ with the full force of his pig personality: “Piiiiiiiiiiiiiiiiiiiiig!” And the animals responded: “DATA!”
“Piiiiiiiiiiiiiiiiiiiig” — “DATA”! “Heavy goes fast; light goes slow!”
Having demonstrated his power to the owls, as a last act of annihilation, he picked up a stone from the ground and tore away a strand of hair from his tail. Holding one object in each fore-leg, he dropped them at the same time. The stone reached the ground much earlier than the strand. With this, he dusted one fore-leg against the other, and then turned around to show his backside to the owls. He shouted triumphantly, one last time:
“Piiiiiiiiiiiiiiiiiiiig” — “DATA!”
(Endnote: There is nothing wrong with huge amounts of data, i.e. big data. But we need to think about the direction that we are taking. What are our goals? What are the potential benefits and potential limitations of such an approach? Does analyzing all kinds of data make any sense at all? Where does it make sense? Where doesn’t it make sense? Should we be reflecting on the claims being made about big data?)
Update @ 6 April, 2013: I made the following animation that makes a similar point in a different way.