Dark data is all around us, unseen and unused.
Data science has many buzzwords associated with it: big data, Hadoop, visualization, machine learning, to mention but a few. One of the most recent to trend is dark data.
So what is dark data?
Simply put, it's data that hasn't been (fully) analysed yet.
This can be for a variety of reasons: it may simply not have been analysed, or it may be data that's not fully accessible. At this point, I should confess that my view of dark data differs from the more widely held view that derives from Gartner's definition. Gartner basically say that it is data that you're holding in your organisation but are not analysing or using to its full potential at the moment. Personally, I think that this is only a subset of dark data, which I'd classify as Type 1 Dark Data or "Sleeping" data.
There are two more types and, to be honest, they are the more interesting ones.
Type 2 Dark Data is data that is analysed in an unexpected way and yields results the "publisher" didn't anticipate. A classic example of this is order numbers. You want to know how your competitor's business is doing, so you order a small item from them once a week on a Monday. Each time, you get an order number from them. Human nature means order numbers tend to be incremented by one for each order, so subtracting last week's number from this week's tells you roughly how many orders they received in between. If they publish accounts, and you can find out their turnover and profit, you can now work out the average size of their orders and how much profit they are making on each of them. Useful data that they certainly didn't expect to give you :)
I refer to this type as Type 2 Dark Data or "Leaking" data.
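To make the arithmetic concrete, here's a rough sketch in Python. Every number in it is made up for illustration, and the function names are my own. It also assumes the competitor's order numbers really do go up by exactly one per order, with no gaps or resets, which you'd want to sanity-check before trusting the result.

```python
from datetime import date

# Hypothetical observations: (date of our weekly test order, order number received).
observations = [
    (date(2024, 1, 1), 1000),
    (date(2024, 1, 8), 1090),
    (date(2024, 1, 15), 1200),
    (date(2024, 1, 22), 1300),
]


def weekly_order_counts(obs):
    """Orders placed between consecutive weekly test orders.

    Assumes order numbers increase by one per order, so the difference
    between two consecutive numbers is the number of orders in that week
    (our own test order included).
    """
    ordered = sorted(obs)
    return [later[1] - earlier[1] for earlier, later in zip(ordered, ordered[1:])]


def average_per_order(annual_figure, weekly_counts, weeks_per_year=52):
    """Spread an annual figure (turnover or profit) over the estimated orders per year."""
    mean_weekly = sum(weekly_counts) / len(weekly_counts)
    return annual_figure / (mean_weekly * weeks_per_year)


counts = weekly_order_counts(observations)

# Hypothetical figures taken from the competitor's published accounts.
annual_turnover = 520_000
annual_profit = 52_000

print("Orders per week:", counts)
print("Average order value:", average_per_order(annual_turnover, counts))
print("Average profit per order:", average_per_order(annual_profit, counts))
```

A few weeks of samples only gives a rough estimate; seasonal businesses would need a much longer run of Mondays before the yearly extrapolation means anything.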
Type 3 Dark Data is data you can't see directly, but you can infer its existence by looking at its effects on the world around it. This is the type that is closest in meaning to what physicists are talking about when they refer to the "dark" in "dark matter". Dark matter is matter you can't see, but you can infer its existence from the effect it's having: in this case, gluing the universe together. Type 3 Dark Data is the same: you can't directly see it, but you can infer its existence from the effect it's having.
An example of Type 3 Dark Data is the analysis of German bomb casings by the Allies during the Second World War. When you make bomb casings, you build them from scrap. A casing is going to be blown up, so you make it from whatever you can get hold of easily and won't need for anything else. But this means that if you, as the Allies, analyse the metallic content of the German bomb casings, you can work out which metals they are short of by seeing which metals are absent from the casings. Better yet, if you do this over time, you can see whether these shortages are changing.
So the Germans had a list of metals they were short of. The Allies couldn't see that list because it was highly classified. But they could work out what was on it because of the effect that list had: those metals were kept out of the bomb casings.
I call this type of dark data Type 3 or "Hidden" data.
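The same inference can be sketched in a few lines of Python. To be clear, the assay figures, the list of metals, and the 1% threshold below are all invented for illustration and are not historical data. The point is only the shape of the reasoning: flag whatever you'd expect to find but don't, and track how that changes between periods.

```python
# Hypothetical assay results: period -> samples, each mapping metal -> fraction by mass.
assays = {
    "period 1": [
        {"iron": 0.90, "copper": 0.06, "nickel": 0.04},
        {"iron": 0.88, "copper": 0.08, "nickel": 0.04},
    ],
    "period 2": [
        {"iron": 0.97, "copper": 0.03},
        {"iron": 0.98, "copper": 0.02},
    ],
}

# Metals we would expect to turn up in scrap if nothing were in short supply (an assumption).
EXPECTED_METALS = ("iron", "copper", "nickel")


def scarce_metals(samples, expected=EXPECTED_METALS, threshold=0.01):
    """Return expected metals whose mean fraction across the samples is below the threshold."""
    scarce = []
    for metal in expected:
        mean_fraction = sum(sample.get(metal, 0.0) for sample in samples) / len(samples)
        if mean_fraction < threshold:
            scarce.append(metal)
    return scarce


# Repeating the check per period shows whether the inferred shortages are changing.
shortages_by_period = {
    period: scarce_metals(samples) for period, samples in sorted(assays.items())
}

for period, metals in shortages_by_period.items():
    print(period, "->", metals or "no obvious shortages")
```

With these made-up samples, nickel disappears from the casings in the second period, which is exactly the kind of absence that hints at an entry on the hidden list.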
So dark data comes in three forms: Sleeping, Leaking and Hidden.
Identifying and acquiring dark data can be quite challenging but is great fun :)
I'll throw in a bit of a teaser here: I think there is a fourth type as well. But I'll leave that for another post to discuss :-)
Primary source: Dark Data by Prof Mark Whitehorn
"What sane person could live in this world and not be crazy?" - Ursula K. Le Guin
Image credit: 123RF.com