When you look at data, one of the most important things you need to consider is "Can I answer the question I'm asking with the data I have?".
There's a famous example of this from the Second World War that I was reminded of by this post on LinkedIn by Nicholas Nunno.
The basic premise is that the Navy of the time wanted to improve the armour on their combat aircraft, so they recorded where planes were getting shot during missions over hostile territory and got the results below. The obvious way to improve the survivability of your aircraft is to increase the armour in the areas with lots of red dots. Now the question is, can you see the flaw in this?
Yep, that's right - the problem is they were only analysing the survivors, a fact that was obvious to Abraham Wald, a statistician of the time. He suggested they'd be better off armouring the nose, the engines, and the mid-section. Why? Because the aircraft being hit there weren't making it home. In statistical terms this is known as survivorship bias, a form of selection bias. The Navy had biased their results because they were only analysing the survivors. What they'd actually found was the places on an aircraft that could absorb significant damage without suffering a catastrophic failure.
Ironically, of course, they had actually answered the original question of where you need to increase the armour on a combat aircraft to improve its chances of survival ... just not in the way they'd expected.
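If you want to see the effect for yourself, here's a rough simulation sketch in Python. To be clear, the section names beyond the ones mentioned above, the per-hit loss probabilities, and the function names are all made up for illustration; none of this is historical data. Every plane takes hits spread evenly across its sections, but hits to some sections are far more likely to bring it down, and then we compare where the hits appear on all planes against where they appear on the planes that came home.

```python
import random

# Hypothetical chance that a single hit to each section downs the plane.
# These numbers are invented for illustration, not taken from any real data.
LETHALITY = {
    "nose": 0.5,
    "engines": 0.6,
    "mid-section": 0.4,
    "wings": 0.05,
    "tail": 0.05,
}


def simulate(n_planes=10_000, hits_per_plane=3, seed=0):
    """Return hit counts per section for all planes and for survivors only."""
    rng = random.Random(seed)
    sections = list(LETHALITY)
    all_hits = dict.fromkeys(sections, 0)
    survivor_hits = dict.fromkeys(sections, 0)
    for _ in range(n_planes):
        # Hits land uniformly at random across the sections.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # The plane gets home only if it survives every individual hit.
        survived = all(rng.random() >= LETHALITY[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                survivor_hits[s] += 1
    return all_hits, survivor_hits


def shares(counts):
    """Convert raw hit counts into each section's share of the total."""
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


all_hits, survivor_hits = simulate()
all_share = shares(all_hits)
survivor_share = shares(survivor_hits)
for section in LETHALITY:
    print(f"{section:12s} all planes: {all_share[section]:.2f}  "
          f"survivors: {survivor_share[section]:.2f}")
```

Under these assumptions every section gets hit about equally often across the whole fleet, yet the survivors show most of their damage on the wings and tail. Someone looking only at the returning aircraft would be tempted to armour exactly the wrong places.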
So when you look at a new data source with a question in mind, you need to make sure that the data you have can actually answer it.
And of course you always need to think outside the box :-)
“You can't force creatives into a box. If you try, they'll no longer be creative. And no one will want your box.” ― Ryan Lilly