Tag Archives: data

Take the Survey

Creating a good survey, one that gives you robust results, takes skill. In a former life I worked for a data analytics company where a team created surveys for consumers, and there I gained an appreciation of that skill. I have since worked with online surveys. Here are some aspects of survey design to consider.

Sample Size

Imagine you want to know whether Dutch people prefer dark or milk chocolate. The population of the Netherlands is 16.8 million. How many of them do you need to ask?

It turns out, not that many. If I collected data from 1067 people I could be 95% sure that my answer was correct with a margin of error of 3%. That means that if 70% choose milk chocolate the answer in the general population will lie between 67% and 73%. So if you’re a chocolate manufacturer you now know to make most of your flavours based on milk chocolate.

You can be more sure of the answer the further the outcome is from 50%. For the chocolate maker an answer of 47-53% would still be useful, but it’s problematic if you’re predicting political outcomes.
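As a rough illustration of why the distance from 50% matters, here is a minimal sketch of the usual margin-of-error formula for a proportion. It assumes a simple random sample and the normal approximation; the function name `margin_of_error` and the example shares are my own, not taken from any particular calculator.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for an observed share p in a sample of n.

    z=1.96 corresponds to 95% confidence. Assumes simple random sampling.
    """
    return z * math.sqrt(p * (1 - p) / n)

# The margin is widest at 50% and narrows as the observed share moves away from it.
for share in (0.5, 0.7, 0.9):
    print(f"observed {share:.0%}: +/- {margin_of_error(share, 1067):.2%}")
```

With 1067 respondents the margin is about 3 points at 50% and under 2 points at 90%, so the 3% figure is really the worst case.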

Once upon a time I knew the maths behind these calculations; now I just use an online calculator.
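For the curious, the calculation most online calculators perform looks roughly like this. It is a sketch under stated assumptions: a simple random sample, the worst-case share of 50%, and an optional finite population correction. The function name `sample_size` and its parameters are illustrative.

```python
def sample_size(margin, z=1.96, p=0.5, population=None):
    """Respondents needed for a given margin of error.

    z=1.96 is the 95% confidence level; p=0.5 is the most cautious assumption
    about the true share. If a population size is given, a finite population
    correction is applied.
    """
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return n

# Dutch chocolate example: 3% margin, 95% confidence, 16.8 million people.
print(round(sample_size(0.03, population=16_800_000)))  # roughly 1067
```

Note how little the population size matters: dropping the `population` argument gives practically the same number, which is why a thousand or so respondents can stand in for millions.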

Sample Selection

Your sample should reflect your target population as much as possible. This may involve excluding some people from participating – if you are researching hair care products you don’t need bald men in your sample. For wider issues it is more likely that you will try to construct a sample that mirrors the total population in terms of gender, race, age, income, family status, religion, location, gender identity and sexuality. That’s not easy. The further your sample is from your target group, the less reliable the outcome of your survey.

Method Bias

Your method of collecting data may introduce bias. If you are collecting data by calling domestic numbers during working hours you exclude working people. If you collect data online you exclude those not on the Internet, and limit respondents to the small group that finds your website.

If you are collecting data online you need to control for bots, and you may want to limit the number of times a respondent can answer.
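One simple way to limit repeat answers is to keep only the first response per respondent. The sketch below assumes each response arrives with some kind of respondent identifier (an account ID or a token from the invitation link, say); the function name and the data are hypothetical, and this does nothing about bots on its own.

```python
def first_response_per_respondent(responses):
    """Keep only the first answer from each respondent, preserving order.

    `responses` is an iterable of (respondent_id, answer) pairs.
    """
    seen = set()
    kept = []
    for respondent_id, answer in responses:
        if respondent_id in seen:
            continue  # this respondent has already answered
        seen.add(respondent_id)
        kept.append((respondent_id, answer))
    return kept

# Hypothetical responses: respondent "a" tries to vote twice.
raw = [("a", "milk"), ("b", "dark"), ("a", "dark")]
print(first_response_per_respondent(raw))
```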

Question Construction

To get useful data from your survey you need to construct your questions to be neutral, unambiguous, not leading and specific.


“Do you smoke cigarettes?” is neutral.

“Are you a filthy smoker?” is not.


It should be clear what information you are seeking in your question; there are two traps to avoid here.

  • Asking two things in one question

“How friendly and helpful was your customer agent today?” asks two things, and it’s impossible to decide how to answer if your customer agent solved the problem but was grumpy on the phone with you. You need to split this into two questions.

  • Using negatives

“Do you disagree that raising taxes won’t create jobs?” is confusing. Rewrite this to ask “Do you agree that…?” to simplify it.

Avoid Leading Questions

Leading questions contain details that indicate the expected answer.

“When will you start offering free upgrades?” assumes that you will offer free upgrades.


You will get more accurate and useful data if you ask specifics.

“Do you eat chocolate regularly?” doesn’t tell you much since ‘regularly’ means different things. Much better to ask “how often do you eat chocolate?” and give people a series of ranges to choose from.

What led to this post? A friend posted a strange survey from the President of the United States that breaks every single one of these rules, and a few others.

Here’s the title page of the survey. Given that it was sent out after the press conference where the press was repeatedly called “Fake news”, the title is clearly priming you to doubt the accountability of the media.


The survey was sent to known Republican supporters, yet the President represents all Americans. The questions are certainly not neutral, and some are just confusing. Here’s the most confusing:

And here’s the most ironic, given that we have already seen that the President uses “alternative facts”, misleading statements and untruths.

All of which is to say that when the Presidential PR machine talks about having data showing how people don’t trust mainstream media, remember that its data collection is flawed and the results cannot be trusted.

Images; What?  |  Véronique Debord-Lazaro  |   CC BY-SA 2.0

Believe Data

We were sailing back to our home port when a dense fog descended. Suddenly we couldn’t see more than a boat length ahead. My father, a mariner by profession, plotted a course and steered by it, sending my brother and me forward as lookouts.

My mother was convinced we were sailing in the wrong direction, that we’d steered off course (and this was before the reassurance of GPS). “No,” said my father, “you must trust your instruments”.

We made it safely home; it was an early lesson in believing data.

The amount of data produced and collected every day continues to grow. “Big Data” is a well-known, although poorly understood term. In many companies we’ve moved on to “data-driven decisions”. But we’re not always good at believing the data.

I was in a meeting recently where the most senior person in the room looked at a graph of Twitter follower growth and said “I just don’t believe this data”. The data showed that goals for follower numbers would not be met. Leaving aside the argument on whether follower numbers are a good goal, the data don’t lie. If there’s a straight line of progress that won’t reach the goal then you need to change something or accept missing the goal.
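The straight-line check is simple enough to sketch in a few lines. The numbers below are invented for illustration, not the figures from that meeting, and the function name `projected_value` is my own; it just extends the line from the starting point through the current point to the end of the period.

```python
def projected_value(start, current, periods_elapsed, periods_total):
    """Extend the straight line from `start` through `current` to the final period."""
    rate_per_period = (current - start) / periods_elapsed
    return start + rate_per_period * periods_total

# Hypothetical follower numbers, six months into a twelve-month goal.
goal = 15_000
projection = projected_value(start=10_000, current=11_500,
                             periods_elapsed=6, periods_total=12)
print(f"projected: {projection:.0f}, goal met: {projection >= goal}")
```

If the projection lands short of the goal, no amount of disbelief will close the gap; only a change in the growth rate will.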

It made me think about when we believe data and when we should be sceptical.

We tend to measure progress against an expected path, and in a large organisation we invariably report that progress upwards. In our plans and projections that progress follows a nice upward curve. But the reality is different: every project encounters setbacks, and the graph is more jagged than smooth.

In fact a smooth graph, where targets are always met, should raise questions.

Years ago I was chatting to a guy who left his previous company after about four months. He left because the targets for the quarter were increased by 25%, and everyone met them. As an experienced business person he knew that a situation where every business unit met the stretch goal in the first quarter it was applied was very, very unlikely. His suspicions were raised and he left as quickly as he could. A year later the company collapsed under its own lies. The company? Enron.

In his articles (and books) Ben Goldacre campaigns for greater journalistic care in reporting data, and better education on scientific method. He points to the dangerous habit of pharmaceutical companies in cherry-picking their data, choosing studies that support their product and ignoring those that don’t.

I said earlier that we should trust the data, but we also need to know how the data was collected, what errors might be inherent in the data collection methodology, and what limits there might be to interpreting the data. This should be part of everyone’s mental toolkit. It would help us evaluate all those advertising claims, refute 90% of the nonsense on the internet, be honest about progress to goals, and finally make data-driven decisions.


Image; Research Data Management  |  Janneke Staaks  |  CC BY-NC 2.0



I Think You’ll Find It’s a Bit More Complicated Than That

I Think You’ll Find It’s a Bit More Complicated Than That
Ben Goldacre

This is a romp through Dr. Goldacre’s analysis of weak claims and poorly reported science. He argues that journalists should cite, and link to, the sources of the research behind the headlines. He also argues that we, the unsuspecting public, should know how to read scientific studies for ourselves, and we should question the reports rather than swallow the conclusions whole.

So if you’ve ever read a science-y headline and thought to yourself “that doesn’t sound right” this book is for you. It takes a look at scientific method and points out some of the pitfalls in constructing a good experiment, and in the process gives some pointers about what to look for when evaluating a scientific story:

  • Who funded the study?
  • How well was the experiment designed?
    • sample size
    • scientific method; was there a simple
    • testing a single hypothesis
  • Cherry Picking the data; does the report use a small group of reports to prove a point rather than all research?
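One way to see why testing a single hypothesis matters: if a study measures many outcomes and reports whichever one looks significant, the chance of a fluke rises quickly. The sketch below assumes independent tests at the conventional 5% significance level, which is a simplification; the function name is my own.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    appears significant purely by chance, at significance level `alpha`."""
    return 1 - (1 - alpha) ** num_tests

# The more outcomes a study measures, the more likely one of them "works".
for tests in (1, 5, 20):
    print(f"{tests} tests: {chance_of_false_positive(tests):.0%}")
```

With a single test the risk is 5%; with twenty it is closer to two in three.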

In the past three weeks three cases have popped up in social media that prove the need to both hold journalists to a higher standard and to educate us all.

(1) Proving nothing; A Swedish family ate organically for two weeks, and tests showed a drop in the concentration of pesticides in their urine.

So the family had their urine tested for various pesticides on their usual diet, then ate organic food for two weeks, then tested the urine again. Their urine was tested daily over the two weeks and by the end there was almost no pesticide in the urine.

Note that “organic” doesn’t mean pesticide-free, so the family could still have consumed some pesticide with their organic meals. The article doesn’t report on whether that was tested for.

The article calls the drop in pesticide levels a “staggering result”. No, not staggering, just school-level biology. You could do the exact same test with vitamin C. Give people a high vitamin C diet for a month, then remove vitamin C from their diet. Hey presto! No vitamin C in the urine.

This report hits the trifecta; small sample size, poor design, funded by a supermarket with a range of organic foods. Essentially this “experiment” simply proved that the Swedish family have well-functioning kidneys.

(2) Faked Data; There was a really interesting study done on attitudes to same-sex marriage. It concluded that conversation with a gay surveyor/canvasser could induce long-term attitude change. The study seemed to be well constructed, with a good data set supporting the conclusion. The optimistic news was widely reported late last year when the study was released.

But when scientists started digging into the data and trying to replicate the results, something didn’t stack up. The study has now been retracted by one of the authors, and it seems there will be a further investigation.

It’s not always the journalists at fault.

(3) We’re easily fooled; Daily dose of chocolate helps you lose weight.

Before you rush out to buy a week’s supply of your favourite chocolate bars, it’s not true.

But it turns out that it’s rather easy to generate the research and result to prove this, and extremely easy to get mainstream media to report on it, as John Bohannon proved in setting up this experiment and the associated PR.

So there can be flaws or outright fraud in science. Journalists can, on occasion, twist the story to deliver the headline. And we, the public, are ready to believe reports that reinforce our own opinions, and we’re too ready to believe good news about chocolate.

Turns out if it sounds too good to be true we should ask more questions.

Many of the articles in this book are already published in the Guardian, and if you want to read more on bad science Dr. Goldacre has his own site with the helpfully short title: Bad Science. He campaigns for greater journalistic responsibility in reporting science, for using the scientific method to test policy decisions, and for better education on scientific method.

He’s right, on all three.

Chocolate Image; I Need Chocolate |  Kit  |  CC BY-NC 2.0

Data for dummies: 6 data-analysis tools anyone can use

I’ve spent about an hour playing with these tools. I’m loving Statwing, and will use it to analyse some of the data we’ve got on adoption of new technology. The Infogram tool also has potential to help present data in a more appealing way.


If you care only about the cutting edge of machine learning and how to manage petabytes of big data, you might want to quit reading now and just come to our Structure:Data conference in March. But if you’re a normal person dealing with mere normal data, you’ll probably want to stick around. Although your data might not be that big or complex, that doesn’t mean it isn’t worth looking at in a new light.

With that in mind, here are six of the best free tools I’ve come across for helping we mere mortals analyze our data without having to know too much about, well, anything (I’d keep an eye on the still-under-wraps Datahero, too). I’ve gathered some personal data and tracked down some interesting public data sets to help demonstrate what a novice can do with them. Someone with more skills can certainly do a lot more, and…


“Let the Data set change your Mind Set”

Another fantastic TED talk on data visualisation, and how it might help us understand the enormous amounts of complex information we’re facing.

I love working on visualisation of information, and have had mild success in simplifying problem statements or project goals into single images. I admit I get a kick out of the moment when the data/information “clicks” into place and the diagram becomes clear and simple. I get another click when someone else’s response is along the lines of “ah, now I get it”.