I really want to use data science! Where should I start?

Economist Robin Hanson’s most viral tweet says:

“Good CS expert says: Most firms that think they want advanced AI/ML really just need linear regression on cleaned-up data.”

I’d take this even further and say that most firms that think they want linear regression really just need good data visualization on trustworthy data.

And those that want data visualisation really do need it, but more importantly, they need to make sure their data is trustworthy first.

Recently, I was talking with a friend about the challenges at the company they had just started working at. They wanted to build a linear regression or machine learning model to predict the success of potential candidates and hired employees based on a variety of factors: years worked, years of education, languages coded in, performance review information… The list of potential predictors was immense. The amount of data available on employees was likely small. The goal was extremely ambitious.

I believed it was a waste of time and that they should start with something more basic. Why?

Data Quality

My friend had no clue what state the existing data at the company was in. Before doing absolutely anything, it is critical to ensure your data can be trusted. This may mean retagging any web analytics platforms, ensuring any human-entered data follows a standardised process, making sure CRM data uses a customer ID that matches the one in the web analytics platform, checking that all online advertising media is appropriately tagged… Anywhere that data quality can be impacted should be assessed. If this step is skipped, any future work may be wasted because no one will trust the recommendations that come out of the data. If you want your data to inform transformational change in your organisation, you don’t want something as fundamental as data quality to undermine your project. So, I recommended my friend really dig into the data and its quality before proceeding with any data projects, be they visualisation or model building.
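To make this concrete, here is a minimal sketch of the kind of checks I mean, using only the Python standard library. The records, the field names (customer_id, email, sessions) and the quality_report helper are all made up for illustration; a real audit would depend on the systems involved. It looks for duplicate IDs, empty fields, and customer IDs that appear in one source but not the other.

```python
from collections import Counter

# Hypothetical CRM and web analytics exports; field names are assumptions.
crm_rows = [
    {"customer_id": "C001", "email": "a@example.com"},
    {"customer_id": "C002", "email": ""},
    {"customer_id": "C002", "email": "b@example.com"},
    {"customer_id": "C004", "email": "d@example.com"},
]
web_rows = [
    {"customer_id": "C001", "sessions": 12},
    {"customer_id": "C003", "sessions": 4},
    {"customer_id": "C004", "sessions": 7},
]


def quality_report(crm, web, key="customer_id"):
    """Summarise basic data quality problems across two sources."""
    crm_ids = [row[key] for row in crm]
    web_ids = {row[key] for row in web}

    # IDs that occur more than once in the CRM export.
    counts = Counter(crm_ids)
    duplicates = sorted(id_ for id_, n in counts.items() if n > 1)

    # Number of empty or missing values per CRM field.
    missing = {
        field: sum(1 for row in crm if row.get(field) in (None, ""))
        for field in crm[0]
    }

    return {
        "duplicate_ids": duplicates,
        "missing_values": missing,
        "crm_only": sorted(set(crm_ids) - web_ids),  # no web analytics match
        "web_only": sorted(web_ids - set(crm_ids)),  # no CRM match
    }


report = quality_report(crm_rows, web_rows)
for check, result in report.items():
    print(check, result)
```

Even a report this small raises the right questions: why is one customer in the CRM twice, and why do some IDs exist in only one system?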

Analysis 

My friend had not looked at what insights the existing data could already reveal. Before moving on to building a predictive model, what can you learn from your data through data visualisation projects? Usually, you can learn a lot, including whether building a model will someday be possible or worth it. Depending on the questions asked within an organisation, you will use data visualisation or build a model to answer them: rather than focusing on using the trendy tools of machine learning, focus on answering an important business question. Maybe this will someday need machine learning, maybe it will need an analysis and insights project. So, I recommended my friend work on visualising their existing high-quality data and see what patterns and trends they could use to make decisions at their company.
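As a rough illustration of how little is needed to start, the sketch below groups some invented employee records and prints a text bar chart of the average review score per team. The data, the field names (team, years_worked, review_score) and both helper functions are assumptions for the example, and a proper charting tool would replace the text bars in practice.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical employee records; every value here is invented.
employees = [
    {"team": "data", "years_worked": 2, "review_score": 3.5},
    {"team": "data", "years_worked": 6, "review_score": 4.5},
    {"team": "web", "years_worked": 1, "review_score": 3.0},
    {"team": "web", "years_worked": 3, "review_score": 2.0},
    {"team": "web", "years_worked": 8, "review_score": 4.0},
]


def summarise(rows, group_key, value_key):
    """Return the count and mean of value_key for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: {"n": len(values), "mean": round(mean(values), 2)}
        for group, values in sorted(groups.items())
    }


def text_bar_chart(summary, width=20, max_value=5.0):
    """Render each group's mean as a bar of '#' characters."""
    lines = []
    for group, stats in summary.items():
        bar = "#" * round(stats["mean"] / max_value * width)
        lines.append(f"{group:<8} {bar} {stats['mean']} (n={stats['n']})")
    return lines


summary = summarise(employees, "team", "review_score")
lines = text_bar_chart(summary)
print("\n".join(lines))
```

Printing the group sizes next to the averages matters: with only two or three people per group, any pattern is a prompt for more questions rather than a finding, which is also a hint about whether a model is feasible yet.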

Data Science

While linear regression and model building were what I was asked for help with, I actually didn’t give my friend many recommendations here. I figured that they had enough to chew on investigating data quality and analysis projects. But if you do ever get this type of project, make sure you have enough high-quality data and have extracted a lot of insights from it through data visualisation before beginning. Make sure the model will help the business move forward with data-driven decisions. Data science models can be highly effective in increasing ROI or influencing advertising targeting, such as through an attribution project, diminishing returns project, churn project, or customer lifetime value project, but make sure you are ready to use the results. So, I recommended my friend hold off on building a model and start with the first two recommendations.

Do it all again!

None of these steps should be considered done forever and then ignored. Go back and investigate data quality frequently! Always ask new questions of the data and visualise it in new ways to help answer those questions! Be curious and explore the data you have!

Conclusion

Jumping ahead a step can be enticing. But the problem with jumping ahead a step is that you will get a lower-quality model or visualisation, which is less likely to be trusted to influence decisions. Or worse, it is used to influence decisions but in incorrect ways.

You have to start somewhere. Make sure you start at the right step, and focus on the question more than on the tool you want to use, so that above all else you get high-quality recommendations.

 

Reminiscing about 2016 Data Science Conferences

Remember last year’s data science conferences? We are well into 2017, but I’ve been thinking about my favourite data science talks from 2016 as the same conferences come up again. I’m excited to see what talks come out of the conferences this year! useR! 2017 just released its schedule and I’m excited to view the recordings once they are available.

My favourites both come from useR! 2016: “broom: Converting statistical models to tidy data frames” and “jailbreakr: Get out of Excel, free”.

Learn why these were my favourites on Cardinal Path’s blog.

 

Attribution and Goodhart’s Law

Check out this post I wrote on Cardinal Path’s blog! This is one of my favourite blog posts. Goodhart’s Law seems to pop up everywhere! That’s why you need to think about how you measure your goals and targets and make sure they don’t skew incentives in an undesirable direction.

From the post:

“When a measure becomes a target, it ceases to be a good measure.” – Goodhart’s Law

Goodhart’s Law reminds us that oftentimes, setting a target (such as a KPI) can change the way in which people work toward the goals these targets are meant to help us reach.

An absurd example of this occurred in India during the time of British rule. There was an abundance of venomous cobras loose on the streets, so the British government offered money for each dead cobra that was turned in. Rather than resulting in everyone catching many cobras and handing them in to collect their bounty, thus reducing the cobra population, people actually began to breed cobras and hand them over to the government for the bounty. This was obviously not a solution to the problem. To add to the problem, once the government caught wind of these breeders, they stopped paying the bounty, so the breeders released all of the cobras, resulting in an even higher population of cobras.

Read the whole thing.