Economist Robin Hanson’s most viral tweet says:

“Good CS expert says: Most firms that think they want advanced AI/ML really just need linear regression on cleaned-up data.”

I’d take this even further and say that most firms that think they want linear regression really just need good data visualisation on trustworthy data.

And those that want data visualisation really do need it, but first they need to make sure the data behind it is trustworthy.

Recently, a friend was telling me about the challenges at the company they had just started working at. They wanted to build a linear regression or machine learning model to predict the success of potential candidates and hired employees based on a variety of factors: years worked, years of education, languages coded in, performance review information… The list of potential predictors was immense. The amount of data available on employees was likely small. The goal was extremely ambitious.

And I believed it was a waste of time and they should start with something more basic. Why?

Data Quality

My friend had no clue what state the company’s existing data was in. Before doing absolutely anything, it is critical to ensure your data can be trusted. This may mean retagging your web analytics platforms, ensuring any human-entered data follows a standardised process, checking that CRM data uses a customer ID consistent with the one in the web analytics platform, confirming all online advertising media is appropriately tagged… Anywhere that data quality can be affected should be assessed. If this step is skipped, any future work may be wasted because no one will trust the recommendations that come out of the data. If you want your data to inform transformational change in your organisation, you don’t want something as fundamental as data quality to undermine your project. So, I recommended my friend really dig into the data and its quality before proceeding with any data projects, be they visualisation or model building.
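As a rough illustration of the customer ID check mentioned above, here is a minimal sketch in Python. It assumes both systems can be exported as lists of records sharing a field called customer_id; that field name, the function name audit_customer_ids, and the sample records are all hypothetical, not taken from any real platform.

```python
from collections import Counter


def audit_customer_ids(crm_rows, analytics_rows, key="customer_id"):
    """Return basic data-quality findings for two lists of dict records.

    The field name `customer_id` is an assumption for this sketch.
    """
    crm_ids = [row.get(key) for row in crm_rows]
    analytics_ids = [row.get(key) for row in analytics_rows]

    # Records with an absent or blank ID cannot be joined to anything.
    missing_in_crm = sum(1 for value in crm_ids if not value)

    # IDs appearing more than once in the CRM suggest duplicate entry.
    counts = Counter(value for value in crm_ids if value)
    duplicated = sorted(value for value, n in counts.items() if n > 1)

    crm_set = set(counts)
    analytics_set = {value for value in analytics_ids if value}

    return {
        "missing_in_crm": missing_in_crm,
        "duplicated_in_crm": duplicated,
        # IDs present in one system but not the other will not match up.
        "crm_only": sorted(crm_set - analytics_set),
        "analytics_only": sorted(analytics_set - crm_set),
    }


# Made-up sample records, purely for illustration.
crm = [
    {"customer_id": "C001"},
    {"customer_id": "C002"},
    {"customer_id": "C002"},
    {"customer_id": ""},
]
web = [
    {"customer_id": "C001"},
    {"customer_id": "C003"},
]

report = audit_customer_ids(crm, web)
print(report)
```

A non-empty list or non-zero count in any of these fields is a prompt to go and ask how the data was entered or tagged, not a verdict in itself.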

Analysis

My friend had not looked at what insights the existing data could already reveal. Before moving on to building a predictive model, what can you learn from your data through data visualisation projects? Usually, you can learn a lot, including whether building a model will someday be possible or worth it. The question being asked within the organisation should determine whether you use data visualisation or build a model to answer it: rather than focusing on the trendy tools of machine learning, focus on answering an important business question. Maybe this will someday need machine learning; maybe it will need an analysis and insights project. So, I recommended my friend work on visualising their existing high-quality data and see what patterns and trends they could use to make decisions at their company.

Data Science

While linear regression and model building were what I was asked for help with, I actually didn’t give my friend many recommendations here. I figured that they had enough on their plate investigating data quality and analysis projects. But if you do ever get this type of project, make sure you have enough high-quality data and have extracted plenty of insights from it through data visualisation before beginning. Make sure the model will help the business move forward with data-driven decisions. Data science models can be highly effective in increasing ROI or informing advertising targeting, such as through an attribution project, diminishing returns project, churn project, or customer lifetime value project, but make sure you are ready to use the results. So, I recommended my friend hold off on building a model and work through the first two recommendations first.

Do it all again!

None of these steps should be considered done forever and then ignored. Go back and investigate data quality frequently! Always ask new questions of the data and visualise it in new ways to help answer those questions! Be curious and explore the data you have!

Conclusion

Jumping ahead a step can be enticing. But the problem with jumping ahead is that you will get a lower-quality model or visualisation, which is less likely to be trusted to influence decisions. Or worse, it is used to influence decisions but in incorrect ways.

You have to start somewhere. Make sure you start at the right step, and focus on the question more than on the tool you want to use, so that above all else you get high-quality recommendations.