Learn to code for data: a pragmatist’s guide

7 minute read

A recipe to go from spreadsheet to code

Spreadsheet applications like Microsoft Excel and Google Sheets are great. They’re hard to beat when you want to perform simple calculations or complete some financial modelling.

However, there comes a point when you should ditch the spreadsheet and go with another solution. I’ve seen some hideous spreadsheet reports in my time! When you start using your spreadsheet as a report, a database, and as a data transformation tool, you need to stop! You have gone too far. These spreadsheets are nightmares to maintain and debug.

So what’s the alternative? The alternative is to learn to code for data analysis! You’ve no reason to be scared of code. If you already know how to use a spreadsheet program, you have plenty of analogies you can draw from to make this an easy process. I’ve mentored several analysts through these stages of their development before. I’ve seen the approach outlined in this article work time and time again!

Let’s get started!

Prerequisite: Learn how to use a spreadsheet to analyse data

If you don’t know how to use a spreadsheet application, this is where you should start. This is where you’ll learn the data-related concepts that’ll make your transition into ‘coding for data’ easier.

Starting with zero experience

If you’re starting with zero experience, then I’d recommend taking a free spreadsheet course on YouTube. There are many of these so just pick one and complete it.

Starting with some experience

If you’re starting with some experience, then you just need to practice until you feel comfortable with spreadsheets. There are many open-source data sets available online. Come up with 5 questions you could ask of the data and answer them using those data sets. Some questions you could ask are these:

“What is the average value in column X?”
“How many cells are blank/contain missing values in column X?”
“Sort column X in descending order.”
“Create a line chart of column X.”
“What is the sum of column X for each of the values in column Y?”

The concepts you will be learning

Regardless of your experience level, your adventures in spreadsheet land will teach you some valuable concepts that apply to ‘coding for data’:

  • Formulas can be applied to individual cells or to a bunch of cells. This is similar to the concept of ‘vectorisation’ where we apply operations to entire arrays.
  • Data can be summarised through aggregation. Pivot tables will teach us what it means to count or sum columns by groups defined in other columns.
  • Visualisations are powerful. Pivot charts will allow you to understand which types of charts work for different types of data.
  • Two data sets can be ‘merged’. We can learn this through the power of formulas like ‘VLOOKUP’. Once you feel ‘fluent’ in using your spreadsheet application of choice, it’s time to graduate to the land of code!

Going from spreadsheet to code

Languages to focus on

Microsoft Office users might be thinking:

I use Excel heavily and I’ve heard about ‘VBA’. I heard its code can be used to automate my Excel reports. Should I learn it?

No! Avoid VBA. I say this as someone who has written a lot of it in the past. You’ll be better off learning a ‘transferable skill’ — a skill that you can take to your next employer regardless of whether they happen to use Microsoft Office.

Assuming that you have no programming experience, I think that you should focus on SQL and R.

Learning SQL is a no-brainer. It’s the language of relational databases which are found everywhere! You must learn it.

But why R? Why not Python? Because it’s made for data analysis! As it’s been designed with this specific purpose in mind, it’ll allow you to spend more time analysing your data instead of having to grapple with the abstract aspects of programming languages. For example, R comes with built-in data sets like ‘mtcars’ that you can start analysing straight away. CSV files can be imported in one line by using ‘read.csv()’. You can create some (ugly) histograms using ‘hist()’. You can create scatterplots and more by using ‘plot()’. The point is, these are all built-in features of R which allow you to start analysing your data seconds after you have finished installing it!

I want to be clear: I’m not saying that you shouldn’t learn Python. I absolutely love Python and it is my language of choice! You should absolutely learn Python. Just don’t learn it now. My ulterior motive is that by getting you to learn R first, you will be able to experience some coding-related successes right away. The hope is that these small successes will keep you, the coding student, motivated to continue to pursue the rewarding craft that is ‘coding for data’.

There’s so much I could learn about these languages! Which aspects should I focus on?

We want to be pragmatic here and focus our energy on learning things that we’re likely to use in our jobs as analysts. To come up with our list of things to focus on, let’s follow this simple recipe:

  1. Make a list of all the spreadsheet-based reports you update on a regular basis. Add to that list all the ad hoc pieces of analysis that you’ve performed using spreadsheets over the last six months.
  2. Take a five-minute break because this can be tiring!
  3. Open one of the spreadsheets on your list.
  4. Set a timer for ten-minutes.
  5. In a second list, take note of some of the formulas you’ve used in the report. Take note of the visualisations you’ve created. Also, take note of any numbers calculated using pivot tables. Be as broad as possible! There’s no need to be specific about how you performed a ‘VLOOKUP’ using exact matches in column B, while returning the values in column F. Just take note of the fact that you performed a ‘VLOOKUP’. Next to each formula, keep a tally of how many times you’ve encountered this formula across your spreadsheets. Keep going until you run out of time or move onto the next spreadsheet if you finish before the ten minutes are up.
  6. Take a five-minute break and move onto the next spreadsheet on your list.

Keep working through the list until you’ve had enough! Sort your list in descending order by the number of times each formula/visualisation/pivot table value appeared across your reports and analysis. This is the order of priority in which you should conduct your learning!

How should I go about learning these things?

We will divide up our learning into two, twenty-five-minute sessions (two Pomodoros for those Pomodoro Technique practitioners). Complete these two sessions before work! We are all weak after our days of hard work. The later we leave our sessions in the day, the less likely it is that we will complete them at all.

For the first twenty-five minutes, we focus on learning R:

  • Take the first thing on the list and learn how to do that thing in R. Prioritise learning how to do it using the friendly dplyr package. If you can’t find out how to do it using dplyr, then broaden your search to look for how to do that thing using R in general.
  • If you have any data sets from work you could use, then use them. If not, use the built-in R data sets or any open-source data sets that you can find online.
  • Once your twenty-five minutes are over, ask yourself whether you can apply this skill fluently. If you can, cross this first skill of your list. You can move onto the next item on the list tomorrow. If you struggled, we will continue practicing the skill tomorrow.

For the second twenty-five minute session, we focus on learning SQL:

  • Work through the free SQLZOO course.
  • Once you’ve completed the course, import some data into this online SQL environment and work on aggregating and joining your tables. Be sure not to upload any work-related data!
  • Once you can fluently aggregate and join tables, learn to apply some window functions.
  • Once you can do all of this fluently, you can stop learning SQL. Replace the twenty-five-minute SQL session with a twenty-five-minute R session.

As you become fluent in R and SQL, look for ways in which you can start using your new skills at work. Feel how much more power you have over your data now that you know how to code!

I’ve completed my list! What should I do now?

Don’t get stuck doing the basic things. Keep pushing yourself. There is more to life than pulling lists of data and running reports!

Start learning some fancier things. For example, work through this free book by Garrett Grolemund and R legend Hadley Wickham. Find out what Kaggle is. Subscribe to R-bloggers and learn from your fellow R users.

Work on R daily for a solid six months. Once you feel like you are fluent in R, start thinking about learning Python. Now that you have two programming languages under your belt, your move to Python will be much easier!

Conclusion

Watching the colleagues that I have mentored grow from spreadsheet data analysts to coding data analysts have been some of the most rewarding moments in my career so far. Sadly, I can’t be there to personally guide you through your transformation! I hope that this guide gives you enough information for you to take your first step towards levelling up your skills to becoming a more powerful data analyst.

You can do it!

Justin