The theme of this year’s ASCB|EMBO Meeting is Cell Biology for the 21st Century. So what skills are essential for a cell biologist to master in the 21st century? Depending on the project, you might answer “microscopy,” “gene editing,” or “primary cell culture.” But why don’t we think of “data analysis,” “statistics,” or “coding”? These computational skills are as important to the modern cell biologist as any wet lab technique, yet they can often be overlooked during training.
Many years ago, cell biology papers were mostly qualitative. Micrographs showed “representative cells” and Results sections were descriptive. Nowadays, this style is no longer welcome and observations must instead be measured and objectively assessed. And because technology has advanced, computing power has increased, and datasets have become more complex, modern cell biologists need sharper quantitative skills and some programming experience to continue to make important advances in the field.
I know what you are thinking: “He doesn’t mean me. He’s talking about computational cell biology.” Before you turn the page, I am talking about you. I’m talking about all of cell biology—because modern cell biology is computational. While some areas of cell biology—for example, biophysics, modeling, and large-scale systems projects—have used computational approaches for decades, nowadays all cell biology benefits from quantitative and computational methods.
The key skills to learn fall into three categories:
- Data analysis: getting information out of cell biological images or other data types
- Statistics: summary statistics and their correct usage, effective data visualization, null hypothesis significance testing, p-values, power analysis, and experimental design
- Coding (computer programming): automated data analysis, reproducibility, and version control
We’ll focus on coding for the rest of this article.
When incorporating coding into your experiments, the aim is to automate data analysis and to minimize any manual steps. This means that potential errors are reduced and your analysis becomes reproducible. Ideally, given your raw data and your code, anybody should be able to reproduce your results. Anybody includes the future you.
Many programming languages and software packages are available, but two excellent resources for cell biologists are ImageJ (available as the Fiji distribution, http://fiji.sc) and R (commonly used through the RStudio editor, https://rstudio.com). These programmable and extensible software packages are free, cross-platform, and open source. With them you can handle pretty much all the image analysis and number crunching you will need to do. What’s more, they are widely used, so getting help while you are learning to code is straightforward.
Automating Analysis Using Coding
Your current workflow might involve doing some manual steps in ImageJ, followed by copying and pasting the data into Microsoft Excel, typing in a formula or two, and then making a graph of the results. The alternative is to write a simple macro for the ImageJ steps and then have an R script that automatically processes the data and makes the graph (see Figure 1).
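To make the automated alternative concrete, here is a minimal sketch of the "process the data" step. It is written in Python purely for illustration; the same idea carries over directly to an R script. The column names (condition, cell, mean_intensity), the example values, and the function name summarize are all hypothetical, standing in for whatever table your ImageJ macro saves. The graph-making step is left out to keep the sketch short.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical stand-in for a results table saved by an ImageJ macro.
# The column names and values are invented for this example.
EXAMPLE_CSV = """condition,cell,mean_intensity
control,1,10.0
control,2,12.0
control,3,14.0
treated,1,20.0
treated,2,22.0
treated,3,27.0
"""


def summarize(csv_file, group_col="condition", value_col="mean_intensity"):
    """Return {group: (n, mean, sample standard deviation)} for one column.

    Each group needs at least two rows, because the sample standard
    deviation is undefined for a single value.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[group_col]].append(float(row[value_col]))
    return {
        name: (len(values), statistics.mean(values), statistics.stdev(values))
        for name, values in groups.items()
    }


# With real data you would open the saved file instead, for example:
#     with open("results.csv", newline="") as f:
#         summary = summarize(f)
summary = summarize(io.StringIO(EXAMPLE_CSV))
for name, (n, mean, sd) in summary.items():
    print(f"{name}: n={n}, mean={mean:.2f}, sd={sd:.2f}")
```

Because every step from the saved table to the summary lives in the script, rerunning the analysis on a new dataset means changing one file name rather than repeating the copying and pasting.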
The problem with the manual workflow is that it is error-prone and not reproducible, with the extra complication that those errors are very hard to spot. If that doesn’t convince you, think about the time taken. Let’s say that the manual workflow takes you one day per dataset, whereas writing the code takes almost two days (don’t worry, you will get faster!) and running it takes seconds. The manual workflow is faster for the first experiment, but when you get your next dataset and the one after, the automated version saves you time. Oh, you wanted to tweak something? No problem. Make the change to the code and click “go.” With the manual analysis you would need another day at the computer, per dataset.
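The time argument is simple arithmetic, and a few lines of code make it explicit. This is only a sketch under the rough figures given above: one day per dataset by hand, and a one-off cost of two days (rounding "almost two days" up) to write code that then runs in a negligible time. The function and variable names are made up for the illustration.

```python
def cumulative_days(n_datasets, setup_days, days_per_dataset):
    """Total time spent after analyzing n_datasets with a given workflow."""
    return setup_days + n_datasets * days_per_dataset


# Rough figures from the text; the run time of seconds is rounded to zero.
MANUAL = {"setup_days": 0.0, "days_per_dataset": 1.0}
AUTOMATED = {"setup_days": 2.0, "days_per_dataset": 0.0}


def first_dataset_where_automation_wins(max_datasets=100):
    """Smallest number of datasets at which the automated workflow
    has cost strictly less total time than the manual one."""
    for n in range(1, max_datasets + 1):
        if cumulative_days(n, **AUTOMATED) < cumulative_days(n, **MANUAL):
            return n
    return None


for n in range(1, 5):
    manual = cumulative_days(n, **MANUAL)
    automated = cumulative_days(n, **AUTOMATED)
    print(f"{n} dataset(s): manual {manual:.0f} d, automated {automated:.0f} d")

print("Automation ahead from dataset", first_dataset_where_automation_wins())
```

With these numbers the two workflows are level at the second dataset and the automated one is ahead from the third. Any tweak that forces a reanalysis widens the gap, because the manual cost is paid again for every dataset.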
If That Convinced You, How Do You Get Going?
There are so many ways to learn coding that there are no excuses. It requires effort, though, and you may find the learning curve steep. There are lots of free resources online, books in the library, or maybe short courses offered at your institution. Whatever works for you, just do it! In my view, what works best is learning by doing. So the next small analysis problem you have, set yourself the challenge of automating it as best you can. Start small and celebrate your first few lines of code that do what you want them to do. Remember the goal is not to become an expert computer programmer, just a competent one.
Don’t be afraid to ask for help. The best programmers in the world often have to Google the simplest commands. For image analysis, https://forum.image.sc is a fantastic resource for coding in ImageJ or a number of other packages. For R, rseek.org is a useful search engine, while questions tagged [r] on Stack Overflow are an invaluable resource. Finally, you can learn a lot by reading other people’s code and seeing how they tackled similar problems. GitHub and other code repositories are fantastic for this purpose.
Once you have written some code (and it works!), it is time to share it with the world—or at least with the rest of your research group. Having other people run your code is a great way to find bugs or limitations. You can then refine your code to make it more robust. Posting your code along with your preprint or final paper publication is now standard practice.
In the past, cell biologists shared only a summary of their results in their papers. In the 21st century, we will share our datasets and our code as well as our results in our publications. The direction of travel in science is toward transparency, openness, data sharing, and data reuse. Get on board: These are exciting times to be a cell biologist.
About the Author:
Stephen J. Royle is Professor of Quantitative Cell Biology at Warwick Medical School, UK. He is the author of The Digital Cell: Cell Biology as a Data Science published by Cold Spring Harbor Laboratory Press (ISBN 978-1-621822-78-3).