2 What is R Programming?

Before getting started in R, you need to know what R programming is and what’s so special about it.

R is a programming language built by statisticians for statistical analysis. It is well suited to running advanced analyses, building predictive models, machine learning, and even routine data management.

R is open-source and free to use. The people who developed it wanted to make it accessible to anyone. Since then, R has moved beyond its humble beginnings and grown a full-blown community of developers.

Those developers have expanded the R programming language to allow more advanced machine learning techniques, application development, website development, and even the publishing of digital books, which is how you’re reading this book now.

This expansion is possible because of packages. Packages allow individual developers and teams to share new functionality with the broader community.
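Here is a minimal sketch of what using a package looks like in practice. The package name in the commented-out install line is only an illustration of mine, and the file name is made up. I load the tools package because it ships with R, so the example runs without downloading anything.

```r
# One-time step: download a package from CRAN (needs an internet connection).
# "dplyr" is only an illustrative name; swap in whichever package you need.
# install.packages("dplyr")

# Every session: load a package so its functions are available.
# "tools" ships with R itself, so this line works on a fresh installation.
library(tools)

# file_ext() comes from the tools package and returns a file's extension.
# "survey_results.csv" is a hypothetical file name; the file need not exist.
ext <- file_ext("survey_results.csv")
print(ext)
```

Installing happens once per computer, while loading with library() happens in every new R session where you want to use the package.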

And that, to me, is the best thing about the R programming language. There’s a vast community, filled with individuals, teams, and whole organizations, who regularly contribute to and improve R.

This provides us with multiple options for accomplishing our goals. If we find a package confusing or hard to use, there’s a good possibility of finding another that’s easier and more intuitive.

If you’re wondering if there’s a catch, here it is: you have to learn the basics first.

One of those basic concepts is objects. We’ll cover this extensively in another chapter, but R is built around objects: nearly everything you create or use in it is one. That means to succeed with it, you have to understand the various object types and how they’re used.
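To give a rough idea of what object types look like, here is a small sketch. The variable names and values are made up for illustration, and these are only a few of the types you'll meet later.

```r
# A few common object types, created with made-up values.
ages        <- c(34, 27, 45)            # numeric vector
respondents <- c("Ann", "Bob", "Cy")    # character vector
is_member   <- c(TRUE, FALSE, TRUE)     # logical vector

# A data frame combines equal-length vectors into a table.
survey <- data.frame(name = respondents, age = ages, member = is_member)

# class() reports what kind of object you are holding.
class(ages)         # "numeric"
class(respondents)  # "character"
class(is_member)    # "logical"
class(survey)       # "data.frame"

# The type determines what you can do: mean() works on numbers, not on names.
mean(ages)
```

Checking an object with class() is a handy habit whenever a function gives you an error you don't understand.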

If that sounds overwhelming, don’t worry. It’s easier than you think. And focusing on these core components will go a long way to helping you succeed in R. You’ll find these key principles show up again and again throughout your R journey.

2.1 Is R the Best Statistical Programming Language?

That’s a matter of debate, and boy do people debate it. The great thing about R is that, unlike many other open-source programming languages used in data science, it was originally built for statistics.

That makes the default functionality well suited to statisticians and researchers alike. It’s quite possible that, as a researcher or analyst, you could produce the analysis you need without compiling a large list of extra packages.
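As a quick illustration of that default functionality, the sketch below runs a summary, a linear regression, and a t-test using nothing but what comes with a fresh R installation. The mtcars dataset is a small example dataset bundled with R. The choice of variables is mine, purely for demonstration.

```r
# mtcars is an example dataset that ships with R; no packages are loaded here.

# Descriptive statistics for miles per gallon.
summary(mtcars$mpg)

# Linear regression: how does fuel economy change with vehicle weight?
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Welch two-sample t-test: does mpg differ between automatic (am = 0)
# and manual (am = 1) transmissions?
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value
```

Functions like summary(), lm(), and t.test() are available as soon as you open R, which is what I mean by not needing a long list of extra packages.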

The two big competitors to R that I hear about most often are SAS and Python.

SAS is a statistical programming language built by a for-profit company. It’s proprietary, the opposite of open-source, which means you have to pay to use it. That’s not necessarily a bad thing though. It means there’s more rigorous quality testing because that’s what people expect from software that costs money. Some R packages may not be as rigorously quality checked.

SAS Institute (the company that makes SAS) has a lot of market power too. It has ingrained itself into academia and pharmaceutical research. Even though R and Python are the preferred choices for most newer companies and younger professionals, SAS has the advantage of being legacy code.

You’re probably not familiar with that term if you haven’t ever programmed before. What it means in the practical sense is that even if R and Python were better alternatives to SAS, it would cost an organization a lot of money and time to go back and re-write all their automated, enterprise-wide code.

Python is a more interesting competitor of R, mostly because it wasn’t built for statistical programming. Like R, it’s open-source and that has led to data scientists building packages that make statistical programming possible.

Python is probably more popular with data scientists than researchers and statisticians. Since data scientists are often joined at the hip with computer programmers, it naturally lends itself more to that work.

I once worked with a company that ran all of its ETL (extract, transform, and load) processes off of Python. Since the data engineers and architects already used Python for that purpose, it wasn’t difficult for them to learn to use it for data science as well.

Is Python better than R? I don’t know because I haven’t programmed much in Python. In my opinion, R is easier to learn for people who’ve already programmed in SQL, since it has a more direct syntax. Other people say that Python is easier to learn for programming newbies because it’s written more like the English language. I didn’t find that to be the case, myself.

Some people also say that Python is better for enterprise-level or “production” grade solutions. I don’t really believe that either. Since I started writing this book, I’ve helped administer a few data science tools for clients. I’ve learned through experience that Python is a tough programming language to manage from an administrative perspective. There are lots of little rules and nuances that depend on the operating system you use and on how you install Python versions, packages, and environments.

For those reasons, I suggest learning R. It’s easier to download, easier to manage, and easier to learn in my experience – especially for individuals.

2.2 Things to Remember

  • R is free and allows for advanced statistical programming
  • R is managed by a community of developers, who contribute to its growth through packages
  • You must understand object types to succeed with R
  • Picking R over Python (or vice versa) probably won’t have a big impact on career opportunities