Data science is an exciting field to work in, combining advanced statistical and quantitative skills with real programming ability . The aspiring data scientist might consider specializing in many potential programming languages.
While there is no correct answer, there are several things to consider. Your success as a data scientist will depend on many factors, including :
When it comes to advanced data science, you can only reinvent the wheel every time . Learn to master the different packages and modules offered in the language of your choice . The extent to which this is possible depends primarily on the domain-specific packages available to you!
A high-level data scientist will have good programming skills, as well as the ability to cut numbers. The bulk of the work daily in data science is devoted to research and processing of raw data or “data cleansing” . For that, no amount of sophisticated machine learning packages will help.
In the rapidly evolving world of business data science, there is a lot to be said about getting the job done quickly. However, this is what allows technical debt to creep in – and it is only through good practice that it can be minimized .
In some cases, optimizing the performance of your code is essential , especially when dealing with large volumes of critical data. Compiled languages are generally much faster than interpreted ones; Likewise, statically typed languages are considerably more reliable than dynamically typed languages. The obvious trade-off is against productivity.
To a certain extent, they can be seen as a couple of axes (specificity generality, performance-productivity). Each of the languages below falls somewhere on these spectra.
With these fundamentals in mind, let’s take a look at some of the most popular languages used in data science.
Released in 1995 as a direct descendant of the old S programming language, R has grown increasingly powerful . Written in C, Fortran and itself, the project is currently supported by the R Foundation for Statistical Informatics .
- Excellent range of high quality, domain specific and open source packages . R offers a package for almost every quantitative and statistical application imaginable. This includes neural networks, nonlinear regression, phylogenetics, advanced tracing, and many more.
- The basic installation includes very comprehensive and integrated statistical functions and methods. R also handles matrix algebra very well.
- Data visualization is a big plus with the use of libraries such as ggplot2 .
- Performance. There are not two ways, R is not a fast language .
- Domain specificity. R is great for statistics and data science. But less for general programming.
- Oddities. R has a few unusual features that might appeal to programmers experienced in other languages. For example: indexing from 1, use of several assignment operators, unconventional data structures.
R is a powerful language that excels in a wide variety of statistical and data visualization applications , and being open source allows for a very active community of contributors. Its recent growth in popularity is testament to its effectiveness.
Guido van Rossum introduced the Python language in 1991. It has since grown into an extremely popular, versatile language widely used by the information technology community. The major versions are currently 3.6 and 2.7 .
- Python is a very popular and very popular general purpose programming language. It offers a wide range of specially designed modules and community support. Many online services provide a Python API.
- Python is an easy language to learn. The low barrier to entry makes it an ideal mother tongue for programming beginners.
- Packages like pandas , scikit-learn, and Tensorflow make Python a solid option for advanced machine learning applications.
- Type Safety: Python is a dynamically typed language , which means you need to be extremely careful. Type errors (such as passing a string as an argument to a method that expects an integer) should be expected from time to time.
- For statistical and specific data analysis purposes, R’s wide range of packages gives it a slight edge over Python. For general purpose languages, there are faster and more secure alternatives to Python.
Python is a great choice of language for data science, and not just for beginners. Much of the data science process revolves around the ETL (extract-transform-load) process. This makes the generality of Python perfectly suited. Libraries like Google’s Tensorflow make Python a very interesting language to work in for machine learning.
SQL (‘Structured Query Language’) defines, manages and queries relational databases . The language first appeared in 1974 and has since undergone numerous implementations, but the basic principles remain the same.
Varies – some implementations are free, other owners
- Very efficient for querying , updating and handling relational databases.
- The declarative syntax makes SQL an often very readable language.
- The SQL language is widely used in many applications , which makes it a very useful language to know. Modules such as SQLAlchemy make it easy to integrate SQL with other languages.
- SQL’s analytical capabilities are rather limited – beyond aggregating and summing, counting and averaging data, your options are limited.
- For programmers coming from an imperative background, SQL declarative syntax can present a learning curve.
- There are many different implementations of SQL, such as PostgreSQL , SQLite , MariaDB . They are all different enough to make interoperability a headache.
SQL is more useful as a computer language than as an advanced analytical tool. Yet a large part of the data science process relies on ETL, and SQL’s longevity and efficiency prove that it is a very useful language for the data scientist to know.