Datakick Analytics

Darryl Buswell March 9th, 2020


What is open source all about?

Even if you're just starting to dip your toe into data science, one term you may have already come across is 'open source'. And for good reason: open source development is a big part of what has spread technologies like the internet across the world.

But what does open source mean exactly? In short, open source solutions are those whose source code is shared openly and can be modified and improved on by the public. That means anyone (you, me, your neighbor) can make use of those solutions and, in many cases, can also take part in their development.

But how can we manage development of a project which is completely open? Typically, collaborative development of open source solutions is facilitated via version control and community platforms. While not always the case, many open source projects tend to be collaborated on via git, which is itself an open source version control system. These platforms allow submission, review, and approval of code changes from a huge number of collaborators.

If you want to see an example, then jump over to the GitHub repository for scikit-learn. There you will find thousands of developers collaborating on the development of one of the most powerful and fully featured frameworks for building machine learning models.

The advantages and disadvantages

So, what are the advantages of building your analytical solutions on top of open source tools?

First, open source tools are shared under licenses which comply with the Open Source Definition. In practice, that means they can be used, modified, and redistributed without license fees. Now, you may think that by being made freely available, such open source tools represent something that has little to no value. That is definitely not the case. In the data analytics domain, some of the most powerful tools and solutions have been built via an open source model. These solutions are also broadly used, even by organizations which have large pools of dedicated data science and software development resources.

So, what are the disadvantages? Primarily, there is a need to implement and enforce controls around 'how' you or your organization use these tools. Open source tools often see rapid development via their large collaborative base of developers, which means the methods, standards, or results from using them can change quickly and unexpectedly. Not only that, but it’s important for organizations to understand their rights in commercializing solutions built on open source technologies, as some licenses attach conditions to how derived solutions can be distributed, and some projects offer separate licenses for commercial use.

What open source analytical solutions should you be aware of?

There are a huge number to list here, and there can be a lot of subjectivity as to which open source tools are 'best'. That said, there are definitely some household names when it comes to open source for data science.

Firstly, there's a list of projects built on the Python programming language. scikit-learn, as noted above, is possibly one of the most noteworthy on the machine learning front. There is also pandas, which offers an extremely deep and powerful set of functions for data processing and analysis workflows; NumPy, which is incredibly handy for running scientific or mathematical functions in an optimized manner; and Matplotlib, which offers a deep library for building detailed, data-driven visualizations. And on the more pointy end of advanced analytics, there are Keras, PyTorch and TensorFlow, which make a very powerful set of machine learning and deep learning tools widely accessible.
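To give a feel for how a few of these libraries fit together, here is a minimal sketch, assuming pandas, NumPy, and scikit-learn are installed. It uses the small iris dataset that ships with scikit-learn. The choice of dataset, model, and variable names is purely illustrative, not a recommended workflow.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small example dataset bundled with scikit-learn as a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus a "target" column

# pandas: average each feature within each class.
summary = df.groupby("target").mean()

# NumPy: vectorised arithmetic to standardise the feature columns.
# (For brevity this uses the full dataset; in a real project the scaling
# would normally be fitted on the training split only.)
features = df.drop(columns="target").to_numpy()
standardised = (features - features.mean(axis=0)) / features.std(axis=0)

# scikit-learn: hold out a test set, fit a simple classifier, and score it.
X_train, X_test, y_train, y_test = train_test_split(
    standardised,
    df["target"],
    test_size=0.25,
    random_state=0,
    stratify=df["target"],
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(summary)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Each library handles one part of the job here: pandas for tabular summaries, NumPy for array maths, and scikit-learn for model fitting and evaluation.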

There’s also a heap to be aware of for those who prefer R as a programming language. dplyr and tidyr are essential packages for data processing work. ggplot2 offers fantastic data visualization functions. caret and e1071 will have you more than covered on the machine learning front. And as with Python, you can leverage interfaces for libraries like Keras and TensorFlow in R.

And there are also open source tools out there for those who don’t want to take a code-based approach. KNIME, for example, offers a serverless option of its platform with no licensing cost, which allows users to build analytical workflows in a drag-and-drop fashion. And Orange is great for more general data mining use.

Where should I start?

The Datakick Analytics team is well-versed in using open source tools to solve analytical problems. Please reach out, so we can help you leverage these tools.
