Taking a closer look at data skillsets
We recently published a post including our take on the Data Scientist Development Pathway. And while that post did highlight various skills across a range of data science-focused roles, we thought it would be worth making some more specific call-outs on the types of tools, languages, techniques, and methods data professionals commonly leverage.
Similar to our previous post, we have looked to separate these skills according to 'seniority' of position. But you may note we have also included a 'Data Analytics Foundation' layer. The more traditional Data Analyst or Business Intelligence Analyst roles may relate well to this layer, as may the data analytics skills and expertise of domain experts (those who leverage data, but do not exclusively hold a data analytics role). We believe it's an important inclusion, as it helps frame the skill progression for domain experts seeking growth towards a Citizen Data Scientist role, or for Data Analysts/Business Intelligence Analysts seeking a pathway towards becoming a Data Scientist.
Keep in mind that the same caveats apply as highlighted in our previous post. Namely, this framing is not intended to be a one-size-fits-all model, and organizations will need to carefully define roles and skills suited to their business.
For each layer, we have grouped skills across four areas: Data Analysis Tools and Languages; Data Analysis Techniques; Data Management and Processing; and Data Reporting and Visualization.

Data Analytics Foundation

Data Analysis Tools and Languages
Uses Microsoft Office applications - Writing formulas in Microsoft Excel, and making use of Excel's built-in data features.
May make use of dedicated business intelligence tools - Microsoft Power BI or Tableau.
Data Analysis Techniques

Able to segment, filter and represent data in informative ways - data views and filters, pivot tables, and making categorizations.
Employs common statistical techniques and measures - averaging, correlations, regression analysis, and interpreting data distributions.
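To make these measures concrete, here is a minimal sketch using Python (though at this level the same calculations are just as achievable with Excel formulas). The figures and column names are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Invented example data; column names and values are hypothetical.
df = pd.DataFrame({
    "ad_spend":   [100, 60, 210, 150, 80, 290],
    "units_sold": [12, 7, 22, 15, 9, 30],
})

print(df["units_sold"].mean())                # averaging
print(df["units_sold"].corr(df["ad_spend"]))  # Pearson correlation

# A simple linear regression fit (slope and intercept).
slope, intercept = np.polyfit(df["ad_spend"], df["units_sold"], 1)
print(slope, intercept)
```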
Data Management and Processing

Can work with common data file formats - delimited text files (comma-separated CSVs and tab-separated TSVs) as well as spreadsheets (XLSX).
May make use of standard SQL - SELECT, UPDATE, DELETE, and INSERT INTO.
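Here is a small, self-contained sketch of those standard SQL statements, run through Python's built-in sqlite3 module so it needs no database server. The table and values are invented for illustration:

```python
import sqlite3

# In-memory database keeps the example fully self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total REAL)")
cur.execute("INSERT INTO orders (region, total) VALUES (?, ?)", ("North", 120.50))
cur.execute("INSERT INTO orders (region, total) VALUES (?, ?)", ("South", 80.00))

cur.execute("UPDATE orders SET total = 90.00 WHERE region = ?", ("South",))
cur.execute("DELETE FROM orders WHERE total < ?", (100,))

for row in cur.execute("SELECT id, region, total FROM orders"):
    print(row)

conn.close()
```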
Data Reporting and Visualization

Able to generate data stories and visualizations using Microsoft Office applications - Using built-in charting features within Microsoft Excel and presenting information via Microsoft PowerPoint.
May use dedicated business intelligence tools to generate reports, snapshots and interactive dashboards - Microsoft Power BI or Tableau.
Citizen Data Scientist
Data Analysis Tools and Languages

Makes use of the organization's preferred analytics platform/tools - Particularly those supporting low-code or no-code usage, e.g. RapidMiner, SAS, KNIME, Orange, or Weka.
Can leverage existing notebook-format coding workflows - Consuming and making basic adaptations to workflows based in Jupyter Notebooks or other interactive coding reports/formats.
Some basic coding or scripting ability - Able to leverage Python/R base functionality to the extent that allows re-use and modification of existing coding workflows, including syntax familiarity, understanding of object types, module importing, list creation, and the use of loops.
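As a rough illustration of that level of coding ability, the short Python sketch below touches each of those basics; the variable names and values are invented:

```python
import math  # module importing

readings = [4.0, 9.0, 16.0]               # list creation
square_roots = []
for value in readings:                    # using a loop
    square_roots.append(math.sqrt(value))

print(type(readings))                     # <class 'list'>  (object types)
print(square_roots)                       # [2.0, 3.0, 4.0]
```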
Data Analysis Techniques

Builds data science workflows - Makes use of low-code or no-code analytics platforms to create and apply data science workflows within their domain area.
Leverages common Machine Learning techniques - Variable importance metrics, regression-based models, decision trees, logistic regression, and clustering methods.
Can consume, interpret and modify existing data science workflows as required - Feature and model substitution, assessing model performance, and diagnosing workflow errors or abnormal results.
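To give a feel for these techniques in code form, here is a minimal scikit-learn sketch that fits a decision tree, assesses its performance on held-out data, and reads off variable importance metrics. It uses a bundled example dataset so it is self-contained; the parameter values are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Assessing model performance on held-out data.
print(accuracy_score(y_test, model.predict(X_test)))

# Variable importance metrics, as mentioned above.
print(model.feature_importances_)
```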
Data Management and Processing

Can work with less common data file formats - object-based files (JavaScript Object Notation, JSON), or other proprietary file formats.
Uses more advanced SQL methods - JOINs and UNIONs, view materializations, and can make query optimizations.
May be able to make use of Python/R data processing libraries - data loading, filtering and transformations via Pandas (Python) or dplyr/tidyr (R).
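Here is a minimal Pandas sketch covering that kind of loading, filtering and transformation; the CSV content is held in memory so the example runs without external files, and the column names are invented:

```python
import io
import pandas as pd

# Small in-memory CSV; in practice this would be pd.read_csv("some_file.csv").
raw = io.StringIO("name,dept,salary\nAda,Eng,95000\nBob,Ops,61000\nCho,Eng,88000\n")
df = pd.read_csv(raw)

engineers = df[df["dept"] == "Eng"]            # filtering
df["salary_k"] = df["salary"] / 1000           # transformation
by_dept = df.groupby("dept")["salary"].mean()  # aggregation

print(engineers)
print(by_dept)
```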
Data Reporting and Visualization

Makes use of embedded reporting/visualization capability within the organization's preferred analytics workflow tools - RapidMiner Visualize Model, SAS BI/Reporting, or KNIME's built-in visualizations.
May make use of standard Python/R data visualization libraries - matplotlib (Python) or ggplot2 (R).
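A matplotlib chart at this level might look as simple as the sketch below; the categories and figures are invented for illustration:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10.2, 11.5, 9.8, 12.7]   # illustrative figures only

fig, ax = plt.subplots()
ax.bar(months, revenue)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($k)")
ax.set_title("Monthly revenue")
plt.show()
```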
Data Scientist
Data Analysis Tools and Languages

A much deeper knowledge of the organization's preferred analytics platform/tools - Able to create plug-ins, extensions, scripts and automations within platforms such as RapidMiner, SAS, KNIME, Orange, or Weka, for others, including Citizen Data Scientists, to use.
Good coding skills - Able to create custom functions and classes, handle exceptions, debug code, and has good familiarity with a preferred Integrated Development Environment (IDE) (a short sketch follows this list).
Broad familiarity with Python and/or R libraries relevant to data workflows - Pandas (Python), NumPy (Python), scikit-learn (Python), dplyr/tidyr (R), and caret (R).
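As a small sketch of the coding skills item above, here is a self-contained Python example showing a custom class with input validation and exception handling; the class and its names are entirely hypothetical:

```python
class RangeScaler:
    """Scales numeric values into the 0-1 range."""

    def __init__(self, low, high):
        if high <= low:
            raise ValueError("high must be greater than low")
        self.low, self.high = low, high

    def scale(self, value):
        return (value - self.low) / (self.high - self.low)


try:
    scaler = RangeScaler(0, 100)
    print(scaler.scale(42))    # 0.42
except ValueError as err:      # exception handling
    print(f"Bad configuration: {err}")
```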
Data Analysis Techniques

Can leverage a wider range of Machine Learning techniques - Random Forests, Support Vector Machines, standard Neural Networks, nearest neighbors, and dimensionality reduction methods.
Good understanding of parameter option ranges for Machine Learning techniques, including trade-offs in parameter choices and values - splitter functions, kernel choices, maximum tree depths, layer counts, random states, etc.
Can build and optimize Machine Learning workflows using automated means - hyperparameter optimization, cross-validation methods and feature selection algorithms.
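Combining several of the ideas above, here is a minimal scikit-learn sketch that wraps a Random Forest in an automated, cross-validated hyperparameter search; the grid values are illustrative rather than recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hyperparameter grid; values chosen purely for illustration.
grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)   # mean cross-validated accuracy
```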
Data Management and Processing

Has a good understanding of Python/R data processing libraries - method chaining, lambda functions, and groupby operations via Pandas (Python) or dplyr/tidyr (R) (see the sketch after this list).
Can likely make use of NoSQL database systems and tools - key queries, graph traversals, and geospatial queries via MongoDB, Redis or other dedicated NoSQL cloud-based services.
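Here is a brief Pandas sketch of that method-chaining style, including a lambda and a groupby aggregation; the data is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["Eng", "Ops", "Eng", "Ops"],
    "salary": [95000, 61000, 88000, 64000],
})

# Method chaining: a lambda-based derived column, then a groupby aggregation.
summary = (
    df.assign(salary_k=lambda d: d["salary"] / 1000)
      .groupby("dept")["salary_k"]
      .mean()
)
print(summary)
```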
Senior Data Scientist
Data Analysis Tools and Languages

Advanced coding skills - Able to create powerful code-based workflows following an object-oriented approach, whilst targeting code optimizations, making use of vectorization, multiprocessing, decorators, and C extensions such as those built with Cython or the Python C API (a short vectorization sketch follows this list).
Good grasp of some of the deeper and more complex Python and/or R libraries - Able to leverage libraries such as Dask (Python), PySpark (Python), and TensorFlow (Python/R) to tackle complex data science tasks and work with extremely large datasets.
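To illustrate one of those optimization techniques, the sketch below times a plain Python loop against its vectorized NumPy equivalent; exact timings will vary by machine, but the vectorized form is typically orders of magnitude faster:

```python
import time
import numpy as np

values = np.random.rand(1_000_000)

# Plain Python loop computing a sum of squares.
start = time.perf_counter()
total = 0.0
for v in values:
    total += v * v
loop_time = time.perf_counter() - start

# Vectorized equivalent; NumPy pushes the loop into compiled C code.
start = time.perf_counter()
total_vec = np.dot(values, values)
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s, vectorized: {vec_time:.4f}s")
```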
Data Analysis Techniques

Uses more advanced learning methods to draw insights from larger and more complex data sets - Deep Learning (Convolutional/Recurrent/Recursive Neural Networks, auto-encoders, Long Short-Term Memory networks), or Reinforcement Learning methods.
Can make use of niche methods for specific problem types - Natural Language Processing (tokenization, sentiment analysis, and topic modelling), time-series analysis, image classification or audio signal processing.
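As one hedged example of such a niche method, here is a tiny scikit-learn sentiment-classification sketch: text is tokenized into bag-of-words features, then a logistic regression is fitted. The corpus and labels are invented, and a real pipeline would of course use far more data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hand-labelled corpus; purely illustrative.
texts = ["great product, loved it", "terrible, waste of money",
         "really happy with this", "awful experience, very poor"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

# Tokenization and bag-of-words feature extraction.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression().fit(X, labels)
print(model.predict(vectorizer.transform(["poor product"])))  # likely [0]
```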
Data Management and Processing

Custom and targeted use of distributed data processing libraries and systems - Able to leverage libraries such as Spark/PySpark (Python) or Dask (Python) to handle computationally challenging and scalable problems (a minimal sketch follows this list).
Makes use of big-data platforms and tools - Leverages scalable compute features within systems such as Google Bigtable, Amazon DynamoDB, or Azure Cosmos DB to tackle manipulation, transformation and analysis of large data arrays.
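Here is a minimal Dask sketch showing the lazy, partitioned execution model that makes these libraries scale. It uses Dask's built-in synthetic dataset so it runs without any files; in practice something like dask.dataframe.read_csv("events-*.csv") would lazily load many file partitions in the same way:

```python
import dask.datasets

# A partitioned Dask DataFrame generated in memory for illustration.
df = dask.datasets.timeseries()

# Operations only build a lazy task graph...
daily_mean = df.groupby("name")["x"].mean()

# ...until .compute() triggers (potentially distributed) execution.
print(daily_mean.compute())
```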
Data Reporting and Visualization

May make use of scripting languages to make completely custom and interactive visualizations - D3.js, Chart.js or Highcharts.