Data scientists & the data warehouse: team-building success

Whether you are a CIO, data architect, or a data management professional, it is imperative to understand the different approaches, attitudes and needs of the next generation of data warehouse consumers. Traditional data warehouse users include reporting teams, BI teams (who create reports for the rest of the company), statisticians and others. In the past few years, this has been rapidly changing with the new roles of data scientists, the rise of Data Enthusiasts and the burgeoning population of Accidental Analysts. In Part 1 of this series, we focus on successful collaboration between data scientists and data warehouse teams.

Data scientists have been with us for many years. However, the moniker “data scientist” is a recent change. The same role existed (and still exists) with titles such as statisticians, mathematicians, computer scientists or systems analysts; however, having one of these alternate titles doesn’t necessarily imply that one is a data scientist, although a wide range of techniques may be used by both groups.

Traditional training for people now in data science focused on theoretical methods, techniques and ideas, so many data scientists possess a PhD or Master’s degree. However, the field is rapidly changing. While there are many data scientists with PhDs or Master’s degrees, an advanced degree is not the key prerequisite. Instead, being an effective data scientist is about combining computer programming, systems management and a sophisticated approach to using data to solve complex problems. Many fields intersect in data science including statistics, math, physics, economics, computer science, operations research and even psychology and related social sciences.

Here are a few approaches to characterize the work of data scientists:

Definition from 2005
Data scientists are people who
- conduct creative inquiry and analysis
- perform successful management of digital data
- utilize methods for data visualization and information discovery
National Science Foundation

Definition from 2009
Data science is
Statistics +
data munging (management) +
visualization
Mike Driscoll
CEO of MetaMarkets

What is data science? My current thoughts
I would say that data scientists are distinguished by creating systems that automatically run and improve over time. This is definitely one area that they work in; however, it is not a complete description. Often, data scientists work in other areas such as data mining, forecasting, advanced analytics, data management, systems design and systems architecture.

To look at this new field from yet another perspective, here’s an alternate explanation:

[Figure: data science as one flow]

I have used this flow starting from both directions in a live presentation, and here’s how I explained it. Data scientists use discovery to model new approaches to problems. They combine systems and applications to integrate these new approaches into the business process. An example would be exploring people’s viewing behavior on a streaming video service. After exploring this data, I could model a set of preferences that could help them find additional content they might like. Using our systems on our video service, I can enhance or create a new application that integrates into the user experience.
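To make the streaming video example a little more concrete, here is a minimal sketch of the "explore, then model preferences" step. It assumes a simple co-viewing count (titles watched by the same people) as the preference model, which is only one of many possible approaches. The function names, user names and titles are all made up for illustration.

```python
from collections import Counter, defaultdict


def build_co_view_counts(histories):
    """Count how often each pair of titles shows up in the same viewer's history."""
    co_views = defaultdict(Counter)
    for titles in histories.values():
        unique = set(titles)
        for a in unique:
            for b in unique:
                if a != b:
                    co_views[a][b] += 1
    return co_views


def recommend(user, histories, co_views, top_n=3):
    """Suggest unseen titles that are most often co-viewed with what the user watched."""
    seen = set(histories.get(user, []))
    scores = Counter()
    for title in seen:
        for other, count in co_views.get(title, {}).items():
            if other not in seen:
                scores[other] += count
    # Sort by score (highest first), then by title so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


# Hypothetical viewing histories (illustrative data only).
histories = {
    "ana": ["Space Docs", "Robot Drama", "Baking Show"],
    "ben": ["Space Docs", "Robot Drama"],
    "cy": ["Space Docs", "Nature Series"],
    "dee": ["Robot Drama", "Baking Show"],
}

co_views = build_co_view_counts(histories)
print(recommend("ben", histories, co_views))
```

In a real service, the last step of the flow would be wiring something like `recommend` into the application that viewers actually use, which is where the systems and applications side of the work comes in.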

Data warehouse considerations
As far as the data warehouse goes, data scientists will likely use the data warehouse when possible, but will rarely request immediate help from the data warehouse team. In fact, they will use Python, Informatica, SAS or other tools and programming languages to bring together the data they need, wherever it may be. This does not mean that they would not appreciate help from the data warehouse team, but if your data warehouse systems lack the data they need for a project, it will rarely stop them! It will simply make their projects progress a bit slower as they go back to source systems and craft custom feeds from them.
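As a rough illustration of what "bringing the data together" can look like in Python, the sketch below joins a warehouse table with a custom feed from a source system. Everything here is an assumption for the sake of the example: an in-memory SQLite database stands in for the warehouse, a small CSV stands in for the source-system extract, and the table and column names (`dim_customer`, `support_tickets`) are hypothetical.

```python
import csv
import io
import sqlite3


def load_warehouse_customers(conn):
    """Read the customer attributes that already live in the warehouse."""
    rows = conn.execute("SELECT customer_id, region FROM dim_customer").fetchall()
    return {customer_id: region for customer_id, region in rows}


def combine(conn, feed_file):
    """Attach warehouse attributes to each row of a source-system CSV feed."""
    regions = load_warehouse_customers(conn)
    combined = []
    for row in csv.DictReader(feed_file):
        customer_id = int(row["customer_id"])
        combined.append({
            "customer_id": customer_id,
            # None marks customers the warehouse does not know about yet.
            "region": regions.get(customer_id),
            "support_tickets": int(row["support_tickets"]),
        })
    return combined


# Stand-in "warehouse" (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "EMEA"), (2, "APAC")])

# Stand-in source-system extract that is not in the warehouse.
feed = io.StringIO("customer_id,support_tickets\n1,4\n3,1\n")

combined = combine(conn, feed)
for record in combined:
    print(record)
```

The rows with a missing `region` are the kind of gap that slows a project down: the data scientist can keep going, but has to chase the missing attributes in another system.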

What data scientists need from the data warehouse team
Data scientists are quite happy to find that their primary data sources are already in the warehouse. Data sources not in the warehouse will be found in the source systems, in stealth department data marts, in Excel spreadsheets or at third parties. Data scientists would benefit greatly from access to, and training on, a solid data integration tool like Informatica, along with access to either a development instance or an isolated production ETL server. Barring that, tools like SAS and Alteryx could be a good addition to their open-source toolkits.

Work style
Unlike Accidental Analysts or Data Enthusiasts, data scientists often work on multi-week to multi-quarter projects. As a result, they may develop extensive data models on data not yet in the data warehouse, work which may be directly transferable or useful to future warehouse projects.

In general, data scientists appreciate the opportunity to share their current project approach, failings, logjams, objectives, tools and more. These are creative learners who may overwhelm a data warehouse team that, by necessity, must manage a smaller toolset in order to keep a reliable, manageable, well-tuned data pipeline. Unlike the work of the data warehouse team, the day-to-day work of data scientists is not tactically critical, so a day without their latest updates is likely not “do or die”. Once again, their proclivity to bring together multiple toolsets is both one of their biggest strengths and one of their biggest weaknesses in working with the warehouse team.

Missed opportunities
In contrast to data warehouse teams, data scientists often create production systems that aren’t necessarily targeted at broad access from business intelligence or analytic tools. Instead, their work often becomes part of production front-end systems like checkouts, product catalogs, risk scoring flows in loan applications, etc. Because of this, the two teams often miss a huge opportunity: integrating the data scientists’ analytic models BACK into the warehouse for easy access by all data consumers! Great data integration tools like Informatica will allow easy integration of their code from Python, Java, scripts, C++ and more into the warehouse’s daily or weekly updates.
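Here is a minimal sketch of what publishing model output back into the warehouse might look like as a step in a daily load. It is not how Informatica or any specific tool does it. It assumes an in-memory SQLite database as a stand-in warehouse, a toy scoring rule in place of a real model, and hypothetical table names (`fact_customer_activity`, `model_customer_score`).

```python
import sqlite3
from datetime import date


def score_customer(order_count, days_since_last_order):
    """Toy scoring rule standing in for a real analytic model."""
    return round(order_count / (1 + days_since_last_order), 4)


def publish_scores(conn, run_date):
    """Score every customer and store the results in a warehouse table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS model_customer_score (
               customer_id INTEGER,
               run_date TEXT,
               score REAL,
               PRIMARY KEY (customer_id, run_date)
           )"""
    )
    rows = conn.execute(
        "SELECT customer_id, order_count, days_since_last_order "
        "FROM fact_customer_activity"
    ).fetchall()
    scored = [(cid, run_date.isoformat(), score_customer(n, d)) for cid, n, d in rows]
    # INSERT OR REPLACE (SQLite syntax) keeps a rerun of the same day's load idempotent.
    conn.executemany("INSERT OR REPLACE INTO model_customer_score VALUES (?, ?, ?)", scored)
    conn.commit()
    return len(scored)


# Stand-in warehouse with made-up activity data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_customer_activity "
    "(customer_id INTEGER, order_count INTEGER, days_since_last_order INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_customer_activity VALUES (?, ?, ?)", [(1, 10, 4), (2, 3, 0)]
)

published = publish_scores(conn, date(2024, 1, 1))  # fixed date for a repeatable example
print(published, "scores published")
```

Once the scores sit in an ordinary table keyed by customer and run date, any BI or reporting tool that can already query the warehouse can use them without knowing anything about the model itself.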

So, while it is great to help the data scientists, they are often not the most important audience for data warehouse teams to target for assistance and support. However, they are kindred spirits, and the two teams can learn a great deal from each other.
