Getting Data into Production

Data scientists often seem to live in a world of experiments and assignments that are ends in themselves. This frequently comes as a surprise to new Data Scientists, who expect to make a difference only to be told “thank you, that was interesting, and here is your next task”.

To try to counter this and start some discussion, I would often ask: “If I give you the answer tomorrow, what will you do with it?” Often the answer would be “depends on the answer”, which at least was a start, but more usually it was simply “just tell me”. To some extent the question is unfair, as this is what happens to many business analysts; in this respect Data Scientists are no different. There is a question which needs an answer. Where Data Science can differentiate itself is in automation at speed. This leads on to a second question: “How can we use this in our day-to-day operations?” Here the emphasis is squarely on outcomes and action. Once again the usual answer was “ask IT”, but this hides the issue of how, where and when the result will have an impact.

“Data Science is not a traditional IT metric, it is more of a data journey. As we keep connecting ‘Things’ to the internet, new data evolves & so does the data journey”

For example: you have a contract with a customer to supply some service. The Data Science team has produced a model to predict whether customers are likely to leave. So we have scored customers on their likelihood of leaving. At what point do we consider them at risk? Who do we tell, how do we tell them, and when? Once we have told them, what are they supposed to do with this knowledge? How does it fit into any existing processes, which are usually embedded in an application? These applications are usually monolithic, encompassing numerous aspects of an enterprise, and are chosen as best of breed, which usually means there are multiple discrete applications within one enterprise. This creates further complexity, as each vendor will be trying to gain lock-in to their own universe by not letting data in or out.
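One way to force these questions into the open is to write the answers down as an explicit rule that sits between the model and the people who act on it. The sketch below is only an illustration of that idea: the threshold of 0.7, the `account_manager` recipient and the suggested action are all assumptions made up for this example, and a real organisation would have to agree its own.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cut-off for "at risk"; the business has to agree this value.
AT_RISK_THRESHOLD = 0.7


@dataclass
class ChurnAction:
    """Who is told about an at-risk customer and what they are asked to do."""
    customer_id: str
    notify: str   # hypothetical role name, not a real system identifier
    action: str   # hypothetical next step for that role


def decide_action(
    customer_id: str,
    churn_score: float,
    threshold: float = AT_RISK_THRESHOLD,
) -> Optional[ChurnAction]:
    """Turn a churn score (0 to 1) into an action, or None if no action is needed."""
    if not 0.0 <= churn_score <= 1.0:
        raise ValueError("churn_score must be between 0 and 1")
    if churn_score < threshold:
        return None  # below the agreed threshold: nobody is notified
    return ChurnAction(
        customer_id=customer_id,
        notify="account_manager",
        action="contact customer before contract renewal",
    )
```

Even a rule this small cannot be written without someone deciding the threshold, the recipient and the action, which is exactly the conversation that tends to be skipped.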

Not that IT can escape involvement either. Many IT organisations are unprepared for such assaults on their domain. To some extent producing output is quite simple: parallel IT, bimodal IT, or just use the cloud. What is more important is the wrapper that needs to be around it, such as monitoring, security and privacy. These are all perfectly valid concerns for IT to have, but often they are used as barriers to delay the inevitable outcome of IT changing faster than they are able to cope with. Equally, having two separate Data Science environments must be avoided; duplication is not a good state to be in. Development and production must be the same environment. That is to say, not necessarily a single environment, but a duplicate where objects can be transferred seamlessly, each connected to the same data store.

So in conclusion, whatever your starting point, how to put any findings into production must be considered from the start. It will affect the whole organisation: IT needs to provide the framework and architecture for Data Science to operate in, and the operational side needs to be able to consume outcomes with the minimum of additional work. Much of this is dependent on the stance taken by existing vendors within your application stack, but ultimately the enterprise is in control of its own path.
