Welcome to the Expert Series. This is the second of three blogs highlighting experts at Kibo to give a deeper look into their areas of expertise. Today: Haowen Chan, Director of Data Science and Data Architecture, Kibo.
We hear about data science, predictive algorithms, and big data all the time. What is the difference between these terms?
Data science and big data are recent labels for modern industry activities that belong to long-established fields in computer science.
The concept of data mining, meaning the extraction of patterns and predictions from data, dates back as far as the 1960s. Likewise, machine learning, the discipline of designing systems that adapt themselves to data in order to achieve a specific task, has been studied for nearly as long. The term computer science itself has been in use since the 1950s.
In academia, these fields are highly diverse and draw from a wide variety of disciplines related to computer science, including statistics, information theory, optimization, and numerical methods.
Things changed as we entered the Internet age. As a population, we started doing more and more of our activities in electronic media, where data is much more easily collected and aggregated. Activities traditionally performed in physical space, such as shopping, interacting with friends, collaborating with colleagues, accessing information, and engaging with entertainment, all started becoming online activities.
Because of this, people started generating an increasingly significant volume of electronic traces of their activity. Information such as what you looked at online, who you know, and what you bought became a valuable source for new applications that could predict helpful things such as what films you should watch, who you should know, and what content on the web is most relevant to you.
As a response to the need to develop ever more sophisticated algorithms over increasingly large volumes of data, technologists in industry and academia designed systems to support general-purpose computations over massive volumes of data in an efficient, scalable, and reliable way. Highly scalable parallel computation systems like Hadoop and Spark enabled enterprises to implement complex machine learning or data mining algorithms over very large data corpuses that previously were feasible only for institutions with dedicated supercomputing capability.
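The map-shuffle-reduce pattern at the heart of frameworks like Hadoop can be sketched in plain Python. This is a simplified, single-machine illustration of word counting; real frameworks distribute the partitions and phases across a cluster:

```python
from collections import defaultdict

# A toy corpus split into partitions, as a distributed file system would store it.
partitions = [
    ["apple banana apple"],
    ["banana cherry", "apple cherry cherry"],
]

# Map phase: each partition independently emits (key, value) pairs.
def map_partition(lines):
    return [(word, 1) for line in lines for word in line.split()]

mapped = [pair for part in partitions for pair in map_partition(part)]

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values independently (and so in parallel).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'apple': 3, 'banana': 2, 'cherry': 3}
```

Because the map and reduce steps touch each partition and each key independently, the same program scales from one machine to thousands, which is what made complex algorithms over massive corpuses feasible outside supercomputing centers.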
This heralded the age of big data: the ability not just to collect, store, and retrieve vast amounts of information, but to perform the complex computations necessary to support modern applications of machine learning, data mining, and statistical modeling.
The broad adoption of new, highly parallel software tools opened a new field of scalable systems engineering alongside the established fields of statistical machine learning and data mining. Beyond the already-diverse skill set of machine learning, a practitioner now needed to be able to extract and organize data for storage in a software system, and also to analyze, visualize, and implement applications on the large volumes of data available. To distinguish this unique combination of skills from the more specialized machine learning researcher and the more general software engineer, people started self-identifying as data scientists. Today, the term data science tends to refer to industry practitioners performing data analysis or developing data-driven applications, building complex mathematical models using modern tools capable of handling the large volume and complexity of the data corpuses involved.
How are data science and big data changing the retail space?
I believe the fundamental definitions of engagement in retail are changing as a result of the increased instrumentation and intelligence of modern systems. We currently still have a lot of ground to cover in terms of simple facilitation: making it easy to find and order a specific product, increasing conversion rates, and so on. As catalogs and user bases grow, this problem will continue to be technically challenging. Technologies like search (using natural language, images, or inferring stylistic similarities) and personalized site experiences will get increasingly sophisticated.
However, the most advanced retailers are looking past that, toward concepts such as intelligent anticipation and inference of needs as well as direct engagement through social media and other online touchpoints. There are many potential applications of artificial intelligence (AI) in this space, for example intelligent sales or personal agents, and marketing agents designed to generate content specifically tailored to various online subcultures so that a brand can be more than one thing to more than one person. Retail has always chafed under the fact that advertising and sales interactions are crude and heavy-handed. The holy grail here is for brands to use technology to integrate directly into the lives of individuals in a positive way, so that the retailer or brand becomes almost a partner helping you achieve your goals.
What importance should retailers place on data science? Where is it best applied?
There are numerous applications for data in retail. Many of these fall under Kibo’s product use cases, for example: multi-channel personalization and product recommendations; catalog management; order management and optimization; and in-store and online data integration. In general, the challenges facing retailers in the data world fall into several categories depending on the maturity and capabilities of the retailer.
Data collection and instrumentation. This challenge typically manifests in smaller retailers who have basic eCommerce systems (essentially, catalogs and orders) but no additional sources of data with which to support the more complex use cases like personalization. For example, customer registries may track only previous orders and not site clickstream behavior. Perhaps marketing email open rates are not being monitored, and engagement with specific items in those emails is not recorded. Alternatively, perhaps the metadata in the product catalog is incomplete. In each of these cases, the goal is to add instrumentation to start gathering valuable data. Kibo, for example, has observer tags that can collect the clickstream of users on our eCommerce sites.
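As a rough illustration of what such instrumentation captures, a clickstream observer typically emits small structured events like the following. The field names here are hypothetical for illustration, not Kibo's actual tag schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional
import time

# Hypothetical clickstream event; real observer tags define their own schema.
@dataclass
class ClickEvent:
    session_id: str
    event_type: str            # e.g. "page_view", "add_to_cart", "email_open"
    item_id: Optional[str]     # the product involved, if any
    timestamp: float           # seconds since the epoch

# One event as it might be recorded when a shopper adds an item to the cart.
event = ClickEvent(
    session_id="sess-42",
    event_type="add_to_cart",
    item_id="sku-1001",
    timestamp=time.time(),
)
print(asdict(event))
```

Accumulating a stream of such events alongside the order registry is what makes behavioral use cases like personalization possible later.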
Data integration. This challenge faces retailers who have multiple systems generating data that are not integrated with each other, which manifests as data silos. For example, a customer registry may contain all customer orders while a separate clickstream tracker records click data; the data sits in two separate systems, and each powers an independent application with no view of the other. This limits the potential data applications the retailer can deploy and increases the cost of ownership, since each separate store of data must be independently maintained. Investing here is typically unattractive, as many retailers do not wish to develop data processing and maintenance as a core competency.
The solution is to have a single core system that collects all appropriate data from various subsystems and applies the appropriate cross references to identify data linkages from separate stores. The new data store is then available for general purpose data applications, as we will discuss below. The Kibo data store is an example of such a data hub; using the latest data integration techniques, we pull information from various sources onto our big data framework to feed our data-driven applications.
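In miniature, the cross-referencing step amounts to joining records from separate stores on a shared customer key. The sketch below uses in-memory dicts and hypothetical field names; a production data hub would perform the same join over a big data framework:

```python
# Two silos keyed by customer ID: an order registry and a clickstream store.
orders = {
    "cust-1": [{"order_id": "o-100", "total": 59.99}],
    "cust-2": [{"order_id": "o-101", "total": 12.50}],
}
clicks = {
    "cust-1": [{"item_id": "sku-7", "event": "page_view"}],
    "cust-3": [{"item_id": "sku-9", "event": "page_view"}],
}

# Full outer join: one integrated record per customer, linking both silos.
customer_ids = set(orders) | set(clicks)
integrated = {
    cid: {"orders": orders.get(cid, []), "clicks": clicks.get(cid, [])}
    for cid in customer_ids
}
print(integrated["cust-1"])
```

Note that the integrated view keeps customers who appear in only one silo (here, cust-2 and cust-3), so downstream applications see every customer regardless of which subsystem first observed them.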
Data applications. Once a retailer has achieved a high level of data integration, the challenge then becomes deriving maximum value from the data. The challenges here are twofold: (1) does your application suite support flexibly adapting its models and behavior to take advantage of various types of incoming data and (2) if you have an in-house data team with highly specialized domain knowledge of the specifics of your market, does your application framework support customization and interfaces that allow your data team to inject the results of their modeling and analysis without significant integration costs?
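The second requirement, letting an in-house data team inject its own models without significant integration cost, can be sketched as a simple plug-in interface. All names here are hypothetical; actual integration points vary by platform:

```python
from abc import ABC, abstractmethod

class Recommender(ABC):
    """Hypothetical plug-in interface a data application might expose."""
    @abstractmethod
    def score(self, customer_id: str, item_id: str) -> float:
        ...

class InHouseModel(Recommender):
    """A retailer's own model, injected without touching the core application."""
    def __init__(self, weights):
        # e.g. scores produced by the team's offline modeling and analysis
        self.weights = weights

    def score(self, customer_id, item_id):
        return self.weights.get((customer_id, item_id), 0.0)

def top_item(model: Recommender, customer_id, candidates):
    # The application depends only on the interface, not the model internals.
    return max(candidates, key=lambda item: model.score(customer_id, item))

model = InHouseModel({("cust-1", "sku-A"): 0.9, ("cust-1", "sku-B"): 0.4})
print(top_item(model, "cust-1", ["sku-A", "sku-B"]))  # sku-A
```

Because the application framework calls only the interface, the retailer's data team can swap in a model with specialized domain knowledge without rebuilding the surrounding application.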
Kibo’s solutions offer a high level of functionality in both of these areas, and we are continuously working to extend them.
How are you applying big data and data science to Kibo’s solutions?
Kibo’s big data architecture has been in production since 2011, making it one of the most mature deployments in the space. Of course, we have been constantly improving and rewriting the code, making it more streamlined, simpler to integrate with, and more efficient. Our systems use the latest technologies from open source, including Hadoop, Spark, and Mahout, and we combine these with proprietary high-performance libraries to perform fast inference on our RTX servers.
This extensible and well-maintained framework allows our data scientists to rapidly develop a wide variety of statistical models. Techniques used in our models include collaborative filtering, dimensionality reduction, many forms of regression, latent variable graphical models, reinforcement learning, and a wide variety of Bayesian inference methods.
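To give a flavor of the first technique, item-based collaborative filtering in its simplest form scores item pairs by the cosine similarity of their purchase vectors: items bought by the same customers score high. This is a minimal sketch on toy data, not a production implementation:

```python
from math import sqrt

# Rows: customers; columns: items; 1 = purchased. (Toy data.)
purchases = {
    "cust-1": {"sku-A": 1, "sku-B": 1},
    "cust-2": {"sku-A": 1, "sku-C": 1},
    "cust-3": {"sku-B": 1, "sku-C": 1},
    "cust-4": {"sku-A": 1, "sku-B": 1},
}

def item_vector(item):
    # The item's column of the customer-item matrix.
    return [row.get(item, 0) for row in purchases.values()]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# sku-A and sku-B are co-purchased by cust-1 and cust-4, so they score high.
sim = cosine(item_vector("sku-A"), item_vector("sku-B"))
print(round(sim, 3))  # 0.667
```

A recommender built on this idea suggests, for each item a customer views or buys, the items with the highest similarity scores.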
These models drive much of the logic of personalization behind Kibo’s RTI product. As we continue to extend the architecture, we will bring the power of data science to all of Kibo’s product suite, improving the intelligence behind order management and fulfillment, as well as all tasks associated with the management of eCommerce. We’re working on creating a next-generation intelligent, optimized, and adaptive eCommerce suite. Stay tuned for more as we unveil our new capabilities in the coming year!
For more information on personalization and real-time individualization, download The Ultimate Guide to Personalization.