and therefore much tighter in this application than with the CF-based ReCo. Typically the targeting model is recalibrated every 4-5 hours, as opposed to the 24-hour refresh interval used in ReCo.
The details of this mobile media model management environment blending automated modeling, real-time model feedback loops and RS technology are the subject of ongoing research. The ReCo project we have described has laid important groundwork as we investigate how to leverage automated modeling technology and big data to enable more effective service system management and engineering.
5.2. RS Generators: Mobile Media and “Big Data”
In order to provide RS as a software marketing service to clients, it is desirable to make this architecture as reusable and open as possible. As mentioned earlier, RS tend to be application-specific and proprietary in nature, and there is little attention in the literature to the generalization of these systems to enhance reusability. In this section, we sketch some ideas for developing an RS generator (RSG) environment which can be quickly adapted to specific application domains and requirements, and therefore be more readily employed as a mobile marketing service.
Perhaps the defining characteristic of the RS we are describing is the immense size of the underlying databases coupled with the very high degree of volatility these databases undergo. RS client databases will of course vary widely from company to company, but they can be transformed into a relatively simple and general schema, or meta-model, based upon the many-to-many relationship between users and items (Figure 18). However, due to the potentially billions of signals per day generated within the mobile advertising ecosystem, traditional dimensional data management models are ill-equipped to store and manage the large volume of data directed to the predictive models described above. Rather, it is necessary to resort to highly parallel and distributed computing techniques.
Our approach adopts a big data solution that uses the following dynamic distributed database technology:
• Apache Hadoop™ is an open-source software framework that supports data-intensive distributed applications. Hadoop implements a high degree of parallel computation via the Map Reduce framework, which divides an application into many small fragments of work, each of which may be run on any computer node in the network. Another feature of Hadoop is the Hadoop Distributed File System (HDFS), which distributes data across the data nodes of the network, complementing the distribution of the application. Hadoop enables applications to work in parallel with effectively unlimited numbers of computationally independent computers and extremely large volumes of data.
Figure 17. Automating customer targeting using econometric models.
Figure 18. Simplified user-item and user-attribute database schema for CF-based ReCo.
• Map Reduce is a framework for processing very large data sets using parallel, distributed algorithms across a large number of computer nodes known as a cluster. The overall Map Reduce infrastructure coordinates the distributed servers, managing parallel tasks, communications and data transfers. The key benefits of this framework are scalability and fault-tolerance across a variety of applications achieved by optimizing the execution engine once.
• Pig is a high-level platform for creating Map Reduce programs used with Hadoop. The language for this platform is called Pig Latin, which is essentially the Map Reduce equivalent of SQL in the RDBMS world. Pig allows programmers more latitude in designing queries than SQL does and is especially useful in the ETL (Extract, Transform and Load) operation, which converts data from source databases into the Hive data warehouse infrastructure described below.
• Apache Hive is a data warehouse infrastructure built on top of Hadoop that facilitates the familiar warehouse functions of data drill down and aggregation, standard database queries (via HiveQL), and analysis such as cross-tabulations. The schema shown in Figure 18 would be implemented in Hive in this environment.
Without delving further into the architectural details of the implementation, we can summarize our architecture as consisting of the Hive data warehouse infrastructure, the Map Reduce programming model, and the Pig platform for creating Map Reduce programs to perform ETL operations, data synthesis and feature selection on the very large numbers of data signals stored in log files on a Hadoop Distributed File System (HDFS). This approach allows us to preserve the integrity of the dimensional data model while minimizing the amount of rework necessary in other software components and services.
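The Map Reduce programming model summarized above can be illustrated with a minimal single-process sketch. It is not the ReCo implementation and does not use Hadoop; the tab-separated log format, the sample records, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions introduced only for illustration. The sketch aggregates raw signals into user-item counts of the kind a many-to-many user-item schema would hold.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # (user_id, item_id)

# Hypothetical log records: "user_id<TAB>item_id<TAB>event", one per line.
LOG_LINES = [
    "u1\ti1\tview",
    "u1\ti1\tpurchase",
    "u2\ti1\tview",
    "u2\ti2\tview",
    "u1\ti1\tview",
    "malformed line",
]


def map_phase(line: str) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((user, item), 1) for each well-formed log line."""
    parts = line.split("\t")
    if len(parts) == 3:
        user, item, _event = parts
        yield (user, item), 1


def shuffle(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, List[int]]:
    """Group emitted values by key, as the framework does between phases."""
    groups: Dict[Key, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: Key, values: List[int]) -> Tuple[Key, int]:
    """Reduce step: sum the counts collected for one (user, item) key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[Key, int]:
    """Run map, shuffle and reduce sequentially in a single process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


user_item_counts = run_job(LOG_LINES)
```

In a real cluster the map and reduce steps would run in parallel on many nodes, with the framework handling the shuffle, scheduling and fault tolerance; only the two user-supplied functions would carry application logic.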
In addition to the data dimension of an RSG, we must also consider the model management requirements for manipulating the data. Analytical models are typically much less volatile than data. However, the systems we are discussing are dynamic, data-driven feedback systems. First, RSGs must not only periodically update their underlying databases to align with the associated data sources, but also determine an optimal or near-optimal refresh time. Second, the associated analytical models must adapt quickly to changes in the environment. Adaptive modeling in this context may require the model to change in near real-time with the data, which in turn reflect customer activity in the mobile marketplace. Currently our ReCo recommendation refresh time is 24 hours, but situations demanding more stringent refresh intervals for RS are becoming more prevalent. As we indicate in our discussion of future extensions, this leads to a stronger requirement for automated modeling.
The analytical models in ReCo are straightforward applications of similarity measures and, as such, are fairly simple. However, they must be reapplied every time the databases are refreshed, and after a given refresh a different similarity measure may outperform the others, unlike in previous computations. A simple weighting scheme over the measures can then be calculated, which in turn may have a subtle effect on the resultant recommendation set. Because there was little difference between the three measures used in ReCo, we would expect a negligible effect in our case.
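One possible form of such a weighting scheme is sketched below. It is an assumption-laden illustration, not the scheme used in ReCo: the measure names, the performance figures, and the candidate scores are hypothetical, and every score is assumed to have been rescaled so that larger values mean greater similarity. Each measure is weighted in proportion to how well it performed after the latest refresh, and the per-item scores are blended accordingly.

```python
from typing import Dict, List


def measure_weights(performance: Dict[str, float]) -> Dict[str, float]:
    """Turn per-measure performance scores into weights that sum to 1."""
    total = sum(performance.values())
    if total <= 0:
        # No usable signal: fall back to equal weights.
        return {name: 1.0 / len(performance) for name in performance}
    return {name: score / total for name, score in performance.items()}


def blend_scores(
    scores_by_measure: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted sum of each item's score across all measures."""
    blended: Dict[str, float] = {}
    for name, item_scores in scores_by_measure.items():
        for item, score in item_scores.items():
            blended[item] = blended.get(item, 0.0) + weights[name] * score
    return blended


def rank(blended: Dict[str, float], top_n: int) -> List[str]:
    """Items ordered by blended score, ties broken by item name."""
    return sorted(blended, key=lambda item: (-blended[item], item))[:top_n]


# Hypothetical performance of three measures after a refresh
# (for example, hit rate on held-out purchases).
performance = {"cosine": 0.5, "euclidean": 0.3, "pearson": 0.2}

# Hypothetical similarity-based scores for two candidate items.
scores = {
    "cosine": {"A": 0.9, "B": 0.4},
    "euclidean": {"A": 0.6, "B": 0.8},
    "pearson": {"A": 0.5, "B": 0.7},
}

weights = measure_weights(performance)
blended = blend_scores(scores, weights)
recommendations = rank(blended, top_n=2)
```

When the measures perform almost identically, the weights are nearly equal and the blended ranking differs little from that of any single measure, which is consistent with the negligible effect expected for ReCo.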
As we indicated in the Introduction, there is a wide array of techniques that can be applied to RS including collaborative filtering, econometric models, and a portfolio of statistical clustering models (see Table 4). These are familiar methods frequently used in data mining so an integral component of an RSG is a model library of these predictive analytics, perhaps in the form of a library of reusable methods.
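A library of reusable methods could take many forms; the following is a minimal sketch of one, assuming a simple in-memory registry. The names `MODEL_LIBRARY`, `register` and `score`, and the two toy scoring methods, are hypothetical and are not taken from ReCo or from Table 4. The point is only that every method shares one calling convention, so an RSG can select a method by name at configuration time.

```python
from typing import Callable, Dict, Sequence

# A scorer compares two equal-length numeric vectors and returns a score.
Scorer = Callable[[Sequence[float], Sequence[float]], float]

# Hypothetical in-memory model library keyed by method name.
MODEL_LIBRARY: Dict[str, Scorer] = {}


def register(name: str) -> Callable[[Scorer], Scorer]:
    """Decorator that adds a scoring method to the library under `name`."""
    def decorator(fn: Scorer) -> Scorer:
        MODEL_LIBRARY[name] = fn
        return fn
    return decorator


@register("dot_product")
def dot_product(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of element-wise products of the two vectors."""
    return float(sum(a * b for a, b in zip(x, y)))


@register("co_occurrence")
def co_occurrence(x: Sequence[float], y: Sequence[float]) -> float:
    """Number of positions where both vectors are non-zero."""
    return float(sum(1 for a, b in zip(x, y) if a and b))


def score(name: str, x: Sequence[float], y: Sequence[float]) -> float:
    """Look up a method by name and apply it to the two vectors."""
    try:
        scorer = MODEL_LIBRARY[name]
    except KeyError:
        raise KeyError(f"unknown model: {name}") from None
    return scorer(x, y)
```

A client-specific system would then call, for example, `score("dot_product", user_a, user_b)`, and new methods could be added without changing the calling code.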
Display management is a more customized artifact conforming to client requirements and desires. Most RS provide simple ranked displays of recommendations ordered from top to bottom with hot links to the products themselves for user browsing. ReCo is similarly designed with its recommendations shown in scorecards ranked from top to bottom.
Table 4. Sample of predictive analytic models for RS model library.
We have outlined at a high level some steps we have taken to generalize RS architecture so that specific systems can be developed and deployed rapidly as a mobile marketing service for clients. This entails facilitating a high degree of parallel computation and distributed data management to deal with the very large size and high volatility of the related databases, as well as flexible decision analytics to address the tightly constrained feedback-driven nature of mobile applications.
6. Summary and Contribution
We have set out in this paper to introduce the concept of RS as a valuable marketing service. We have presented a case study of a collaborative filtering-based recommender system, ReCo, for increasing revenues, customer satisfaction and customer loyalty for a large telecommunications carrier. Although ReCo is a relatively simple system, the lift in purchases and revenues resulting from it in our analysis justifies its value as a marketing service.
We have presented the recommender system architecture used in developing ReCo, which we believe can be generalized to increase reusability from a software engineering perspective. Unique characteristics of this RS Generator approach are the centrality of automated modeling in combination with dynamic “near real-time” modeling feedback loops. We have indicated how components can be generalized into an RSG for quick development of specific applications which can be provided as a marketing service, especially to companies with large product lines and/or customer bases.
To bolster our case, we have additionally suggested how our approach can be tailored within the same framework using advanced adaptive modeling and feedback loops to provide powerful customer targeting services for mobile media advertising.
Our major contribution, then, is having shown how data-intensive but analytically simple CF-based ReCos can be engineered into valuable marketing service instruments, both for RS and for customer targeting purposes in the mobile media world.
Appendix: Similarity Measures Commonly Used in CF RS
Xi and Yi represent the vectors of either users or items being compared.
Euclidean distance similarity measures have a lower bound of 0, which indicates a perfect match, and no corresponding upper bound.
For non-negative vectors, cosine similarity values range between 0 and 1, indicating weak to strong similarity, respectively.
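The two measures described above can be sketched as follows, using their standard textbook definitions rather than any ReCo-specific code. The function names are illustrative, and returning 0 for an all-zero vector in the cosine case is a convention assumed here, since the measure is undefined there.

```python
import math
from typing import Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the summed squared differences; 0 means identical."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumed convention: an all-zero vector has no similarity to anything.
        return 0.0
    return dot / (norm_x * norm_y)
```

Identical vectors give a Euclidean distance of 0, and the distance grows without bound as the vectors diverge. For non-negative vectors such as purchase counts, the cosine value lies between 0 (no items in common) and 1 (same direction); vectors with negative components can yield values down to -1.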