A Survey on Modern Recommendation System based on Big Data

This survey provides an exhaustive exploration of the evolution and current state of recommendation systems, which have seen widespread integration in various web applications. It focuses on the advancement of personalized recommendation strategies for online products or services. We categorize recommendation techniques into four primary types: content-based, collaborative filtering-based, knowledge-based, and hybrid-based, each addressing unique scenarios. The survey offers a detailed examination of the historical context and the latest innovative approaches in recommendation systems, particularly those employing big data. Additionally, it identifies and discusses key challenges faced by modern recommendation systems, such as data sparsity, scalability issues, and the need for diversity in recommendations. The survey concludes by highlighting these challenges as potential areas for fruitful future research in the field.

1 Introduction

In this survey, we examine the escalating popularity and diverse application of recommendation systems in web applications, a topic extensively covered by Zhou et al. [1]. These systems, a specialized category of information filtering systems, are designed to predict user preferences for various items. They play a crucial role in guiding decision-making processes, such as purchasing decisions and music selections, as Wang et al. discuss [2]. A prime example of this application is Amazon’s personalized recommendation engine, which tailors each user’s homepage. Major companies like Amazon, YouTube, and Netflix employ these systems to enhance user experience and generate significant revenue, as noted by Adomavicius et al. and Omura et al. [3, 4]. Figure 1 from Entezari et al. [5] illustrates a modern recommendation system. Additionally, these systems are increasingly relevant in the field of human-computer interaction (HCI), where they enhance interaction efficiency through feedback mechanisms, a topic explored in several studies [6, 7, 8, 9].

Recommendation systems are particularly crucial for certain companies, as their efficiency can lead to substantial revenue generation and competitive advantage, as evidenced in the research by Rismanto et al. and Cui et al. [10, 11]. For instance, Netflix’s “Netflix Prize” challenge aimed to develop a recommender system surpassing their existing algorithm, with a substantial prize to incentivize innovation.


Furthermore, in the domain of big data, recommendation systems are highly prevalent, as detailed by Li et al. [12, 13]. These systems predict user interests in purchasing based on extensive data analysis, including purchase history, ratings, and reviews. There are four widely recognized types of recommendation systems, as identified by Numnonda [14]: content-based, collaborative filtering-based, knowledge-based, and hybrid-based, each with distinct advantages and drawbacks, as Xiao et al. elucidate [15]. For example, collaborative filtering-based systems may face issues such as data sparsity and scalability, as Huang et al. mention [16], and cold-start problems, while content-based systems might struggle to diversify user interests, as noted by Zhang et al. and Benouaret et al. [17, 18].

This paper is organized as follows: Section 2 provides a comprehensive review of both historical and modern state-of-the-art approaches in recommendation systems, coupled with an in-depth analysis of the latest advancements in the field. Section 3 discusses the challenges in big data-based recommendation systems, including sparsity, scalability, and diversity, and explores solutions for these challenges. The paper concludes with a summary in Section 4.

2 Recommendation Systems

Recommendation systems aim to predict users’ preferences for a certain item and provide personalized services [19]. This section discusses several commonly used recommendation methods, namely the content-based, collaborative filtering-based, knowledge-based, and hybrid-based methods.

2.1 Content-based Recommendation Systems

The main idea of content-based recommenders is to recommend items based on the similarity between different users or items [20]. The algorithm identifies the main common attributes of a particular user’s favorite items by analyzing the descriptions of those items, and stores these preferences in the user’s profile. It then recommends items that have a high degree of similarity with that profile. Content-based recommendation systems can capture the specific interests of the user and can recommend rare items that are of little interest to other users. However, since the feature representations of items are designed manually to a certain extent, this method requires a lot of domain knowledge. Moreover, content-based recommendation systems can only recommend based on users’ existing interests, so their ability to expand those interests is limited.
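The profile-and-match procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the item names, the three hand-crafted feature dimensions, and the helper names `build_profile` and `recommend` are all hypothetical, and averaging feature vectors is only one simple way to form a user profile.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_profile(liked_items, features):
    """Assumed profile model: the mean feature vector of the liked items."""
    dims = len(next(iter(features.values())))
    profile = [0.0] * dims
    for item in liked_items:
        for i, value in enumerate(features[item]):
            profile[i] += value / len(liked_items)
    return profile


def recommend(liked_items, features, top_n=2):
    """Rank unseen items by similarity to the user's profile."""
    profile = build_profile(liked_items, features)
    candidates = [item for item in features if item not in liked_items]
    ranked = sorted(candidates,
                    key=lambda item: cosine(profile, features[item]),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical hand-crafted features: [action, romance, sci-fi]
ITEM_FEATURES = {
    "film_a": [1.0, 0.0, 1.0],
    "film_b": [1.0, 0.0, 0.0],
    "film_c": [0.0, 1.0, 0.0],
    "film_d": [0.0, 0.0, 1.0],
}

print(recommend(["film_a", "film_b"], ITEM_FEATURES))  # → ['film_d', 'film_c']
```

Because the ranking depends only on the user's own profile, the romance title `film_c` is always ranked last here, which illustrates the limited ability to expand a user's existing interests.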


2.2 Collaborative Filtering-based Recommendation Systems

Collaborative Filtering-based (CF) methods are primarily used in big data processing platforms due to their parallelization characteristics [21]. The basic principle of the recommendation system based on collaborative filtering is shown in Fig. 2 [22]. CF recommendation systems use the behavior of a group of users to make recommendations to other users [23]. There are two main types of collaborative filtering techniques: user-based and item-based.

User-based CF: In a user-based CF recommendation system, users receive recommendations of products that similar users like [24]. Many metrics can be used to calculate the similarity between users or items, such as the Constrained Pearson Correlation coefficient (CPC), cosine similarity, and adjusted cosine similarity. For example, cosine similarity measures the similarity between two vectors. Let $x$ and $y$ denote two vectors; the cosine similarity between $x$ and $y$ can be represented by

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
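As a hedged illustration of how cosine similarity feeds a user-based prediction, the sketch below scores an unrated item as the similarity-weighted average of other users' ratings. The rating matrix, the convention that 0 means "unrated", and the function name `predict_user_based` are assumptions made for this example; production systems usually restrict the similarity to co-rated items and normalize for each user's mean rating.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def predict_user_based(ratings, target, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Assumption: a rating of 0 means the user has not rated the item.
    """
    num = den = 0.0
    for user, row in ratings.items():
        if user == target or row[item] == 0:
            continue
        sim = cosine_similarity(ratings[target], row)
        num += sim * row[item]
        den += abs(sim)
    return num / den if den else 0.0


# Hypothetical ratings: one row per user, one column per item.
RATINGS = {
    "alice": [5, 3, 0],
    "bob":   [5, 3, 4],
    "carol": [1, 5, 2],
}

# alice's tastes are closer to bob's, so the estimate leans toward bob's 4.
print(round(predict_user_based(RATINGS, "alice", 2), 2))
```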

Item-based CF: The item-based CF algorithm predicts user ratings for items based on item similarity. Generally, item-based CF yields better results than user-based CF because user-based CF suffers from sparsity and scalability issues. However, both user-based CF and item-based CF may suffer from cold-start problems [25].
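The item-based variant can be sketched in the same style: similarities are computed between item columns rather than user rows, and a prediction is the similarity-weighted average of the target user's own ratings. As before, the data, the 0-means-unrated convention, and the function names are assumptions for illustration only.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def item_vector(ratings, item):
    """Column of the rating matrix: every user's rating for one item."""
    return [row[item] for row in ratings.values()]


def predict_item_based(ratings, user, item):
    """Weight the user's own ratings by each rated item's similarity to `item`."""
    target_col = item_vector(ratings, item)
    num = den = 0.0
    for other, rating in enumerate(ratings[user]):
        if other == item or rating == 0:  # 0 is assumed to mean "unrated"
            continue
        sim = cosine(target_col, item_vector(ratings, other))
        num += sim * rating
        den += abs(sim)
    return num / den if den else 0.0


# Hypothetical ratings: one row per user, one column per item.
RATINGS = {
    "alice": [5, 3, 0],
    "bob":   [5, 3, 4],
    "carol": [1, 5, 2],
}

print(round(predict_item_based(RATINGS, "alice", 2), 2))

# Cold start: a user with no ratings gives the predictor nothing to weight.
NEW_USER = {"dave": [0, 0, 0], "bob": [5, 3, 4]}
print(predict_item_based(NEW_USER, "dave", 2))  # → 0.0
```

The second call shows the cold-start problem in miniature: with no rating history, the prediction falls back to the default value of 0.0.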


3.1 Big Data Processing Flow

Big data comes from many sources, and there are many methods to process it [55]. However, the primary processing of big data can be divided into four steps [56], and Fig. 4 presents the basic flow of big data processing.

Data Collection.

Data Processing and Integration. The collection terminal itself already has a data repository, but it cannot accurately analyze the data. The received information needs to be pre-processed [57].

Data Analysis. In this step, the initial data are analyzed in depth, typically using cloud computing technology [58].

Data Interpretation.


3.2 Modern Recommendation Systems based on the Big Data

The shortcomings of traditional recommendation systems lie mainly in insufficient scalability and parallelism [59]. For small-scale recommendation tasks, a single desktop computer is sufficient for data mining goals, and many techniques are designed for this type of problem [60].


However, for medium-scale recommendation systems, the rating data is usually so large that it is impossible to load all of it into memory at once [61]. Common solutions are based on parallel computing or collective mining: sampling and aggregating data from different sources, and using parallel programming to perform the mining process [62]. The big data processing framework relies on cluster computers with high-performance computing platforms [63]. At the same time, data mining tasks are deployed on a large number of computing nodes (i.e., clusters) by running parallel programming tools [64], such as MapReduce [52, 65]. For example, Fig. 5 illustrates the use of MapReduce in recommendation systems.
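To make the MapReduce pattern concrete, the sketch below simulates its three phases (map, shuffle, reduce) inside a single Python process to count how often two items appear in the same user's history, a common building block for item similarity. It does not use the Hadoop API, and the function names and sample data are hypothetical; on a real cluster, the map and reduce calls would run in parallel on different nodes and the framework would perform the shuffle.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(user, items):
    """Emit ((item_a, item_b), 1) for every item pair in one user's history."""
    for a, b in combinations(sorted(items), 2):
        yield (a, b), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts emitted for one item pair."""
    return key, sum(values)


def cooccurrence(user_items):
    """Run map, shuffle, and reduce in sequence over all users."""
    emitted = (kv
               for user, items in user_items.items()
               for kv in map_phase(user, items))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


# Hypothetical interaction logs: user -> items they interacted with.
USER_ITEMS = {
    "u1": ["book", "lamp", "pen"],
    "u2": ["book", "pen"],
    "u3": ["lamp", "pen"],
}

counts = cooccurrence(USER_ITEMS)
print(counts[("book", "pen")])  # → 2
```

Each mapper needs only one user's history and each reducer needs only one key's values, so neither phase requires the full rating data in memory at once.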

In recent years, various big data platforms have emerged [66]. For example, Hadoop and Spark [52], both developed by the Apache Software Foundation, are widely used open-source frameworks for big data architectures [52, 67]. Each framework contains an extensive ecosystem of open-source technologies that prepare, process, manage, and analyze big data sets [68]. Fig. 6 shows the ecosystem of Apache Hadoop [69].


Hadoop allows users to manage big data sets by enabling a network of computers (or “nodes”) to solve vast and intricate data problems. It is a highly scalable, cost-effective solution that stores and processes structured, semi-structured and unstructured data.

Spark is a data processing engine for big data sets. Like Hadoop, Spark splits up large tasks across different nodes. However, it tends to perform faster than Hadoop, and it uses random access memory (RAM) to cache and process data instead of a file system. This enables Spark to handle use cases that Hadoop cannot. The following are some benefits of the Spark framework:

It is a unified engine that supports SQL queries, streaming data, machine learning (ML), and graph processing.

It can be up to 100x faster than Hadoop for smaller workloads, through techniques such as in-memory processing and disk data storage.

It has APIs designed for ease of use when manipulating semi-structured data and transforming data.


Furthermore, Spark is fully compatible with the Hadoop ecosystem and works smoothly with the Hadoop Distributed File System (HDFS), Apache Hive, and others. Thus, when the data size is too big for Spark to handle in memory, Hadoop can help overcome that hurdle via its HDFS functionality. Fig. 7 shows a visual example of how Spark and Hadoop can work together. Fig. 8 shows the architecture of a modern recommendation system based on Spark.


4 Summary

Recommendation systems have become very popular in recent years and are used in various web applications. Modern recommendation systems aim at providing users with personalized recommendations of online products or services. Various recommendation techniques, such as content-based, collaborative filtering-based, knowledge-based, and hybrid-based recommendation systems, have been developed to fulfill the needs in different scenarios.

This paper presents a comprehensive review of historical and recent state-of-the-art recommendation approaches, followed by an in-depth analysis of groundbreaking advances in modern recommendation systems based on big data. Furthermore, this paper reviews the issues faced in modern recommendation systems such as sparsity, scalability, and diversity and illustrates how these challenges can be transformed into prolific future research avenues.

References