Architecture

Proudly presented by Team Vector

YouTube Recommendation Architecture Diagram

          To provide users with video suggestions that are “reasonably recent and fresh, as well as diverse and relevant to the user’s recent actions” (Davidson et al. 294), the YouTube recommendation system has two major parts. The first is a candidate generation network that takes the same input as the overall recommendation system and produces as output a subset of videos the user is likely to watch. The second is a ranking network that takes the candidate generation network’s output as its input and generates video recommendations to the user as output (Covington et al. 2).

          Data collected on videos and users serves as the input of YouTube’s recommendation system. Video information collected includes:

  • Content data (length, video quality, etc.) is inherent in the video stream. For each item, there is also data about the total number of views, rating, number of comments and shares, upload time, etc.
  • Video meta-data (title, description, etc.) is provided by the uploader but can be incomplete or erroneous.
  • All recommended videos are displayed with a thumbnail, which is a screengrab from the middle of the video. This helps users decide quickly whether they are interested in a video or not.

          User data collected includes:

  • Explicit user activity data is generated when the user performs specific actions on the site (rating or liking a video, subscribing to an uploader).
  • Implicit user activity data (starting to watch a video, watching a large part of it) is generated asynchronously while the user is on the site. This data is partly based on pre-defined categories, since “watching a large part of a video” corresponds to a pre-defined length. It can be incomplete if a user closes her browser before it is collected.
  • Video relatedness score is a number that describes how often two videos were co-watched during a given time period, usually 24 hours (Lember).
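The relatedness score can be sketched as a simple co-watch counter over user sessions. This is a minimal illustration, not YouTube’s implementation; Davidson et al. additionally normalize the raw counts by the videos’ overall popularity.

```python
from collections import Counter
from itertools import combinations

def relatedness_scores(sessions):
    """Count how often each pair of videos was co-watched.
    Each session stands in for one user's watches in a 24h window."""
    counts = Counter()
    for watched in sessions:
        # unordered pairs of distinct videos watched in the same session
        for a, b in combinations(sorted(set(watched)), 2):
            counts[(a, b)] += 1
    return counts

sessions = [["v1", "v2", "v3"], ["v1", "v2"], ["v2", "v3"]]
scores = relatedness_scores(sessions)
# ("v1", "v2") and ("v2", "v3") were each co-watched twice, ("v1", "v3") once
```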

          Collected video data are transformed into high-dimensional embeddings over a fixed vocabulary and serve as inputs to a feedforward neural network (Covington et al. 3). Using a fixed number of inputs per user improves live serving performance (Covington’s video). A user’s watch history is regarded as a “variable-length sequence of sparse video IDs” and is mapped via embeddings into “a dense vector” representation (Covington et al. 3). Search history is similarly tokenized into unigrams and bigrams, which are then embedded; averaging the embedded search and watch histories yields “a summarized dense search history” (Covington et al. 3). Because the embeddings are learned jointly with the deep neural network, continuous and categorical features can easily be combined (Covington et al. 3). For example, binary and continuous features such as the user’s gender, logged-in state, and age are concatenated after the averaged watch and search tokens; embedded geographic region and device information are also concatenated (Covington et al. 3).
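The embedding-and-averaging step can be sketched with NumPy. The tables, dimensions, and feature values below are illustrative stand-ins, not the paper’s actual vocabulary or embedding size:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8  # illustrative; the real embeddings are much larger

# Hypothetical embedding tables for a tiny video and search-token vocabulary.
video_emb = {vid: rng.normal(size=EMB_DIM) for vid in ["v1", "v2", "v3"]}
token_emb = {tok: rng.normal(size=EMB_DIM) for tok in ["cat", "funny cat"]}

def average_embeddings(ids, table):
    """Map a variable-length ID sequence to one fixed-size dense vector
    by averaging the embedding of each ID."""
    return np.mean([table[i] for i in ids], axis=0)

watch_vec = average_embeddings(["v1", "v3", "v2"], video_emb)   # watch history
search_vec = average_embeddings(["cat", "funny cat"], token_emb)  # search history

# Binary/continuous features (e.g. gender, logged-in state, age) are
# simply concatenated after the averaged embeddings.
extra = np.array([1.0, 0.0, 0.37])  # illustrative values
first_layer_input = np.concatenate([watch_vec, search_vec, extra])
```

The result is one fixed-width vector per user, which is what lets a feedforward network consume variable-length histories.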

         The concatenated embeddings of the user and video data feed the first layer of Rectified Linear Units (ReLU), a function that transforms the averaged embedded watch and search tokens into video recommendations tailored to the user’s interests (Covington et al. 4). To balance recommendation quality against the serving CPU budget, three more layers of ReLU process the embedded user and video data, with the number of units halved in each successive layer (Covington et al. 6). The processed information is split between training and serving, where training refines the results used at serving time. Serving as the output layer of the neural network, a softmax, a generalized version of logistic regression, processes the training data. Its output is a multinomial distribution over the video vocabulary that gives the probability of each video (Covington et al. 4). At serving time, a nearest neighbor index over these learned representations generates the N videos closest to the user we are trying to send recommendations to.

        The ranking subsystem shares a similar architecture with the candidate generation subsystem, except that (Covington et al. 5):

  • Additional video features not considered in the candidate generation system appear as embedding inputs, such as normalized values for time since last watch and the number of previous impressions (Covington et al. 6).
  • Weighted logistic regression under cross-entropy loss is used during training to “closely estimate expected watch time” (Covington et al. 7).

        A linear combination of video quality, user specificity, and diversification is computed through this weighted logistic regression, which generates the expected watch time. This determines the videos recommended to users, with the videos with higher estimated watch times ranked first.
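At serving time, the ranking network scores each candidate with the exponentiated logit e^(w·x + b); because positive training examples are weighted by watch time, these odds approximate the expected watch time. A minimal sketch, with hypothetical weights and feature values:

```python
import math

def expected_watch_time(weights, features, bias):
    """Score one candidate as e^(w·x + b), the odds learned by
    weighted logistic regression, approximating expected watch time."""
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return math.exp(logit)

# Illustrative learned weights and candidate features (not from the paper),
# e.g. quality, time since last watch, relevance to the user.
weights = [0.8, -0.3, 0.5]
features = [1.0, 0.2, 0.5]
score = expected_watch_time(weights, features, bias=0.1)
# Candidates are then sorted by this score, highest first.
```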

References

Covington et al. “Deep Neural Networks for YouTube Recommendations.” Proceedings of the 10th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2016.

Davidson et al. “The YouTube Video Recommendation System.” Proceedings of the Fourth ACM Conference on Recommender Systems, ACM, 2010.

Lember, Ulleli. “YouTube-Recommender System.” 16 October 2014. http://htk.tlu.ee/recommendersystems/index.php/Youtube#Data_collected

Sage, Nathaniel Le. “What does it mean to say that a neural network ‘isn’t converging’?” Quora, 12 July 2016. https://www.quora.com/What-does-it-mean-to-say-that-a-neural-network-isnt-converging
