A Beginner-Friendly Notebook: Building Your First Recommendation Engine

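1. Preparation
Local environment: macOS
1.1. Installing R and
Step 1: Install R
Installer download address:
Click to download the free version (the free version is sufficient), marked with the red box in the figure below.
Before downloading, check that the version requirements match your system, otherwise you are likely to run into problems. My macOS version is 10.15, which meets the requirement, so check your own version when you download.
When the installation is finished, open the application, as shown below: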
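2. Building a Basic Recommendation Engine
2.1. Loading and formatting the data
Data download address:
The data has three columns: critic = user name, title = movie title, rating = rating. The ratings of the 6 users are listed below.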
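critic,title,rating
Jack Matthews,Lady in the Water,3.0
Jack Matthews,Snakes on a Plane,4.0
Jack Matthews,You Me and Dupree,3.5
Jack Matthews,Superman Returns,5.0
Jack Matthews,The Night Listener,3.0
Mick LaSalle,Lady in the Water,3.0
Mick LaSalle,Snakes on a Plane,4.0
Mick LaSalle,Just My Luck,2.0
Mick LaSalle,Superman Returns,3.0
Mick LaSalle,You Me and Dupree,2.0
Mick LaSalle,The Night Listener,3.0
Claudia Puig,Snakes on a Plane,3.5
Claudia Puig,Just My Luck,3.0
Claudia Puig,You Me and Dupree,2.5
Claudia Puig,Superman Returns,4.0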
Claudia Puig,The Night Listener,4.5
Lisa Rose,Lady in the Water,2.5
Lisa Rose,Snakes on a Plane,3.5
Lisa Rose,Just My Luck,3.0
Lisa Rose,Superman Returns,3.5
Lisa Rose,The Night Listener,3.0
Lisa Rose,You Me and Dupree,2.5
Toby,Snakes on a Plane,4.5
Toby,Superman Returns,4.0
Toby,You Me and Dupree,1.0
Gene Seymour,Lady in the Water,3.0
Gene Seymour,Snakes on a Plane,3.5
Gene Seymour,Just My Luck,1.5
Gene Seymour,Superman Returns,5.0
Gene Seymour,You Me and Dupree,3.5
Gene Seymour,The Night Listener,3.0
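Our goal is to build a recommendation engine that recommends movies a user has not yet seen, based on the ratings of similar users. The movie.csv file used here contains the data listed above.
Install the package we need, reshape2, which is a toolkit for reshaping data.
The following commands produce the reshaped data:
> movie_ratings = as.data.frame(acast(ratings,title~critic,value.var="rating"))
> View(movie_ratings)
The reshaped data is shown in the figure below.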
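To make the reshaping step concrete without the figure, here is a minimal sketch of the same long-to-wide pivot on a few rows. It uses base R's tapply() as a stand-in for reshape2's acast(), so it runs without extra packages; the small data frame `long` is made up from rows of the data set for illustration.

```r
# A few rows in long format: one row per (critic, title) pair.
long <- data.frame(
  critic = c("Toby", "Toby", "Lisa Rose", "Lisa Rose", "Lisa Rose"),
  title  = c("Snakes on a Plane", "Superman Returns",
             "Snakes on a Plane", "Superman Returns", "Just My Luck"),
  rating = c(4.5, 4.0, 3.5, 3.5, 3.0),
  stringsAsFactors = FALSE
)

# Pivot to wide format: rows are titles, columns are critics.
# Cells with no rating are filled with NA.
wide <- with(long, tapply(rating, list(title, critic), mean))
print(wide)
```

Each (title, critic) pair occurs at most once, so taking the mean simply returns the single rating. Toby has no rating for Just My Luck, so that cell is NA.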
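Our goal now is to recommend to each user, based on user similarity, movies they have not yet rated. For example, to recommend movies to Toby we rely on the ratings given by users who are similar to Toby.
2.2. Computing user similarity
> sim_users=cor(movie_ratings[,1:6],use="complete.obs")
> View(sim_users)
Computing user similarity from the ratings matrix gives the result shown in the figure below. The two users in the red box, Toby and Lisa Rose, are the closest, with the highest similarity of 0.99. Mick LaSalle comes next, with a similarity to Toby of 0.92.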
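As a check on the 0.99 figure, the sketch below recomputes the similarity between Toby and Lisa Rose by hand. cor() computes the Pearson correlation by default, and with use="complete.obs" only the movies rated by all six users enter the calculation. In this data set those are Snakes on a Plane, Superman Returns and You Me and Dupree. The helper name `pearson` is introduced here for illustration.

```r
# Ratings on the three movies that all six critics rated
# (Snakes on a Plane, Superman Returns, You Me and Dupree).
toby <- c(4.5, 4.0, 1.0)
lisa <- c(3.5, 3.5, 2.5)

# Pearson correlation written out: covariance over the product
# of the standard deviations (the 1/(n-1) factors cancel).
pearson <- function(a, b) {
  da <- a - mean(a)
  db <- b - mean(b)
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

r_manual  <- pearson(toby, lisa)
r_builtin <- cor(toby, lisa)
print(round(c(manual = r_manual, builtin = r_builtin), 4))  # both about 0.9912
```

With only three shared movies the correlation rests on very little data, so treat these similarity values as illustrative.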
2.3. Predicting unknown ratings for a user
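Our aim is to use the ratings given by similar users to recommend to Toby movies he has not yet rated. The steps are:
1. Extract the movies Toby has not rated.
2. Find the ratings that all other users gave to those movies.
3. Multiply each of those ratings (from every user except Toby) by that user's similarity to Toby.
4. For each movie, sum the weighted ratings and divide by the sum of the similarities.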
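The last two steps amount to a similarity-weighted average. The symbols below are introduced here for illustration and do not appear in the original notes: $u$ is the target user, $N_m$ is the set of other users who rated movie $m$, $s_{u,v}$ is the similarity between users $u$ and $v$, and $r_{v,m}$ is the rating user $v$ gave to movie $m$.

```latex
\hat{r}_{u,m} = \frac{\sum_{v \in N_m} s_{u,v}\, r_{v,m}}{\sum_{v \in N_m} s_{u,v}}
```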
Before coding, we need to get familiar with the data.table package and its setDT() function. data.table supports advanced operations such as subsetting, manipulating and merging data sets.
2.3.1. Extracting the movies Toby has not rated
Note that after installing a package you still have to load it before you can use its functions.
> install.packages("data.table")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/data.table_1.14.0.tgz'
Content type 'application/x-gzip' length 2338988 bytes (2.2 MB)
==================================================
downloaded 2.2 MB

The downloaded binary packages are in
	/var/folders/2w/tt1p_4td3yq9xlbl7c2t4jn00000gn/T//Rtmp8jyvFD/downloaded_packages
> rating_critic=setDT(movie_ratings[colnames(movie_ratings)[6]],keep.rownames=TRUE)[]
Error in setDT(movie_ratings[colnames(movie_ratings)[6]], keep.rownames = TRUE) : could not find function "setDT"
> library(data.table)
data.table 1.14.0 using 1 threads (see ?getDTthreads).  Latest news: r-datatable.com
**********
Running data.table in Chinese. The package provides support in English only. When searching for help online, be sure to also check the English error message. It can be obtained from the po/R-zh_CN.po and po/zh_CN.po files in the package source, where the native-language and English error messages can be found side by side.
**********
**********
This installation of data.table has not detected OpenMP support. It should still work but in single-threaded mode.
This is a Mac. Please read https://mac.r-project.org/openmp/. Please engage with Apple and ask them for support. Check r-datatable.com for updates, and our Mac instructions here: https://github.com/Rdatatable/data.table/wiki/Installation. After several years of many reports of installation problems on Mac, it's time to gingerly point out that there have been no similar problems on Windows or Linux.
**********

Attaching package: ‘data.table’

The following objects are masked from ‘package:reshape2’:

    dcast, melt

> rating_critic=setDT(movie_ratings[colnames(movie_ratings)[6]],keep.rownames=TRUE)[]
>
Filter the movie ratings:
> rating_critic=setDT(movie_ratings[colnames(movie_ratings)[6]],keep.rownames=TRUE)[]
> names(rating_critic)=c('title','rating')
> View(rating_critic)
From the list above, pick out the movies that have not been rated. The is.na() function is used to select the NA values.
> titles_na_critic=rating_critic$title[is.na(rating_critic$rating)]
> titles_na_critic
[1] "Just My Luck"       "Lady in the Water"  "The Night Listener"
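The same filter can be tried in isolation. The sketch below builds Toby's two-column table by hand as a plain data frame, so data.table is not needed, and applies the is.na() selection. The values are Toby's ratings from the data set, and the variable name `unrated` is introduced here for illustration.

```r
# Toby's ratings, one row per movie; NA marks a movie he has not rated.
rating_critic <- data.frame(
  title  = c("Just My Luck", "Lady in the Water", "Snakes on a Plane",
             "Superman Returns", "The Night Listener", "You Me and Dupree"),
  rating = c(NA, NA, 4.5, 4.0, NA, 1.0),
  stringsAsFactors = FALSE
)

# is.na() returns TRUE for missing ratings; use it to index the titles.
unrated <- rating_critic$title[is.na(rating_critic$rating)]
print(unrated)
```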
2.3.2. Getting the ratings of the users who rated those movies, as a subset of the original data set
> ratings_t=ratings[ratings$title %in% titles_na_critic,]
> View(ratings_t)
Add a similarity column:
> x=(setDT(data.frame(sim_users[,6]),keep.rownames = TRUE)[])
> names(x)=c('critic','similarity')
> ratings_t=merge(x=ratings_t,y=x,by="critic",all.x = TRUE)
> View(ratings_t)
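The merge() call is a left join on the critic column: every rating row is kept (all.x = TRUE) and gains the matching similarity value. A minimal sketch on two rating rows, using plain data frames and similarity values taken from the session output:

```r
# Two ratings of a movie Toby has not seen.
ratings_t <- data.frame(
  critic = c("Lisa Rose", "Mick LaSalle"),
  title  = c("Just My Luck", "Just My Luck"),
  rating = c(3.0, 2.0),
  stringsAsFactors = FALSE
)

# Similarity of each critic to Toby (Toby's similarity to himself is 1).
x <- data.frame(
  critic     = c("Lisa Rose", "Mick LaSalle", "Toby"),
  similarity = c(0.9912407, 0.9244735, 1.0),
  stringsAsFactors = FALSE
)

# Left join: keep every rating row and attach the similarity by critic.
merged <- merge(x = ratings_t, y = x, by = "critic", all.x = TRUE)
print(merged)
```

Toby's own row in `x` has no matching rating row, so it does not appear in the result.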
2.3.3. Multiplying the similarity by the rating and storing the result as a new variable:
> ratings_t$sim_rating=ratings_t$rating*ratings_t$similarity
> ratings_t
          critic              title rating similarity sim_rating
1   Claudia Puig       Just My Luck    3.0  0.8934051  2.6802154
2   Claudia Puig The Night Listener    4.5  0.8934051  4.0203232
3   Gene Seymour  Lady in the Water    3.0  0.3812464  1.1437393
4   Gene Seymour       Just My Luck    1.5  0.3812464  0.5718696
5   Gene Seymour The Night Listener    3.0  0.3812464  1.1437393
6  Jack Matthews  Lady in the Water    3.0  0.6628490  1.9885469
7  Jack Matthews The Night Listener    3.0  0.6628490  1.9885469
8      Lisa Rose  Lady in the Water    2.5  0.9912407  2.4781018
9      Lisa Rose       Just My Luck    3.0  0.9912407  2.9737221
10     Lisa Rose The Night Listener    3.0  0.9912407  2.9737221
11  Mick LaSalle  Lady in the Water    3.0  0.9244735  2.7734204
12  Mick LaSalle       Just My Luck    2.0  0.9244735  1.8489469
13  Mick LaSalle The Night Listener    3.0  0.9244735  2.7734204
2.3.4. Summing the weighted ratings for each movie and dividing by the sum of the similarities
For each movie, add up all the sim_rating values computed in the previous step, then divide that sum by the sum of the similarity values of the users who rated the movie. For Just My Luck, for example, Toby's predicted rating is the sum of the sim_rating values of everyone who rated Just My Luck, divided by the sum of those users' similarity values to Toby:
> install.packages("dplyr")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/dplyr_1.0.5.tgz'
Content type 'application/x-gzip' length 1251016 bytes (1.2 MB)
==================================================
downloaded 1.2 MB

The downloaded binary packages are in
	/var/folders/2w/tt1p_4td3yq9xlbl7c2t4jn00000gn/T//Rtmp8jyvFD/downloaded_packages
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:data.table’:

    between, first, last

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> result = ratings_t %>% group_by(title) %>% summarise(sum(sim_rating)/sum(similarity))
> result
# A tibble: 3 x 2
  title              `sum(sim_rating)/sum(similarity)`
1 Just My Luck                                    2.53
2 Lady in the Water                               2.83
3 The Night Listener                              3.35
> mean(rating_critic$rating,na.rm = T)
[1] 3.166667
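The Just My Luck prediction can be reproduced by hand from the four rows of the table in 2.3.3 that refer to that movie. This is a plain-arithmetic sketch; the vector names are chosen here for illustration.

```r
# Ratings of Just My Luck and each critic's similarity to Toby,
# in the order Claudia Puig, Gene Seymour, Lisa Rose, Mick LaSalle.
rating     <- c(3.0, 1.5, 3.0, 2.0)
similarity <- c(0.8934051, 0.3812464, 0.9912407, 0.9244735)

# Weighted average: sum of (similarity * rating) over sum of similarity.
predicted <- sum(similarity * rating) / sum(similarity)
print(round(predicted, 2))  # 2.53
```

Gene Seymour's low rating of 1.5 has little effect on the result because his similarity to Toby is the smallest of the four.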
The complete build session
> ratings = read.csv("~/Documents/知识库/推荐引擎/movie.csv")
> head(ratings)
         critic              title rating
1 Jack Matthews  Lady in the Water    3.0
2 Jack Matthews  Snakes on a Plane    4.0
3 Jack Matthews  You Me and Dupree    3.5
4 Jack Matthews   Superman Returns    5.0
5 Jack Matthews The Night Listener    3.0
6  Mick LaSalle  Lady in the Water    3.0
> dim(ratings)
[1] 31  3
> Str(ratings)
Error in Str(ratings) : could not find function "Str"
> str(ratings)
'data.frame':	31 obs. of  3 variables:
 $ critic: chr  "Jack Matthews" "Jack Matthews" "Jack Matthews" "Jack Matthews" ...
 $ title : chr  "Lady in the Water" "Snakes on a Plane" "You Me and Dupree" "Superman Returns" ...
 $ rating: num  3 4 3.5 5 3 3 4 2 3 2 ...
> levels(ratings$critic)
NULL
> levels(ratings$title)
NULL
> as.data.frame(acast(ratings,title~critic,value.var="ratings"))
Error in acast(ratings, title ~ critic, value.var = "ratings") : could not find function "acast"
> as.data.frame(cast(ratings,title~critic,value.var="ratings"))
Error in cast(ratings, title ~ critic, value.var = "ratings") : could not find function "cast"
> library(reshape2)
Error in library(reshape2) : there is no package called ‘reshape2’
> install.packages("dplyr")
also installing the dependencies ‘assertthat’, ‘cli’, ‘crayon’, ‘utf8’, ‘fansi’, ‘pillar’, ‘pkgconfig’, ‘purrr’, ‘digest’, ‘ellipsis’, ‘generics’, ‘glue’, ‘lifecycle’, ‘magrittr’, ‘R6’, ‘rlang’, ‘tibble’, ‘tidyselect’, ‘vctrs’

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/assertthat_0.2.1.tgz'
Content type 'application/x-gzip' length 52572 bytes (51 KB)
==================================================
downloaded 51 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/cli_2.3.1.tgz'
Content type 'application/x-gzip' length 469798 bytes (458 KB)
==================================================
downloaded 458 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/crayon_1.4.1.tgz'
Content type 'application/x-gzip' length 139916 bytes (136 KB)
==================================================
downloaded 136 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/utf8_1.1.4.tgz'
Content type 'application/x-gzip' length 195526 bytes (190 KB)
==================================================
downloaded 190 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/fansi_0.4.2.tgz'
Content type 'application/x-gzip' length 212149 bytes (207 KB)
==================================================
downloaded 207 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/pillar_1.5.1.tgz'
Content type 'application/x-gzip' length 951068 bytes (928 KB)
==================================================
downloaded 928 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/pkgconfig_2.0.3.tgz'
Content type 'application/x-gzip' length 17738 bytes (17 KB)
==================================================
downloaded 17 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/purrr_0.3.4.tgz'
Content type 'application/x-gzip' length 417900 bytes (408 KB)
==================================================
downloaded 408 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/digest_0.6.27.tgz'
Content type 'application/x-gzip' length 300368 bytes (293 KB)
==================================================
downloaded 293 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/ellipsis_0.3.1.tgz'
Content type 'application/x-gzip' length 33497 bytes (32 KB)
==================================================
downloaded 32 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/generics_0.1.0.tgz'
Content type 'application/x-gzip' length 69334 bytes (67 KB)
==================================================
downloaded 67 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/glue_1.4.2.tgz'
Content type 'application/x-gzip' length 139018 bytes (135 KB)
==================================================
downloaded 135 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/lifecycle_1.0.0.tgz'
Content type 'application/x-gzip' length 93309 bytes (91 KB)
==================================================
downloaded 91 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/magrittr_2.0.1.tgz'
Content type 'application/x-gzip' length 224854 bytes (219 KB)
==================================================
downloaded 219 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/R6_2.5.0.tgz'
Content type 'application/x-gzip' length 82447 bytes (80 KB)
==================================================
downloaded 80 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/rlang_0.4.10.tgz'
Content type 'application/x-gzip' length 1327903 bytes (1.3 MB)
==================================================
downloaded 1.3 MB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/tibble_3.1.0.tgz'
Content type 'application/x-gzip' length 803558 bytes (784 KB)
==================================================
downloaded 784 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/tidyselect_1.1.0.tgz'
Content type 'application/x-gzip' length 197492 bytes (192 KB)
==================================================
downloaded 192 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/vctrs_0.3.6.tgz'
Content type 'application/x-gzip' length 1403267 bytes (1.3 MB)
==================================================
downloaded 1.3 MB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/dplyr_1.0.5.tgz'
Content type 'application/x-gzip' length 1251016 bytes (1.2 MB)
==================================================
downloaded 1.2 MB

The downloaded binary packages are in
	/var/folders/2w/tt1p_4td3yq9xlbl7c2t4jn00000gn/T//Rtmp8jyvFD/downloaded_packages
> library(reshape2)
Error in library(reshape2) : there is no package called ‘reshape2’
> install.packages("dplyr")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/dplyr_1.0.5.tgz'
Content type 'application/x-gzip' length 1251016 bytes (1.2 MB)
==================================================
downloaded 1.2 MB

The downloaded binary packages are in
	/var/folders/2w/tt1p_4td3yq9xlbl7c2t4jn00000gn/T//Rtmp8jyvFD/downloaded_packages
> install.packages("reshape")
also installing the dependencies ‘Rcpp’, ‘plyr’

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/Rcpp_1.0.6.tgz'
Content type 'application/x-gzip' length 3203922 bytes (3.1 MB)
==================================================
downloaded 3.1 MB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/plyr_1.8.6.tgz'
Content type 'application/x-gzip' length 1012642 bytes (988 KB)
==================================================
downloaded 988 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/reshape_0.8.8.tgz'
Content type 'application/x-gzip' length 168563 bytes (164 KB)
==================================================
downloaded 164 KB

The downloaded binary packages are in
	/var/folders/2w/tt1p_4td3yq9xlbl7c2t4jn00000gn/T//Rtmp8jyvFD/downloaded_packages
> install.packages("reshape2")
also installing the dependencies ‘stringi’, ‘stringr’

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/stringi_1.5.3.tgz'
Content type 'application/x-gzip' length 13641892 bytes (13.0 MB)
==================================================
downloaded 13.0 MB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/stringr_1.4.0.tgz'
Content type 'application/x-gzip' length 210650 bytes (205 KB)
==================================================
downloaded 205 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/reshape2_1.4.4.tgz'
Content type 'application/x-gzip' length 332035 bytes (324 KB)
==================================================
downloaded 324 KB

The downloaded binary packages are in
	/var/folders/2w/tt1p_4td3yq9xlbl7c2t4jn00000gn/T//Rtmp8jyvFD/downloaded_packages
> library(reshape2)
> as.data.frame(acast(ratings,title~critic,value.var="ratings"))
Error: value.var (ratings) not found in input
> ratings = read.csv("~/Documents/知识库/推荐引擎/movie.csv")
> as.data.frame(acast(ratings,title~critic,value.var="ratings"))
Error: value.var (ratings) not found in input
> as.data.frame(acast(ratings,title~critic,value.var="rating"))
                   Claudia Puig Gene Seymour Jack Matthews Lisa Rose Mick LaSalle
Just My Luck                3.0          1.5            NA       3.0            2
Lady in the Water            NA          3.0           3.0       2.5            3
Snakes on a Plane           3.5          3.5           4.0       3.5            4
Superman Returns            4.0          5.0           5.0       3.5            3
The Night Listener          4.5          3.0           3.0       3.0            3
You Me and Dupree           2.5          3.5           3.5       2.5            2
                   Toby
Just My Luck         NA
Lady in the Water    NA
Snakes on a Plane   4.5
Superman Returns    4.0
The Night Listener   NA
You Me and Dupree   1.0
> movie_ratings = as.data.frame(acast(ratings,title~critic,value.var="rating"))
> View(movie_ratings)
> sim_users=cor(movie_ratings[,1:6],use="complete.obs")
> View(sim_users)
> rating_critic=setDT(movie_ratings[colnames(movie_ratings)[6]],keep.rownames=TRUE)[]
Error in setDT(movie_ratings[colnames(movie_ratings)[6]], keep.rownames = TRUE) : could not find function "setDT"
> install.packages("data.table")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/data.table_1.14.0.tgz'
Content type 'application/x-gzip' length 2338988 bytes (2.2 MB)
==================================================
downloaded 2.2 MB

The downloaded binary packages are in
	/var/folders/2w/tt1p_4td3yq9xlbl7c2t4jn00000gn/T//Rtmp8jyvFD/downloaded_packages
> rating_critic=setDT(movie_ratings[colnames(movie_ratings)[6]],keep.rownames=TRUE)[]
Error in setDT(movie_ratings[colnames(movie_ratings)[6]], keep.rownames = TRUE) : could not find function "setDT"
> library(data.table)
data.table 1.14.0 using 1 threads (see ?getDTthreads).  Latest news: r-datatable.com
**********
Running data.table in Chinese. The package provides support in English only. When searching for help online, be sure to also check the English error message. It can be obtained from the po/R-zh_CN.po and po/zh_CN.po files in the package source, where the native-language and English error messages can be found side by side.
**********
**********
This installation of data.table has not detected OpenMP support. It should still work but in single-threaded mode.
This is a Mac. Please read https://mac.r-project.org/openmp/. Please engage with Apple and ask them for support. Check r-datatable.com for updates, and our Mac instructions here: https://github.com/Rdatatable/data.table/wiki/Installation. After several years of many reports of installation problems on Mac, it's time to gingerly point out that there have been no similar problems on Windows or Linux.
**********

Attaching package: ‘data.table’

The following objects are masked from ‘package:reshape2’:

    dcast, melt

> rating_critic=setDT(movie_ratings[colnames(movie_ratings)[6]],keep.rownames=TRUE)[]
> names(rating_critic)=c('title','rating')
> View(rating_critic)
> titles_na_critic=rating_critic$title[is.na(rating_critic$rating)]
> titles_na_critic
[1] "Just My Luck"       "Lady in the Water"  "The Night Listener"
> ratings_t=ratings[ratings$title %in% titles_na_critic,]
> View(ratings_t)
> x=(setDT(data.frame(sim_users[,6]),keep.rownames = TRUE)[])
> names(x)=c('critic','similarity')
> ratings_t=merge(x=ratings_t,y=x,by="critic",all.x = TRUE)
> View(ratings_t)
> ratings_t$sim_rating=ratings_t$rating*ratings_t$similarity
> ratings_t
          critic              title rating similarity sim_rating
1   Claudia Puig       Just My Luck    3.0  0.8934051  2.6802154
2   Claudia Puig The Night Listener    4.5  0.8934051  4.0203232
3   Gene Seymour  Lady in the Water    3.0  0.3812464  1.1437393
4   Gene Seymour       Just My Luck    1.5  0.3812464  0.5718696
5   Gene Seymour The Night Listener    3.0  0.3812464  1.1437393
6  Jack Matthews  Lady in the Water    3.0  0.6628490  1.9885469
7  Jack Matthews The Night Listener    3.0  0.6628490  1.9885469
8      Lisa Rose  Lady in the Water    2.5  0.9912407  2.4781018
9      Lisa Rose       Just My Luck    3.0  0.9912407  2.9737221
10     Lisa Rose The Night Listener    3.0  0.9912407  2.9737221
11  Mick LaSalle  Lady in the Water    3.0  0.9244735  2.7734204
12  Mick LaSalle       Just My Luck    2.0  0.9244735  1.8489469
13  Mick LaSalle The Night Listener    3.0  0.9244735  2.7734204
> install.packages("dply")
Warning in install.packages :
  package ‘dply’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
> install.packages("dplyr")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/dplyr_1.0.5.tgz'
Content type 'application/x-gzip' length 1251016 bytes (1.2 MB)
==================================================
downloaded 1.2 MB

The downloaded binary packages are in
	/var/folders/2w/tt1p_4td3yq9xlbl7c2t4jn00000gn/T//Rtmp8jyvFD/downloaded_packages
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:data.table’:

    between, first, last

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> result = ratings_t %>% group_by(title) %>% summarise(sum(sim_rating)/sum(similarity))
> result
# A tibble: 3 x 2
  title              `sum(sim_rating)/sum(similarity)`
1 Just My Luck                                    2.53
2 Lady in the Water                               2.83
3 The Night Listener                              3.35
> mean(rating_critic$rating,na.rm = T)
[1] 3.166667
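A small recommender function
# Requires data.table (setDT) and dplyr (%>%, group_by, summarise) to be loaded,
# and movie_ratings, ratings and sim_users to exist as built above.
generateRecommendation <- function(userId){
  rating_critic=setDT(movie_ratings[colnames(movie_ratings)[userId]],keep.rownames=TRUE)[]
  names(rating_critic)=c('title','rating')
  titles_na_critic=rating_critic$title[is.na(rating_critic$rating)]
  ratings_t=ratings[ratings$title %in% titles_na_critic,]
  # add a similarity variable: each critic's similarity to the requested user
  x=(setDT(data.frame(sim_users[,userId]),keep.rownames = TRUE)[])
  names(x)=c('critic','similarity')
  ratings_t=merge(x=ratings_t,y=x,by="critic",all.x = TRUE)
  # multiply each rating by the similarity value
  ratings_t$sim_rating=ratings_t$rating*ratings_t$similarity
  # predict ratings for the movies the user has not rated
  result = ratings_t %>% group_by(title) %>% summarise(sum(sim_rating)/sum(similarity))
  return (result)
}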
Calling the function with different user IDs produces movie predictions and recommendations for different users.
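For readers who want to run the whole pipeline without installing anything, here is a self-contained sketch in base R. It is not the author's code: tapply() stands in for acast(), plain indexing replaces data.table and dplyr, the ratings are embedded as text instead of being read from movie.csv, and the names `csv_text` and `predict_for` are hypothetical.

```r
# The 31 ratings from the data set, embedded as CSV text.
csv_text <- "critic,title,rating
Jack Matthews,Lady in the Water,3.0
Jack Matthews,Snakes on a Plane,4.0
Jack Matthews,You Me and Dupree,3.5
Jack Matthews,Superman Returns,5.0
Jack Matthews,The Night Listener,3.0
Mick LaSalle,Lady in the Water,3.0
Mick LaSalle,Snakes on a Plane,4.0
Mick LaSalle,Just My Luck,2.0
Mick LaSalle,Superman Returns,3.0
Mick LaSalle,You Me and Dupree,2.0
Mick LaSalle,The Night Listener,3.0
Claudia Puig,Snakes on a Plane,3.5
Claudia Puig,Just My Luck,3.0
Claudia Puig,You Me and Dupree,2.5
Claudia Puig,Superman Returns,4.0
Claudia Puig,The Night Listener,4.5
Lisa Rose,Lady in the Water,2.5
Lisa Rose,Snakes on a Plane,3.5
Lisa Rose,Just My Luck,3.0
Lisa Rose,Superman Returns,3.5
Lisa Rose,The Night Listener,3.0
Lisa Rose,You Me and Dupree,2.5
Toby,Snakes on a Plane,4.5
Toby,Superman Returns,4.0
Toby,You Me and Dupree,1.0
Gene Seymour,Lady in the Water,3.0
Gene Seymour,Snakes on a Plane,3.5
Gene Seymour,Just My Luck,1.5
Gene Seymour,Superman Returns,5.0
Gene Seymour,You Me and Dupree,3.5
Gene Seymour,The Night Listener,3.0"
ratings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Wide matrix: rows are titles, columns are critics, NA where unrated.
movie_ratings <- with(ratings, tapply(rating, list(title, critic), mean))

# Pearson similarity between critics, using only fully rated movies.
sim_users <- cor(movie_ratings, use = "complete.obs")

# Predict ratings for every movie the given user has not rated.
predict_for <- function(user) {
  unrated <- rownames(movie_ratings)[is.na(movie_ratings[, user])]
  others  <- setdiff(colnames(movie_ratings), user)
  sapply(unrated, function(m) {
    r    <- movie_ratings[m, others]   # other critics' ratings of movie m
    s    <- sim_users[others, user]    # their similarity to the user
    keep <- !is.na(r)                  # only critics who rated m
    sum(r[keep] * s[keep]) / sum(s[keep])
  })
}

toby <- predict_for("Toby")
print(round(toby, 2))
```

For Toby this gives approximately 2.53, 2.83 and 3.35, matching the session output above. Unlike generateRecommendation, the sketch takes a user name rather than a column index, which avoids depending on column order.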