2. Association Rule Mining: The Apriori Algorithm (Part 2)


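2. Hands-on with R
This walkthrough uses the Groceries dataset shipped with the arules package. The dataset is one month of real-world transaction data from a supermarket.
(1) Understanding the dataset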
> # Download and load the package
> install.packages("arules")
> library("arules")
> # Load the dataset and look at its summary
> data(Groceries)
> summary(Groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt
            2513             1903             1809             1715             1372
         (Other)
           34055

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29
  18   19   20   21   22   23   24   26   27   28   29   32
  14   14    9   11    4    6    1    1    1    1    3    1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000

includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage
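Explanation of the summary() output:
First paragraph: the dataset contains 9835 transactions and 169 distinct items. The sparse matrix has a density of 0.02609146, so the total number of items across all baskets is 9835 * 169 * 0.02609146 ≈ 43367.
Second paragraph: the most frequently purchased items, for example whole milk (2513 times) and other vegetables (1903 times).
Third paragraph: the distribution of basket sizes. 2159 orders contain a single item, and the largest basket contains 32 items.
Fourth paragraph: the five-number summary and the mean of the basket size.
Fifth paragraph: besides the item labels, the dataset carries extra item information, here the category columns level2 and level1, where level2 is the finer sub-category and level1 is the broader category.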
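The figures quoted above are mutually consistent, which a few lines of base R can confirm. This is only a sketch: the numbers are copied by hand from the summary(Groceries) output, and the variable names are illustrative rather than part of arules.

```r
# Figures copied by hand from the summary(Groceries) output.
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

# Density is the share of non-empty cells in the transaction-by-item matrix,
# so density * rows * columns recovers the total number of purchased items.
total_items <- round(n_transactions * n_items * density)

# Basket-size distribution: number of baskets (counts) for each size (sizes).
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)

print(total_items)                             # total items across all baskets
print(sum(counts))                             # number of baskets
print(sum(sizes * counts))                     # total items, from the distribution
print(round(total_items / n_transactions, 3))  # mean basket size
```

Both routes give 43367 items in 9835 baskets, and the resulting mean basket size agrees with the Mean of 4.409 reported in the fourth paragraph.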
> # Take a closer look at the dataset
> inspect(Groceries[1:5])
    items
[1] {citrus fruit,semi-finished bread,margarine,ready soups}
[2] {tropical fruit,yogurt,coffee}
[3] {whole milk}
[4] {pip fruit,yogurt,cream cheese ,meat spreads}
[5] {other vegetables,whole milk,condensed milk,long life bakery product}
> basketsize <- size(Groceries)
> itemfreq <- itemFrequency(Groceries)
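Explanation:
size and itemFrequency are both functions from the arules package. The former counts the number of items in each basket, and the latter computes the support of each item. itemFrequencyPlot draws these supports as a bar chart, as follows: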
> # Plot only the items whose support is at least 0.1
> itemFrequencyPlot(Groceries, support = 0.1)
> # Plot the 10 most frequent items
> itemFrequencyPlot(Groceries, topN = 10)
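To make the two measures concrete, here is a small base R sketch of what size and itemFrequency report, using only the five baskets printed by inspect(Groceries[1:5]). The `baskets` list and the variable names are hypothetical stand-ins typed in by hand; on the real data the arules functions should be used instead.

```r
# The five baskets shown by inspect(Groceries[1:5]), typed in by hand.
baskets <- list(
  c("citrus fruit", "semi-finished bread", "margarine", "ready soups"),
  c("tropical fruit", "yogurt", "coffee"),
  c("whole milk"),
  c("pip fruit", "yogurt", "cream cheese", "meat spreads"),
  c("other vegetables", "whole milk", "condensed milk",
    "long life bakery product")
)

# Counterpart of size(): number of items in each basket.
basket_size <- lengths(baskets)

# Counterpart of itemFrequency(): share of baskets that contain each item.
# Each item appears at most once per basket, so counting occurrences works.
item_support <- table(unlist(baskets)) / length(baskets)

print(basket_size)
print(sort(item_support, decreasing = TRUE))
```

In this toy sample yogurt and whole milk each appear in 2 of 5 baskets, so their support is 0.4, while every other item has support 0.2.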
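(2) Mining association rules
To mine association rules, the first step is to set a minimum support based on domain knowledge. With roughly 9835/30 ≈ 328 orders per day and 328/12 ≈ 27 orders per hour (assuming 12 opening hours per day), this is a medium-sized supermarket. We assume an item is worth considering if it is bought at least twice a day, which gives a minimum support of 2*30/9835 ≈ 0.006. The minimum confidence is provisionally set to 0.25.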
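The threshold arithmetic above can be reproduced in base R. This is a sketch only: `hours_open` is an assumption read off the 328/12 step, and all variable names are illustrative.

```r
n_transactions <- 9835   # transactions in the one-month dataset
days <- 30               # length of the observation period
hours_open <- 12         # assumption implied by the 328/12 step in the text

orders_per_day  <- n_transactions / days
orders_per_hour <- orders_per_day / hours_open

# "Bought at least twice a day" means at least 2 * 30 = 60 purchases a month.
min_count   <- 2 * days
min_support <- min_count / n_transactions

print(round(orders_per_day))    # about 328 orders per day
print(round(orders_per_hour))   # about 27 orders per hour
print(round(min_support, 4))    # about 0.0061, used as 0.006 in the text
```

Working from the absolute count (60 purchases) rather than the rounded ratio makes it easy to revisit the threshold if the observation period or the "twice a day" rule changes.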
> # Extract the association rules
> rules <- apriori(Groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
       0.25    0.1    1 none FALSE            TRUE       5   0.006      2     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 59

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [109 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [463 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> summary(rules)
set of 463 rules

rule length distribution (lhs + rhs):sizes
  2   3   4
150 297  16

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   3.000   2.711   3.000   4.000

summary of quality measures:
    support           confidence          lift            count
 Min.   :0.006101   Min.   :0.2500   Min.   :0.9932   Min.   : 60.0
 1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229   1st Qu.: 70.0
 Median :0.008744   Median :0.3554   Median :1.9332   Median : 86.0
 Mean   :0.011539   Mean   :0.3786   Mean   :2.0351   Mean   :113.5
 3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565   3rd Qu.:121.0
 Max.   :0.074835   Max.   :0.6600   Max.   :3.9565   Max.   :736.0

mining info:
      data ntransactions support confidence
 Groceries          9835   0.006       0.25
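The count column in the summary is the support multiplied by the number of transactions. A quick base R check, with the values copied by hand from the output above and hypothetical variable names:

```r
n_transactions <- 9835

# Smallest and largest rule support reported by summary(rules).
support <- c(min = 0.006101, max = 0.074835)

# Absolute number of transactions that contain all items of the rule.
count <- round(support * n_transactions)
print(count)
```

The smallest count comes out as 60, which is exactly the "twice a day for 30 days" threshold used to choose the minimum support, and the largest as 736, matching the count column.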