Collaborative Recommendation with the SlopeOne Algorithm ("Guess You Like") - C++技术网

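1. Concept

The idea behind SlopeOne is simple: it uses averaging to smooth over the differences in how individual users rate. An example makes this concrete: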

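In this figure, how should the system compute Wang Wu's rating for the refrigerator? As noted above, SlopeOne works by averaging, so: R_WangWu = 4 - {[(5 - 10) + (4 - 5)] / 2} = 7. Here (5 - 10) and (4 - 5) are the two other users' rating differences between the item Wang Wu has already rated and the refrigerator. Their mean is -3, and subtracting it from Wang Wu's own rating of 4 gives 7.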
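
The arithmetic above can be sketched in a few lines of C++. This is only an illustration: the function names averageDeviation and predictFromOneItem are made up for this sketch and do not appear in the listing later in the post. Each pair holds one user's rating of the known item and of the target item.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Average of (known item rating - target item rating) over the users who rated both.
// Each pair is (rating of the known item, rating of the target item) for one user.
double averageDeviation(const std::vector<std::pair<double, double>>& coRatings)
{
	assert(!coRatings.empty());
	double sum = 0;
	for (const auto& r : coRatings)
		sum += r.first - r.second;
	return sum / coRatings.size();
}

// Predict the target item's rating from the user's rating of the known item.
double predictFromOneItem(double knownRating,
                          const std::vector<std::pair<double, double>>& coRatings)
{
	return knownRating - averageDeviation(coRatings);
}
```

With the two co-rating pairs (5, 10) and (4, 5) and Wang Wu's rating of 4, predictFromOneItem returns 4 - (-3) = 7, matching the hand calculation.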

Next, let's see how to compute the rating when more than two items are involved.

r_b = (n * (r_a - R(A->B)) + m * (r_c - R(C->B))) / (m + n)

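Notes: a, b, c denote items.

r_a denotes the user's rating of item a.

R(A->B) denotes the average difference between the ratings of A and the ratings of B (rating of A minus rating of B), taken over the users who rated both.

n and m are user counts: n is the number of users who rated both A and B, and m is the number of users who rated both C and B.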
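
As a rough sketch of the weighted formula, the snippet below takes one term per known item: the user's rating r, the average difference R(known->target), and the number of users who rated both items. The names SlopeTerm and weightedSlopeOne are hypothetical and exist only in this sketch, and the numbers mentioned after the code are illustrative, not taken from the figure.

```cpp
#include <cassert>
#include <vector>

// One term of the weighted sum (hypothetical helper type for this sketch).
struct SlopeTerm
{
	double userRating;  // r_a: the user's rating of a known item
	double deviation;   // R(A->B): average of (rating of A - rating of B)
	int coRaters;       // number of users who rated both A and B
};

// Weighted average of (r - R), each term weighted by its co-rater count.
double weightedSlopeOne(const std::vector<SlopeTerm>& terms)
{
	double weighted = 0;
	int totalCoRaters = 0;
	for (const SlopeTerm& t : terms)
	{
		weighted += t.coRaters * (t.userRating - t.deviation);
		totalCoRaters += t.coRaters;
	}
	assert(totalCoRaters > 0);
	return weighted / totalCoRaters;
}
```

For example, with the made-up terms (rating 4, deviation -1, 3 co-raters) and (rating 5, deviation 1, 1 co-rater), the result is (3 * 5 + 1 * 4) / 4 = 4.75. The term backed by more users pulls the prediction toward its own estimate.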

Let's work through the formula.

r_WangWu = (2 * (4 - R(washing machine->color TV)) + 2 * (10 - R(refrigerator->color TV)) + 2 * (5 - R(air conditioner->color TV))) / (2 + 2 + 2) = 6.8

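Yes, SlopeOne really is that simple, and it works quite well in practice.

Since the idea is so simple, let's try it out. This is the most naive implementation possible, meant only to check how well the algorithm performs. The dataset is the same as in the previous post: the small MovieLens dataset, with ratings from about 1,000 users on about 2,000 items; 80% is used for training and 20% for testing.

The code is as follows: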

#include <cmath>    // pow
#include <cstdlib>  // exit
#include <cstring>  // memset
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
const int USERMAX = 1000;
const int ITEMMAX = 2000;
double rating[USERMAX][ITEMMAX];//rating[u][i]: score user u gave item i
int I[USERMAX][ITEMMAX];//indicate if the item is rated
double mean;//global mean rating, used as a fallback prediction

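// predict user u's rating of item l with weighted SlopeOne
double predict(int u, int l)
{
	double total = 0;
	double totalCnt = 0;
	for (int i = 0; i < ITEMMAX; i++)
	{
		// only items that user u has rated can contribute
		if (l != i && I[u][i])
		{
			// dev: average of (rating of i - rating of l) over users who rated both
			double dev = 0;
			int cnt = 0;
			for (int j = 0; j < USERMAX; j++)
			{
				if (I[j][l] && I[j][i])
				{
					dev += rating[j][i] - rating[j][l];
					cnt++;
				}
			}
			if (cnt)
			{
				dev /= cnt;
				// weight this item's estimate by the number of co-raters
				total += (rating[u][i] - dev) * cnt;
				totalCnt += cnt;
			}
		}
	}
	// no usable item pair: fall back to the global mean
	if (totalCnt == 0)
		return mean;
	return total / totalCnt;
}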
double calMean()
{
	double total = 0;
	int cnt = 0;
	for (int i = 0; i < USERMAX; i++)
		for (int j = 0; j < ITEMMAX; j++)
		{
			total += I[i][j] * rating[i][j];
			cnt += I[i][j];
		}
	return cnt ? total / cnt : 0;//avoid dividing by zero when nothing was read
}

void train()
{
	//read rating matrix
	memset(rating, 0, sizeof(rating));
	memset(I, 0, sizeof(I));
	ifstream in("ua.base");
	if (!in)
	{
		cout << "file not exist" << endl;
		exit(1);
	}
	int userId, itemId, rate;
	string timeStamp;
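	while (in >> userId >> itemId >> rate >> timeStamp)
	{
		// skip ids that would fall outside the fixed-size arrays
		if (userId < 0 || userId >= USERMAX || itemId < 0 || itemId >= ITEMMAX)
			continue;
		rating[userId][itemId] = rate;
		I[userId][itemId] = 1;
	}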
	mean = calMean();
}

void test()
{
	ifstream in("ua.test");
	if (!in)
	{
		cout << "file not exist" << endl;
		exit(1);
	}
	int userId, itemId, rate;
	string timeStamp;
	double total = 0;
	double cnt = 0;
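	while (in >> userId >> itemId >> rate >> timeStamp)
	{
		// skip ids that would fall outside the fixed-size arrays
		if (userId < 0 || userId >= USERMAX || itemId < 0 || itemId >= ITEMMAX)
			continue;
		double r = predict(userId, itemId);
		cout << "true: " << rate << " predict: " << r << endl;
		total += (r - rate)*(r - rate);
		cnt += 1;
		//cout << total << endl;
	}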
	cout << "test rmse is " << pow(total / cnt, 0.5) << endl;
}
int main()
{
	train();
	test();
	return 0;
}
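
The predict() function above recomputes the average difference for every test rating by scanning all users. One possible variation, sketched below under the assumption that an items-by-items table fits in memory, is to compute each pairwise difference once and reuse it. DeviationTable, buildDeviations and predictCached are hypothetical names for this sketch, not part of the listing above, and the result should match predict() for the same inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Precomputed pairwise statistics (hypothetical helper type for this sketch).
struct DeviationTable
{
	int items;
	std::vector<double> dev;  // dev[i * items + l]: average of (rating of i - rating of l)
	std::vector<int> count;   // count[i * items + l]: users who rated both i and l
};

// rating[u][i] is the score; rated[u][i] is nonzero when user u rated item i.
DeviationTable buildDeviations(const std::vector<std::vector<double>>& rating,
                               const std::vector<std::vector<int>>& rated, int items)
{
	DeviationTable t{items, std::vector<double>(items * items, 0.0),
	                 std::vector<int>(items * items, 0)};
	for (std::size_t u = 0; u < rating.size(); u++)
		for (int i = 0; i < items; i++)
		{
			if (!rated[u][i]) continue;
			for (int l = 0; l < items; l++)
			{
				if (l == i || !rated[u][l]) continue;
				t.dev[i * items + l] += rating[u][i] - rating[u][l];
				t.count[i * items + l]++;
			}
		}
	// turn the accumulated sums into averages
	for (int k = 0; k < items * items; k++)
		if (t.count[k]) t.dev[k] /= t.count[k];
	return t;
}

// Same weighted formula as predict(), but reading the cached averages.
double predictCached(const DeviationTable& t, const std::vector<double>& userRating,
                     const std::vector<int>& userRated, int l, double fallback)
{
	assert(l >= 0 && l < t.items);
	double total = 0;
	int totalCnt = 0;
	for (int i = 0; i < t.items; i++)
	{
		int cnt = t.count[i * t.items + l];
		if (i == l || !userRated[i] || cnt == 0) continue;
		total += (userRating[i] - t.dev[i * t.items + l]) * cnt;
		totalCnt += cnt;
	}
	return totalCnt ? total / totalCnt : fallback;
}
```

The trade-off is memory for time: the table holds items * items entries, while each prediction only loops over the items once instead of over every user for every item.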

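The RMSE on the test set comes to 0.96, while the SVD implemented in the previous post, which finds its optimum through a more involved gradient descent, only gets to about 0.95. So SlopeOne is both simple and effective. Wikipedia describes it as the simplest form of collaborative filtering, but personally I find kNN-style collaborative filtering easier to understand (it is just more cumbersome in places, such as computing user similarity).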