Okay denizens of the internet, I need you to do me a favor. It won’t take much on your part, but we can’t solve this problem unless we work together.
The thing is: y’all broke the rating and review systems on the internet.
Sure, reviews and ratings mechanisms are up and running; they’re doing their job in tabulating the number of stars people click and more or less making sure that they can’t vote twice. Technically speaking, there’s nothing wrong with the rating systems themselves. The problem is how we’re using them. 10/10 or five stars out of five doesn’t mean anything anymore.
Look, don’t give up on me just yet. This is better explained by going through an example scenario. Let’s imagine that we’re trying to find a new video game to play.
We’ve got some time to kill while we’re stuck here in our homes for the rest of eternity (No way out! No way out!) and we’ve just about had it with playing through Dead Space.¹ We could go online and start reading the reviews for every game that’s come out in the last month or two, but that’s going to take forever. Instead, it would be smarter for us to weed out some games based on the simple metric of how they were rated — 1–5 or 1–10 stars, or 1/10 or 1/100; whatever numerical or graphical system there is. From there we can choose a few that look promising and invest the time in hearing what people actually have to say about them.
We scroll through the list of games and immediately start to run into a bit of a problem though. It seems like something weird has happened and the universe of video games is now populated only by incredible titles that people think are 9–10 out of 10, and atrocious ones that get 0–2 points. What’s going on here?
How is it possible that video games are either 10/10 perfect, or 1/10 “worst thing ever”? The same is true of Amazon products, or Netflix movies, or app ratings on the app stores.
Just going by the numbers available on rating and review systems, one would have to conclude that we live in a world where things are either terrific or horrible.² Nothing is just “good” or “okay” or “fine” anymore, and that’s troubling because it means we have no way of telling which things are better than the others. Did everyone really really like Red Dead 3, or did they really really really like it? Was it truly the 10/10 best game they’ve ever played? That would be difficult to believe, considering their review history shows that a lot of other games they’ve played got 10/10 too. Surely every game can’t be the best game you’ve ever played?
Now, at this point I hear what you’re thinking. “Dacoda, you pretentious, nit-picking, pedantic nincompoop; just get with the program. If you like it you give it all the stars, if you don’t like it you give it as few stars as possible. It’s so easy!”
That doesn’t work though. If you’re using a system with 5 or 10 degrees of good-vs-badness then only making use of two of the options destroys its usefulness. It turns everything into a binary choice between GOOD or BAD, with no indication of which good things are BETTER or which bad things are WORSE than any of the others.
This is similar to the really unfortunate trend of grade inflation in education. A “C” grade is nothing to be embarrassed about — it literally means that you’ve done as well as just about everybody else, not exceptional, and not terrible. Society had different ideas though and for some reason a C became a “bad” grade. In people’s minds the grading system was reduced from A at the top and F at the bottom, to merely A or F.
What this has led to is a world where a straight-A student isn’t a super-genius, they’re just someone who’s done their work adequately. If you turn in good papers and do all your homework and show up to class then, as long as you also get 90% on the midterm and final exams, you’ll probably walk away with an A.
How then do we determine which students were like, wildly intelligent? If an acceptable paper on Abraham Lincoln gets an A, but a phenomenally researched paper analyzing the absurdity of painting Lincoln as a champion of equality also gets an A, how is anyone supposed to distinguish between “fine” and “exceptional”?
This plays out in online ratings today and it’s equally problematic. If I buy two pairs of headphones to test them out then it’s almost guaranteed I’m going to like one of them more than the other. Giving both of them five stars doesn’t begin to convey that information to other purchasers who are seeking a recommendation. All this tells people is that both pairs of headphones work, because if they didn’t they’d get one star, but it doesn’t say which pair sounds better, is more comfortable, etc.
There are two ways of looking at humanity for giving out ratings in this way:
1) People are too dumb to know anything about standard deviation and normal distributions.
2) People are fundamentally good and don’t want to hurt anybody else’s feelings.
For a long time, I have to admit that I thought #1 was the problem. As I’ve pondered it more over the years though, I’ve come to conclude that, actually, it’s probably more #2 than anything else. Everyone realizes that the star ratings aren’t that helpful on some level, they just don’t want to be mean — although I will admit that sentiment seems very strange to me considering that being mean online is more or less a major part of internet culture.
The way to resolve this problem is that we’re all going to have to take a hard look at ourselves and admit something that is a major taboo: it’s perfectly acceptable to be just … okay.
I’ll say that again: we need to accept that it is alright to be okay.
It’s not a popular thing to say, but we can’t all be the best at everything all the time. In the same way, not every book, video game, movie, T-shirt, pencil-sharpener, whatever-else-Amazon-sells, etc. can be the best of all time either.
Instead, please allow me to take a few minutes to explain how we should be rating and reviewing things on the internet, because if we can all work together to make this happen, we will all benefit. Think of how useful it will be to be able to glean useful information from those metrics, instead of just using them as a gauge of whether something might be worth buying (4.5 to 5 stars) or whether it’s definitely a scam product that will torture your pets and give you acne (anything less than 4.5 stars).
The mindset when choosing how to rate an item should be one of retrospective as well as contemporaneous examination. To put it another way, what did you think before and after consuming or using the product?
When shopping for a thing or selecting a movie to watch or song to listen to, one should ideally have an idea of what they are expecting. For this item, at this price, and given what I know about it from its description and things I’ve heard, what would be an acceptable outcome?
Let’s take the example of a pair of headphones, because this also illustrates how the approach I’m suggesting works well with subjective things like personal taste as well. When shopping online I’m going to be looking at the prices and the features and all that good stuff and I’m going to have some vague idea of how “good” I expect audio to sound based on both how much I’m paying for them, and what sort they are (over-the-ear, in-ear, sport, etc.). That expectation is the baseline that I will use to decide on my rating.
When the headphones come, I’ll plug them in and start listening to some audio. There are three archetypes for the possible outcomes:³
1.) Audio quality is worse than I expected — I am disappointed.
2.) Audio quality is basically exactly what I thought it would be.
3.) Audio quality is better than I expected — I am pleasantly surprised.
If we are using a three-star rating system, or if the numerical options are the numbers 1, 2, or 3, then we’ve also already got our possible ratings. If they’re worse than I was expecting then they get a one. If they’re what I expected, they get a two, and if they’re better than I expected by some amount that I think is worth mentioning then they get a three.
What this means is that when I give my rating out people can learn things from it. True, what I expect of a pair of headphones at a particular price versus what others expect of them at that price are likely slightly different. Some people want stronger bass, others are more interested in response times, etc. This works itself out over time when averaging over a large number of reviews though: the general consensus will be whether the headphones are as good as, better, or worse than what the majority of people expected.
However, if we use ratings the way they seem to be most commonly used, I would presumably give the headphones three stars even if they only fell within the range of quality that I’d been expecting. I don’t want to hurt the seller’s feelings or make them look bad by giving them only two stars!
But again, that forces us to ask the question: what’s wrong with two stars? The meaning of two stars is that the headphones are fine. You got what you paid for. Wouldn’t it be terrific if instead of thinking that a three-star rating meant that the headphones will either be fine or exceptional, we could assume that if most people give them a three-star rating they’re actually great value for your money? That’s more informative! From that information people can look for the best products, rather than all products which are acceptable or better, never knowing where they fall on that scale.
We can expand this even more. The common 5-star rating system gives us another degree of helpfulness. Here are the new options for what could happen:
1.) The headphones sound much worse than I expected.
2.) The headphones sound notably worse than I expected.
3.) The headphones sound like what I expected.
4.) The headphones sound notably better than I expected.
5.) The headphones sound much better than I expected.
This is even more informative than the three-star system because now I can express a certain amount of how much better or worse things are, rather than just whether they were better or worse. It provides a lot more data for people who are looking for a product because now they can see, to a small degree, how much people like or dislike something.
This is how we need to go about rating things, and it’s how rating systems are supposed to work. In statistics there is a concept known as a bell curve: the idea is that if you take a lot of samples of things, you should wind up with a curve that looks like a bell. The most common outcomes wind up in the middle, and outcomes that were higher or lower (better or worse) happen less frequently, falling to the right or left of the center, respectively.
Getting an A should be something that happens infrequently, because it’s meant to represent someone doing something much better than average. A B should happen more often, but still not be all that common, because it’s meant to show that a person did better than most. Most people should get a C, though, because most people perform about the same as everyone else, broadly speaking.
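That shape is easy to see in a quick simulation. The sketch below draws 10,000 scores from a normal distribution and buckets them into letter grades centered on the mean; the grade thresholds here are my own purely illustrative choice, not any real grading scheme.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the illustration is reproducible

def letter(score):
    """Bucket a standardized score into a letter grade, with C centered on the mean."""
    if score >= 1.5:
        return "A"
    if score >= 0.5:
        return "B"
    if score >= -0.5:
        return "C"
    if score >= -1.5:
        return "D"
    return "F"

# Sample scores from a standard normal distribution and tally the grades.
grades = Counter(letter(random.gauss(0, 1)) for _ in range(10_000))
print(grades.most_common())  # C is the most common grade; A and F are the rarest
```

Run it and you get exactly the distribution described above: a big pile of Cs in the middle, fewer Bs and Ds on either side, and only a handful of As and Fs at the extremes.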
There is nothing wrong with getting a C, and there shouldn’t be anything wrong with getting a 3 out of 5 star rating, or a 5 out of 10 score, or 50 out of 100 on Metacritic. Those ratings mean that the product performed fine. It was okay. It worked and it did what it was supposed to, or it entertained you but it isn’t something you’d write a rave review about. It’s just … fine, and there’s nothing wrong with that.
We should be working to communicate the quality of items using rating systems, rather than trying to reinforce the negative idea that everything ought to be exceptional or else it’s a failure.
Hold on a second! I hear you shout. You’ve got to think about the averages!
I have thought about the averages, but they aren’t helpful when they’re used like this. It seems as though a product which 1,000 people liked and rated 5 stars and 1,000 people disliked and rated 1 star would average out to a 3 star rating, and that’s perfectly true. Therein lies the conundrum.
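The arithmetic is easy to verify with made-up numbers: two products with completely different receptions end up with an identical average.

```python
# Hypothetical ratings for two very different products.
polarized = [5] * 1000 + [1] * 1000  # 1,000 people loved it, 1,000 hated it
middling = [3] * 2000                # 2,000 people found it merely fine

avg_polarized = sum(polarized) / len(polarized)
avg_middling = sum(middling) / len(middling)

print(avg_polarized, avg_middling)  # 3.0 3.0 — indistinguishable by average alone
```

Looking only at the average, a shopper has no way to tell the love-it-or-hate-it product from the reliably mediocre one.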
How many of the 5-star reviews were from people who absolutely loved the product? How many thought it was just meh? Did the 1-star reviewers dislike the product, or did it come DOA, or catch fire, or cause an allergic reaction?
One might think that the answer to that problem is to read the worded reviews but come on, you can’t be serious. The overall quality of written reviews aside, there’s the glaring problem of people’s bias towards writing reviews. People who hated a product are much more likely to write a scathing and potentially misleading review out of spite than people who loved something. When you love a product you would rather spend time using it than going online to provide free advertising services to a company that can afford to pay for them. When you’re even slightly upset or disappointed by a product there’s nothing more enjoyable than clicking online and writing about how terrible it is.
In statistics, distribution of data points is important. In the case of online product ratings, this is equally true. It’s essential to know how many of the five-star raters were truly blown away by the product versus how many were just pleasantly surprised, and how many only clicked the five-star rating because that’s what indicates that it turns on and works the way it’s supposed to. The same is true of the one-star ratings.
If something is phenomenally bad, then the one-stars should be large in number, but if it’s just kind of bad, then the two-stars should be more common. A one-star rating drags the average down below three-stars more than a two-star rating does. A five-star rating raises the average up above three-stars more than a four-star rating does. Was Shrek 2 really worth five stars, or was it more of a four?
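One way to see why the spread matters: two hypothetical sets of ratings (the numbers are invented for illustration) can share the same mean while telling completely different stories, and the standard deviation is what exposes the difference.

```python
from statistics import mean, stdev
from collections import Counter

# Same average rating, very different experiences.
consistent = [2, 3, 3, 3, 3, 3, 3, 3, 3, 4]  # almost everyone: "it's fine"
divisive = [1, 1, 1, 1, 1, 5, 5, 5, 5, 5]    # pure love-it-or-hate-it

for name, ratings in [("consistent", consistent), ("divisive", divisive)]:
    # Print the mean, the spread, and the raw tally of star ratings.
    print(name, mean(ratings), round(stdev(ratings), 2), Counter(ratings))
```

Both products average out to three stars, but the first has a tiny standard deviation (everyone agrees it’s fine) while the second has a huge one (nobody agrees on anything). That distinction is exactly what a raw average throws away.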
Another way of thinking about this is in terms of making lists. If I asked you to make me a list of the top ten movies you think people need to watch, you probably wouldn’t have too much trouble. If I asked you for a list of other good movies you could probably jot down another 50 or so. If I asked you to give me a list of movies that are just alright and entertaining then you could probably just start writing away for a while without having to do too much thinking other than whether they trigger bad memories or not.
The same is true of bad movies. Most people could easily give a list of the 10 worst movies they’ve ever seen, and could add to that list a few dozen more which are bad but not appalling.
But what if I came to you with a list of every movie you’ve rated five stars and asked you to explain yourself? You honestly think Scary Movie and Blade Runner are on the same level? Like, seriously; an alien crash-lands in your back yard and while she’s waiting for the intergalactic tow-truck to show up she wants to lay low, Netflix, and chill.⁴ She wants to see what Earth cinema is like.
Alright great. Do you pull up your five-star Amazon reviews list and let her pick one at random? Almost certainly not. You don’t want to be showing her Sausagefest instead of Reservoir Dogs just because of chance!
That should be the criteria for five-star ratings: is it something that you consider an outstanding example of film, music, literature, or any other form of product? Contrariwise, one-star ratings should be reserved for things that are so bad, they will leave a lasting memory. A one-star rating should be something you tell anecdotes about at parties; that horrible time you bought a toaster-oven and it somehow managed to assault your dog.
There is a hiccough to my appeal, though, and we should address that: this is an entrenched standard at this point. That’s not going to be easy to overcome.
Even if a rapidly increasing number of users see the light of my argument, repent, and mend their erring ways, there will be a period of time where two competing standards are at work, counterbalancing the benefits of doing things properly.
Worse: even if everyone were to start doing their online rating correctly overnight by magic, we’d be stuck with a huge back-catalogue of things that were rated according to the old regime. One would be left looking at the ratings for a new pair of sneakers and wondering whether the four stars indicates that most people are pleasantly surprised, or that moderately more people found the shoe acceptable (or better) than people who disliked it (or worse).
That … is something I’m not sure how to grapple with, but I think I’ve got a solution:
As it stands, online ratings are not very helpful. They don’t give much information because they aren’t being used in the way that they were intended. That’s the baseline we’re at right now, so if we begin to work together as an internet community to rate things properly, and normalize the idea that three-stars out of five is good, actually, then we’ll be improving things.
In essence, ratings systems can’t really get any less informative than they are right now, so while changing our behaviors might not immediately solve all the problems, it won’t make anything worse. We’d be working towards a better future in the meantime.
So, this I implore you, internet comrades: please use ratings appropriately. When you do, we’re all better off.
¹ Dead Space is a terrific game and I love it to death. It has wonderful replay value but only about once per year.
² There may be some truth to this …
³ Technically speaking we could compress this down and say that there are only two possible outcomes: audio is or is not what I expected, and then from there we could decide whether it was better or worse in the case of it not being what I expected. However, that regime is not adequate for conveying any useful information so we’re going with three possible generic outcomes.
⁴ Not that kind of Netflix and chill; calm thyself.