Here's a sneak peek:
Movie posters aren't randomly chosen montages from a movie. They are deliberately crafted to broadcast a message, a meaning, that suggests in a blink what the movie is about. Does this imply that posters from different genres of movies have their own distinguishing signature?
Well, I decided to try and find out by using the Netflix dataset. The dataset contains about 100 million (100480507 to be exact) ratings on 17770 movies from 480189 users. That is a pretty large dataset, but still quite sparse: only about 1.2% of the possible user-movie pairs are actually rated. Despite the sparsity, one can still derive reasonably accurate movie similarities, and most recommendation engines are driven by this idea (collaborative filtering). I used movie similarities (derived using a rough variant of Pearson correlation) to form clusters, and then used the clusters to create an average of the movie posters from the best exemplars within each cluster.
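A minimal sketch of the similarity idea, not the actual code behind these results: plain Pearson correlation between two movies, computed only over the users who rated both. The "rough variant" mentioned above is not spelled out in this post, so the dict-of-ratings layout, the function name and the min_common cutoff are all assumptions made for illustration.

```python
from math import sqrt

def pearson_similarity(ratings_a, ratings_b, min_common=2):
    """Pearson correlation between two movies over the users who rated both.

    ratings_a, ratings_b: dicts mapping user id -> rating (hypothetical layout).
    Returns 0.0 when there are too few common raters or no variance.
    """
    common = ratings_a.keys() & ratings_b.keys()
    n = len(common)
    if n < min_common:
        return 0.0
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Toy ratings on the 1-5 star scale; user 5 rated only movie_b and is ignored.
movie_a = {1: 5, 2: 4, 3: 1, 4: 2}
movie_b = {1: 4, 2: 5, 3: 2, 4: 1, 5: 3}
movie_c = {1: 1, 2: 2, 3: 5, 4: 4}

sim_ab = pearson_similarity(movie_a, movie_b)  # positive: liked by the same users
sim_ac = pearson_similarity(movie_a, movie_c)  # negative: opposite audiences
print(sim_ab, sim_ac)
```

With so few common raters per pair in a sparse dataset, a cutoff like min_common matters a lot: a correlation computed from two or three shared users is mostly noise.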
A few technical details here; you can jump straight ahead to the wonderful images if you'd rather not be bothered.
------
The movie posters (150x110 thumbnails, really) for all 17770 movies in the dataset were grabbed from the Netflix website using a Python script. I must add that breaking these movies up into clusters isn't quite straightforward and is susceptible to noise. Wanting to avoid storing the entire 17770x17770 similarity matrix in floating-point precision, I used a single byte for each similarity value, which gives 256 (2^8) levels of precision. Also, the number of users who have watched any given movie varies wildly (quite possibly following a power-law distribution), throwing another confound into the similarity calculation mix. Iterating clusters with Lloyd's algorithm as-is was disastrous, with all cluster centroids gravitating towards the most popular (most-watched) movies. I had to come up with my own variant that was more robust to the confounds. The average cluster posters are, as a consequence, also a reflection of how well the cluster assignment went, and I am quite pleased with my home brew!
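To make the single-byte trick concrete: a 17770x17770 matrix has 315,772,900 entries, so one byte each is about 316 MB, against roughly 1.26 GB for 4-byte floats. The sketch below shows one way to squeeze a similarity into a byte, assuming values lie in [-1, 1] and using evenly spaced levels. The exact mapping used for this post is not given, so quantize and dequantize are hypothetical helpers.

```python
def quantize(sim):
    """Map a similarity in [-1.0, 1.0] to an integer code in 0..255."""
    sim = max(-1.0, min(1.0, sim))  # clamp out-of-range input
    return round((sim + 1.0) / 2.0 * 255)

def dequantize(code):
    """Recover the approximate similarity from its one-byte code."""
    return code / 255 * 2.0 - 1.0

# One row of a (hypothetical) similarity matrix stored as raw bytes.
similarities = [1.0, 0.8, 0.0, -0.35, -1.0]
row = bytearray(quantize(s) for s in similarities)
recovered = [dequantize(c) for c in row]

# Rounding error is at most half a step: (2 / 255) / 2, about 0.004.
max_error = max(abs(a - b) for a, b in zip(similarities, recovered))
print(list(row), max_error)
```

An error of 0.004 is far below the noise in similarities estimated from sparse ratings, which is why the loss of precision is tolerable here.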
------
On to the wonderfully abstract average cluster posters.
The most heartening affirmation of my cluster accuracy came from Dr. Who. The average poster pulls out the title rather beautifully.
Some of the best clusters were for television series, and this is reflected in the posters:
Ultimate Fighting Championship
Discovery
None of the movie names have Discovery in the title, so you can be confident I am not fudging the results here.
Some of the movies that make up that poster:
Inside the Space Station
Voyage to the Planets and Beyond
Sasquatch Hunters
Extreme Engineering: Tokyo's Sky City
Extreme Engineering: Holland's Barriers to the Sea
Architectures
City of Steel: Carrier
Other remarkably good series posters:
Dark Shadows (I'd never seen or heard of this one before)
Ah, the Dragonball series. Anyone who has spent any time with the Netflix data is bound to recognize these. The cluster is incredibly strong with this one. Lots of devoted fanboys perhaps.
Inspector Morse
Danielle Steel's !!!
This one's intriguing. Why would anyone rent IMAX movies specifically? Is it the HD home-theater people?
Antarctica: IMAX
Blue Planet: IMAX
Grand Canyon: Hidden Secrets: IMAX
Galapagos: IMAX
The Great Barrier Reef: IMAX
Whales: An Unforgettable Journey: IMAX
Tropical Rainforest: IMAX
This one was useful. None of the movies in the Netflix dataset have PBS in the title and that made it tougher to validate some documentary clusters. Well they have it on the posters! (And after looking at the cluster exemplars, I realize I could look for the phrase American Experience)
Latham Entertainment Presents: An All New Comedy Experience
Bataan Rescue: American Experience
Ansel Adams: American Experience
Woodrow Wilson: American Experience
Seabiscuit: American Experience
War Letters: American Experience
Battle of the Bulge: American Experience
The above set of posters isn't all that interesting. They are what you'd expect if your clusters were populated with movies with similar titles or other identical markers. What we are looking for are more global signifiers. Since I had stored all the average posters as JPEG-encoded files, I decided to sort them by size on disk and look at the smallest ones. The smallest would be from clusters with many exemplars, smoothed as a result to some form of uniformity. The larger ones, with more variance, would be from clusters with too few exemplars. And this is what the two ends of the size spectrum look like:
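As an aside on why file size tracks cluster size, here is a toy sketch. It averages synthetic grayscale "posters" pixel by pixel and compares how well the averages compress. Two loud assumptions: the posters are random noise rather than real thumbnails, and zlib-compressed size stands in for JPEG file size (the standard library has no JPEG encoder). All function names are hypothetical.

```python
import random
import zlib

def average_posters(posters):
    """Pixel-wise mean of equally sized grayscale images (lists of rows)."""
    n = len(posters)
    height, width = len(posters[0]), len(posters[0][0])
    return [
        [round(sum(p[y][x] for p in posters) / n) for x in range(width)]
        for y in range(height)
    ]

def compressed_size(image):
    """Bytes after zlib compression (a stand-in for JPEG file size)."""
    raw = bytes(pixel for row in image for pixel in row)
    return len(zlib.compress(raw, 9))

def random_poster(rng, width=60, height=40):
    """Synthetic 'poster' of independent random gray levels (toy data only)."""
    return [[rng.randrange(256) for _ in range(width)] for _ in range(height)]

rng = random.Random(0)
small_cluster = [random_poster(rng) for _ in range(2)]
large_cluster = [random_poster(rng) for _ in range(50)]

size_small = compressed_size(average_posters(small_cluster))
size_large = compressed_size(average_posters(large_cluster))
print(size_small, size_large)  # the 50-poster average should compress smaller
```

Averaging many posters pulls every pixel towards the middle of the range, so the result has less variation to encode and the file shrinks. Real posters are not independent noise, of course, which is exactly what makes the surviving structure in the averages interesting.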
Do the more uniform images show any distinct color signatures?
Update (07/04/08): I have included color histograms and what I call color clouds following the excellent suggestion of a member on the Netflixprize forum. That is indeed a very good idea; the aggregate colors are bleached into the median ranges and averaging washes away most information about the distribution. Creating the histograms was a task in itself and deserves a separate post.
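A small sketch of why histograms keep what averaging throws away. This is not the histogram code used for the images in this post (that is left for a separate write-up); the bin count and function names are assumptions for illustration.

```python
def channel_histogram(values, bins=8):
    """Count 8-bit channel values (0..255) into equal-width bins."""
    counts = [0] * bins
    for v in values:
        counts[v * bins // 256] += 1
    return counts

def color_histograms(pixels, bins=8):
    """Per-channel histograms for a list of (r, g, b) pixels."""
    reds, greens, blues = zip(*pixels)
    return {
        "r": channel_histogram(reds, bins),
        "g": channel_histogram(greens, bins),
        "b": channel_histogram(blues, bins),
    }

# Two toy channels with the same mean (127.5) but very different spread.
high_contrast = [0, 255, 0, 255]
mid_gray = [127, 128, 127, 128]

print(channel_histogram(high_contrast, bins=4))  # [2, 0, 0, 2]
print(channel_histogram(mid_gray, bins=4))       # [0, 2, 2, 0]
```

Both channels average to the same muddy gray, yet their histograms are nothing alike: one is all shadows and highlights, the other all midtones.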
Well, the very first one looks rather sinister.
What are the exemplars in this cluster?
The Horror Within
Chupacabra Terror
Mosquito Man
Dracula's Curse
Blood Angels
Larva
Decoys
Dracula 3000
My hindsight-based insight is that almost all horror movies seem to rely very predictably on blood tones and that font. (Does it have a name? Count Drake font?)
Here's a very typical exemplar (Larva)
And, continuing with the color theme, blue?
It is, of course, water
On Any Sunday Revisited
Jack Johnson: The September Sessions
The Endless Summer II
The Bruce Brown Surf Collection: Surfing Hollow Days
Billabong Odyssey
Slippery When Wet
Step Into Liquid
Barefoot Adventure
Any verdant greens?
Ah :)
VeggieTales Classics: Where's God When I'm Scared?
VeggieTales: Madame Blueberry
Franklin and the Green Knight: The Movie (And this one isn't a vegetable; he's just an accidental green!)
VeggieTales: Bible Heroes: Lions, Shepherds and Queens
VeggieTales Classics: Josh and the Big Wall!
VeggieTales: An Easter Carol
The obvious question: Any skin tones?
But it isn't what I expected it to be; it is a Yoga/self-help cluster
Breakthru Pilates Sculpt
Denise Austin: Fat-Blasting Yoga: 21 Days to a Yoga Body
New York City Ballet Workout
Denise Austin: Yoga Buns
Leslie Sansone: Deluxe Walk
Most of the posters for the denser clusters look rather muddy, which is to be expected given that the median grayscale values in each band are bound to be dragged to somewhere in the middle of the [0, 255] range. However, are there any that are peculiarly bright? Yes. Kiddie Clusters.
Baby Einstein: Baby Monet: Discovering the Seasons
Baby Neptune: Discovering Water
Baby Genius: Favorite Nursery Rhymes
Baby Einstein: Neighborhood Animals
Barbie and the Magic of Pegasus
Care Bears: Adventures in Care-a-lot
Strawberry Shortcake: Get Well Adventure
The Peter Rabbit Collection: The Tale of Peter Rabbit and Benjamin Bunny / The Tale of Mr. Tod
Miss Spider's Sunny Patch Kids
Pinocchio
Chrysanthemum and More Kevin Henkes Stories
So yes, not surprisingly, there are similarities among posters within a genre. None of those shown here are surprising, but I guess one could come up with convincing stories for some of the other ones. Here are two tales:
This is the poster derived from a cluster of Indian movies.
1942: A Love Story
Maine Pyar Kiya
Akele Hum Akele Tum
Raja Hindustani
Khuda Gawah
Raju Ban Gaya Gentleman
Virasat
Khiladi
Sharaabi
Dil
Dil Hai Ke Manta Nahin
Beta
Hum
Yes Boss
Kaho Naa Pyaar Hai
Ram Lakhan
Chandni
Fiza
Yeh Dillagi
Mr. India
Being Indian myself, I can get away with saying that this aggregate poster reflects most of the early movie posters that had a certain format where the main protagonists would be shown centered in the poster with the names almost always at the bottom in blocky font. Here are a few examples:
Now consider this one from the Christmas cluster:
A Christmas Carol
Scrooge
Nine Dog Christmas: The Movie
Ernest Saves Christmas
Cartoon Network Christmas: Yuletide Follies
The Year Without a Santa Claus
Welcome to Mooseport
National Lampoon's Christmas Vacation 2: Cousin Eddie's Island Adventure
Rudolph the Red Nosed Reindeer
The aggregate seems to have three bands of text across the top, middle, and bottom.
And looking at some of the posters, that does seem to be the case
Christmas is typically a movie-heavy season, and I believe November and December are the densest months ratings-wise in the Netflix dataset. Increased competition for eyeballs and heavy demand for seasonal movies seem to combine in the form of text-heavy posters that scream Christmas and suggest that consumers not think any more before reaching out for the DVD cases.
What other stories can we glean or weave from these aggregate posters?

Great experiment, very interesting, and the color clouds are an interesting depiction -- maybe superimpose them on the color gamut?
Also, I can see how using a bunch of water-themed movies creates sort of "eigen-poster", or using IMAX movies, etc. Maybe try doing this with a bunch of similarly rated movies from the same genre -- like, all the 5-star movies from Drama, or all the >3 star movies from Action... this might relate movie posters to genre preferences...:D
Good stuff!
Posted by: Anonydave | July 18, 2008 at 06:18 PM
Thanks Dave!
I did in fact do this with similar movies from a genre, but I guess the genre is a description of scale too. At a coarser scale you have say Drama (or Horror, which is included above). Within that coarser scale you have splinter scales like Danielle Steel, or the best rated movies. I didn't find spatial averaging too helpful for the rest, so I am assuming that throwing more movies into the averaging by stepping to a coarser genre definition will not be helpful. The histograms might be better though.
Superimposing on the color gamut is a good idea. I tried a variant of that with the third dimension in a color gamut representing the density. Kind of a surf plot, but the end result was too hard to decipher.
Posted by: Sai | July 19, 2008 at 10:14 AM
Hi,
Interesting experiment, I'd like to try it too. I found your MATLAB code on the Netflix Prize forum. Can you share your Python script that grabs posters (using movie title, I assume) from the Netflix web site?
Posted by: B Yang | January 22, 2009 at 03:12 PM
B Yang,
Sent you the code. Have fun playing with it.
Posted by: Sai | January 23, 2009 at 02:28 AM