Thursday, October 8, 2009

I also really like these companies that create new ad-hoc employment marketplaces

For example, JustAnswer is a company that lets random people post questions on a variety of topics (from auto repair to programming to relationship advice). The questioner specifies how much the answer is worth to him. Then the question is posed to the community. Whoever answers the question in a way that satisfies the questioner gets paid.

Similar to lots of other efforts to "crowd source" resolution of various questions. Major difference (with JA) is that the answerer gets remunerated (i.e., paid for his trouble). Another major difference with JustAnswer in particular is the amount of due diligence they do in order to "certify" various folks (potential answerers) as being Expert-level in some category. The idea being that if I ask a question about an area I don't know much about, I can rely on JA's accreditation of folks as a gauge of who is trustworthy. I took their test to become certified as an expert in programming and found it reasonable: 10 multiple choice questions, not cake, reasonably generic (rather than w.r.t. specific idioms/languages/paradigms).

But basically I am always really pumped about products that create opportunities for small-scale entrepreneurs to run their own businesses, whether that business consists of answering questions (using JA), writing apps for the iPhone, writing apps/games for Facebook (or some other social network), etc.

We also saw this with Coghead a while back, which let individual coders (or small teams thereof) create applications on the Coghead platform and then sell those apps as a service to whomever. So an individual programmer becomes in effect a SaaS provider by leveraging the Coghead foundation. Of course we see something similar with AppEngine, where Google worries about hosting, server administration, and so on. All you have to do is create your Web service (in Java or Python) by providing code to execute when particular URLs are visited, and you're off to the races.
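
For a sense of how little plumbing is involved, here's roughly what a minimal AppEngine Web service looked like in the Python SDK of that era. This is a from-memory sketch, not copied from Google's docs, so treat the exact module paths as assumptions; the point is just that you map URL patterns to handler classes and Google does the rest.

# Minimal AppEngine-style handler (sketch from memory of the circa-2009 Python SDK).
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class MainPage(webapp.RequestHandler):
    def get(self):
        # Code to run when someone visits '/'.
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('Hello from my one-person SaaS!')

# Map URL patterns to handler classes; hosting, scaling, etc. are Google's problem.
application = webapp.WSGIApplication([('/', MainPage)], debug=True)

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()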

The whole thing results in a much more liquid, task-specific employment market, and gives workers a lot more options. In a way it's not so different from the advances we've seen in resource management in computer science. Fueling the resurgence of virtual machines over the past decade was the fact that companies had servers that were not being used wisely. A company might have had a dedicated server for each different service (e.g., mail, web, news, ftp, ...), which resulted in poor resource allocation because some resources sat idle much of the time while others were (at the same time) oversubscribed. With virtualization, resource allocation became more fluid. A high-demand service could be spread across machines whereas quieter (lower-demand) services could all safely share the same hardware without adversely affecting one another.

In any case, I'm way bullish on any company or product that encourages this same "efficient resource allocation" of human capital. Develop a more just-in-time marketplace where matching humans to jobs occurs dynamically and in response to real-time demand. Taking the real-time per-impression ad-auction paradigm and applying it to humans and jobs/work to be done.

Wednesday, October 7, 2009

Some interesting companies (just the gist), 10/06/2009

  • Rapleaf (http://www.rapleaf.com/)

    Figure out what your customers are doing on the Web as far as social networking etc. goes; for individuals: figure out what is being said about you on the Web.

  • TrialPay (http://www.trialpay.com/)

    Real-time, dynamic generation of "deals" for consumers; so if you're at the WinZip site but decide not to buy the software, and later are about to make a purchase at Wine.com, you might be offered a deal that lets you get that WinZip software for cheaper, along with your wine. Rather than deals being a one-size-fits-all, static kind of thing, a deal is generated for a specific user based on that user's past behavior AND based on what it might take to convert that user.

  • JustAnswer (http://www.justanswer.com/)

    Everyday people can pose questions about everything from car repair to computer repair to relationships. A person provides his question along with what it's worth to him to get the question answered. Then a swarm of experts will jump in and try to resolve the matter, earning the fee in the process.

    Not so different from aardvark.com and fixya.com except that the answerer gets compensated in this case. Also, JA takes steps to make sure that the answerer is qualified to answer the question. In order to sign up to be a Programming Languages expert, I had to successfully complete a straightforward multiple-choice quiz of programming questions.

  • Adify (http://www.adify.com/)

    These are the guys responsible for building various vertical ad networks; e.g., the Martha Stewart Living ad network, whose anchor is the Martha Stewart site but which also includes related sites/publishers.

  • meebo (http://www.meebo.com/)

    Overcomes the historical limitation of chat programs, which was that a chat client was tied to a particular chat network and so you could only chat with someone on your same network (sounds almost insane to type this; what a dumb way to do things!). With meebo, you chat via a Web interface, you can sign in with your AIM, Yahoo, MSN, MySpace, Google Talk, etc. accounts, and you can chat with anyone using any other chat client. You can also easily create chat rooms.

  • Yelp (http://www.yelp.com/)

    Standard review site, like CitySearch. Provides reviews of dentists, restaurants, dating services, clothing stores, and so on. A user can vote on reviews and a business can respond to reviews about it. Reviewers are encouraged to generate original content such as lists of the top happy hour spots, the best place for a cheap mani/pedi, and so on. Reviewers are also encouraged to share more info about themselves, to personalize their profiles.

    When you search for something on Yelp, you might also be provided with sponsored search results. That is to say that Yelp probably derives a substantial portion of its revenues from search marketing (i.e., selling text ads based on keywords). A business can also pay Yelp to put one of the reviews of this business at the top of the business's page, so that when a user visits that page, the first review she sees is the selected one. The business still has no control over its other reviews, including the order in which those reviews appear.

    TODO: Look into whether they deliver display ads yet; how they sell the ads that appear in search results (based on keywords? via an auction? on an impression-by-impression basis?); and so on. Basically, do they roll their own ad sales/placement or are they a publisher that belongs to some network, such as AdSense, and hence outsource all ad-sales-related activities? Presumably, since it's Yelp's primary source of income, they manage ad sales directly.

  • Cooliris (http://www.cooliris.com/)

    Their products are Cooliris and CoolPreviews. Cooliris is a browser extension which lets you explore the contents of your disk drive, in particular to look at the pictures in any folder and navigate through those pictures in some cool way. So that is the heart of the Cooliris proposition: the way that you navigate through your photos, the way that they present those photos to you, and so on. So in a way, what Cooliris competes with is your computer's traditional way of displaying your photos to you. On Windows, if a folder contains pictures, I can see small thumbnails of each picture (maybe each is 1.5 inches by 1.5 inches) or I can view the photos in a slideshow. With the slideshow presentation, I see a much larger version of each photo but I can only see a single photo at a time.

    By contrast, with Cooliris, you can simultaneously be looking at a large collection of photos, each of which is a much larger thumbnail. So for starters, you might be looking at two rows of photos, each of which consists of around 4 photos. And the thumbnails for each photo are much larger than traditional thumbnails: maybe 4 inches by 2 inches (but the sizes are variable). Then the new thing is that you can rotate those rows so that, instead of looking at the two rows (one covering the top half of the screen and the second covering the bottom half) straight ahead, you are looking at the rows from a sidelong glance. This lets you simultaneously look at more than 8 photos; maybe more like 12. The effect is that your screen becomes 3-dimensional and the plane containing the photos pivots on its leftmost vertical edge or its rightmost vertical edge.

    By presenting the photos to you using three dimensions, they can present more photos for you to consider at once. Moreover, as you scroll through the rows of photos using the sidelong view, the photos cruise by you. So it's a very dynamic, even bouncy, presentation.

    Now they take this same technology for viewing photos on your hard drive and apply it to viewing photos of products in an online marketplace. So there is a way to "go shopping" from your Cooliris plug-in. Similarly, you can visit various channels, each of which contains a collection of videos related to some topic such as Sports, News, Sci-Tech, Entertainment News, TV, and so on. See also: http://www.pcmag.com/article2/0,2817,2353223,00.asp




  • Jigsaw (http://www.jigsaw.com/)

    This is a modern version of the Dun and Bradstreet databases that I remember using in my investment analyst job, in order to gain information on particular companies. Jigsaw maintains a database of both companies/organizations and individuals. The idea is that, for any individual, you should be able to view that individual's current position and contact info (basically his business card). Also, one can go from an organization to identifying key individuals within that organization (or can just learn more about the organization generally). So if you know you need to talk to the VP of Sales, you can figure out who that is.

    A user can "get points" by adding information about contacts and companies. The information you enter is associated with you, so that there's the basis for reputation scores, that is, for discounting information in the Jigsaw database depending upon who provided that information. Viewing certain info about companies/individuals requires redeeming points (i.e., so you must have previously contributed to the Jigsaw database) OR a paid subscription.

    Primary users are expected to be folks in sales, marketing, or recruiting. See also: http://www.jigsaw.com/corp/jigsaw_corporate_overview.pdf

    TODO: Play with their data; how good is it? How accurately does it capture individuals/businesses of which I am aware?

  • box.net (http://www.box.net/)

    Is a way for a company to host documents in the cloud so that employees in arbitrary locations can collaborate on these documents. The documents can be internal (not publicly visible) or external (publicly visible). Enables creation of online file systems. Users can collaborate on documents. Can use their interface to share (via configuring permissions), access, and manage files in the cloud. All communication (including downloading, uploading, editing files?) takes place over an encrypted pipe (SSL). Your files are stored on multiple servers so that, if one hard disk drive fails, your data is not lost.

    An alternative to Microsoft SharePoint, which is derided as "too complicated." An improvement over emailing docs back and forth between team members (collaborators). Better than FTP, where there is no notion of "merging changes"; if one user changes the doc and replaces the old version on the FTP site, that old version is gone. So changes occur at document granularity. Hence, their primary value proposition is the ease with which one can use their system.

    Sharing: can share individual files or a whole folder of files. Share with someone via providing their email address. Each document can have an associated thread for comments. Can also create "discussions," which are presumably comment threads that are updated in real-time (i.e., not at all different from a chat room on meebo). Can create tasks, to organize work flow. All of your documents, tasks, comments, and discussions can be searched and modified by the people you elect to provide these capabilities to. (Actually, a document's contents cannot apparently be searched without a box business subscription.)

    With a box business subscription, you can view previous versions of a document, customize the box interface, search the content of your documents, view reports, and manage users through an admin console.

    TODO: register and play with their interface, including observing how intuitive and easy-to-use their sharing interface is, as well as how scalable it is (how easy and intuitive is it to maintain many sharing policies, to create sharing policies which apply to data not yet created, and so on). Etc.

  • meraki (http://meraki.com/)

    Traditionally, if you wanted to create a wireless LAN for your enterprise, you would buy a collection of access points (APs), each of which was a hardware device that provided wireless Internet access to users within some radius of the device. To manage your wireless network, then, you had to log into each separate AP, configure it, and monitor its status. That is, you couldn't log into a centralized server in order to get a network-wide view or in order to configure a network-wide policy. You had to interact individually with each separate AP.


    Recognizing the need for centralized monitoring and configuration, vendors introduced a number of controller-based systems with thin APs (also called dependent APs). Unlike standalone APs, most thin APs cannot operate on their own. Rather, they rely on one or more WLAN hardware controllers that need to be installed in wiring closets.
    In this scenario, the hardware controller is a centralized management interface. It directs traffic between wireless and wired networks. It also lets clients roam from one AP to another. The AP connects to the controller over an Ethernet cable; through this cable, the AP obtains both Internet connectivity and power. This set-up is referred to as a controller-based deployment with tethered APs. Controllers are expensive and have installation costs. Also, when a WLAN controller fails, all APs connected to that controller lose connectivity (i.e., fail). So the controller is a single point of failure. ("Dual-redundant controllers, while technically possible, are often prohibitively expensive.")

    The next phase of evolution was to not require that APs physically connect to a controller but rather have a logical tunnel back to the controller. Presumably in this case, the AP derived its power by plugging into a wall socket, and there was a wireless connection from the AP to the controller. Sending data through an AP means the data would also go through that AP's associated controller. With this architecture, you still need expensive, redundant controllers. And it doesn't work so well if you have multiple sites that all need to be on the same wireless LAN.

    With that backdrop, then, meraki's solution entails a single piece of hardware: the access point (actually a set of access points depending upon the number of clients and size of the region for which wireless service is to be provided). Then these access points connect to meraki's data center which consists of a bunch of servers which are used to configure, manage, and monitor a wireless LAN. So the hardware controllers are effectively centralized and moved to the cloud. Then multiple different wireless LANs can be run/managed using the same hardware controllers (so, some virtualization going on).


    The Meraki Cloud Controller is out of band, which means that client traffic never flows through it... Control traffic flows between the APs and the Cloud Controller via a persistent tunnel. All sensitive data, such as configuration details, user names, and passwords, are encrypted... Multiple geographically distributed data centers are used to ensure that networks continue to function even in the event of a catastrophic failure. All management is done remotely through a Web browser... The administrator can also remotely diagnose the APs, using standard tools like ping, from the Meraki remote management interface.
    TODO: Spend more time understanding the various architectural elements and how the meraki system differs from previous solutions etc.



  • Rocket Fuel (http://rocketfuelinc.com/index.html)

    It's an ad network. Their customers are advertisers (who want to run a campaign) and other ad networks (who want to add intelligence to their targeting and optimization but lack the in-house expertise to do so). RF also has partnerships with publishers (on whose sites the ads are run). From their site:


    Rocket Fuel Inc. is building the first intelligent ad serving technology platform that combines the best of social, behavioral, contextual, geographical, search and other data sources to understand consumer interest and intent... We’re experts at predictive modeling and customer segmentation. Our core expertise is in developing and using technology to process and scale huge amounts of data to predict the likelihood of responses from individual users – we find audiences designed for your needs.
    There should be no contextual ad networks or behavioral ad networks. It's like debating whether you should have eyes or ears. If you can have both, it's a win and there's no point debating. There should just be smart ad networks that use as much relevant information as they can to pick the best impressions for a given campaign. An ad server should just use whatever data is found to be valuable in selecting the best ad for an impression, or the best impression for an ad.


Monday, September 21, 2009

So I recently added Google Analytics code to this blog...

Mostly to play with the functionality and so on. But a funny thing happened on the way...
You are given a code snippet to add to your page's HTML, a portion of which is copied here:


<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
...
So I copied and pasted the above, as is, into the window of the Blogger editor (for editing blog posts). The Blogger editor has two editing modes: Edit HTML, which is supposed to let you directly edit the page's HTML, and Compose, which is more like an MS Word interface to editing a blog post; one can easily bold, italicize, change fonts, create bullets, and so on in this mode (using buttons rather than specifying HTML markup tags). That editor then transforms your entered text into HTML.

Because I was given HTML code though (for analytics tracking), I used Edit HTML mode and copied the given JavaScript scripts (as above) directly into the bottom of the page, then saved and published. However, some *additional* markup was added, by the Blogger editor, to the scripts that I inserted. In particular, XHTML line breaks were inserted within the JavaScript script. So, the above became:

<script type="text/javascript"> <br/>
var gaJsHost = (("https:" == document.location.protocol)? "https://ssl." : "http://www."); <br/>
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
So now I have XHTML line breaks (<br/>) within the analytics tracking scripts, which caused any JavaScript interpreter (parsing the above script) to choke and stop processing the script, and hence not to perform the analytics-related tasks. This explains why I wasn't seeing tracking information for clicks that I was sure had landed on my pages. The fix was simple, of course: remove all line breaks in the analytics scripts. Interesting all the same.

(How did I verify that a JavaScript interpreter chokes when it sees <br/> mid-script? Using this sandbox.) (Note also that as far as Google Analytics was concerned, my tracking code was installed properly, which means their checker needs to be fixed.)


Another curiosity is whether I should create separate Analytics "profiles" for each different page of my blog. And whether visiting my blog homepage will cause the analytics tracking scripts (which are inserted within every individual blog post) to *all be executed*. That is, I added analytics tracking scripts to each individual blog post. And when you visit my blog, you automatically see all posts. So presumably, every time I get a visitor to my homepage, all N analytics scripts will execute (where N is the number of blog posts) and so it will seem as if I've had N * x visitors (where x is the number of people who visit lizstinson.blogspot.com).


Tuesday, September 15, 2009

(some) Google Technology, briefly: PageRank, MapReduce, Bigtable

Below are some brief notes, meant only to capture the Big Picture. This simplification may come at the cost of precision and even accuracy. My apologies to the creators. By way of summarization, Google's philosophy is "in-house development designed to run on commodity hardware."

PageRank:
One signal of many. Examine the entire link structure of the Web. Identify which pages are most important. Note that this is not the same thing as identifying which pages are most important relative to some query or within a given vertical. I think it's interesting that they use this overall importance measure rather than identifying importance within a given domain. Then, if it's important to have an overall metric, assign a relative importance to each domain and calculate each page's resulting overall importance from those. In any case, from my incredibly cursory reading, it appears this is not what they do. They have an overall importance metric. Separately, they perform hypertext matching to compare Web documents to the query string.
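
As a refresher for myself, the heart of PageRank is a power iteration over the link graph. Here's a toy Python sketch (my own simplification, with the usual 0.85 damping factor assumed; nothing to do with Google's actual implementation):

# Toy PageRank via power iteration; graph maps each page to the pages it links to.
def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = dict((p, 1.0 / n) for p in pages)
    for _ in range(iterations):
        new_rank = dict((p, (1.0 - damping) / n) for p in pages)
        for page, outlinks in graph.items():
            if not outlinks:
                # A dangling page spreads its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# e.g., pagerank({'a': ['b'], 'b': ['a', 'c'], 'c': ['a']})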

The rationale behind my computing "overall importance" within a domain is that what seems unimportant one day can be quite important the next, e.g., consider pages dealing with "anthrax vaccines" pre-anthrax-letter-mailing and post-. The relative importance of such pages increases dramatically. And so by tweaking relative importance of various domains, we can automatically adjust the relative importance of all documents within those domains. But certainly this approach would have all sorts of challenges of its own, such as what the hell are the domains? How to handle documents that cross domains? Can domain identification be automated? And so on.

In any case, your search results are obtained by considering the relative importance of all pages on the Web and combining that information with which pages best match your particular search query. So it seems that the hypertext analysis is used to determine which documents match and the "overall importance" metric is used to rank the matching documents. So is the idea that all documents match equally? Probably not; probably they use buckets. So all documents in this bucket match approximately equally; hence, use "overall importance" to rank the documents within this bucket. Do the same for the next bucket and so on. Ranking of results across buckets could be done by using all results from the first bucket (which perhaps matches the most closely) followed by all results from the second bucket (which matches the second most closely) and so on. Or something more sophisticated where the results from various buckets are interleaved.
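
To make my speculation concrete, the simple (non-interleaved) version of that bucket scheme is just a two-key sort. A quick sketch, where the bucket assignments and importance scores are assumed to come from elsewhere:

# Speculative ranking: sort first by relevance bucket (0 = closest match),
# then by "overall importance" (e.g., PageRank) within each bucket.
def rank_results(docs, bucket_of, importance_of):
    return sorted(docs, key=lambda d: (bucket_of[d], -importance_of[d]))

# e.g., rank_results(['x', 'y', 'z'],
#                    bucket_of={'x': 1, 'y': 0, 'z': 0},
#                    importance_of={'x': 0.9, 'y': 0.2, 'z': 0.7})
# -> ['z', 'y', 'x']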

Note that the parsing they do of Web pages (in order to build inverted indices which will be used to perform hypertext-matching) does not just scan the page's content. It's more sophisticated, taking into consideration fonts, the way a page is organized or divided, where a word occurs within a page (early on or later). This is meant, in part, to frustrate those who would manipulate results by inserting metatags which suggest that a page contains data that it does not actually visibly present.

MapReduce:
The problem that MapReduce was designed to solve was that Google (and others) had large sets of input data that they needed to perform computation over. If they ran that computation on a single computer, it would take a very long time. So they needed a way to break up the computation into smaller chunks (i.e., parallelize it), distribute those chunks to multiple computers, then merge the results. The raw data in this case might be, for example, crawled documents or web request logs. And the computation or processing might be to: (1) identify the most frequent queries, (2) compute various representations of the graph structure of the crawled documents, (3) summarize the number of pages crawled by each host, or other tasks. The processing or computation is straightforward but the input data set is HUGE and so it has to be split across multiple machines; otherwise it would take forever.

There are a number of questions: How to parallelize this computation? How to distribute the (input) data (to the various machines)? How to handle failures?

So MapReduce is an abstraction that was introduced to allow solving common problems (such as how to handle failure) one time, in the MapReduce implementation, rather than having this machinery be reinvented for each different problem instance. MR "hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library."

The various tasks that they needed to solve in a distributed manner involved a couple of common components. That is, the meat of those tasks could be broken up into two phases: (1) MAP: apply a function to each input item in order to get a set of intermediate key/value pairs; this function is provided by the user and essentially decomposes the input data in a way that makes sense given the target computation; (2) REDUCE: given several sets of intermediate key-value pairs (i.e., one set is generated by each machine participating in the computation and acting on some input data), merge the key-value pairs that have the same key.

As a practical example, consider the problem of identifying the most frequent queries given a set of web request logs. Maybe we'd give one web request log to each participating machine. Then each machine would parse that log using the MAP function, which would result in the machine having a set of key-value pairs where the key is the query string and the value is the frequency with which that string appears in this web-request log. In this case, REDUCTION is trivial; in particular, reduction involves merging all machines' results, which means that, for each key that multiple machines have in common, we add those machines' values for that key to obtain the key's ultimate value (i.e., total frequency). Remember that the "value" in this example is the frequency with which a query string (the key) appeared in the given machine's input data (web request log). Hence, REDUCE here is arithmetic addition but you can imagine a case where the combination of intermediate computation results is more complicated.

Note then that the steps are as follows:
  1. Invoke MAP on an input pair; results in MAP returning a set of intermediate key-value pairs.

  2. MR library obtains these key-value pairs from all participating machines. Then MR library looks at this aggregate set of intermediate key-value pairs (those contributed by each machine) and identifies common intermediate keys.

  3. For each common intermediate key identified in the previous step, invokes REDUCE, supplying the common intermediate key I and all (intermediate) values for I. This function combines those values to obtain a single output value (corresponding to the given input key). Note that in principle, REDUCE could result in merely a different set of values (rather than being required to produce a single aggregate output value); that said, "typically just zero or one output value is produced per Reduce invocation."

    Note that the way that the values (for the given intermediate key) are supplied to Reduce is via an iterator. This makes it OK for an intermediate key to have a set of associated values that is too large to fit into memory.
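
To pin the steps down, here's a toy single-process Python sketch of the query-frequency example above. The function names are mine; the real MR library would run MAP on many machines and do the grouping and REDUCE invocations across the cluster.

# Toy MapReduce for the query-frequency example (runs in one process).
from collections import Counter, defaultdict

def map_fn(log_lines):
    # MAP: one machine parses its web-request log and emits (query, count) pairs.
    counts = Counter(line.strip() for line in log_lines if line.strip())
    return list(counts.items())

def reduce_fn(key, values):
    # REDUCE: combine all intermediate values for one key; here, simple addition.
    return sum(values)

def map_reduce(input_chunks):
    # The "library" part: run MAP on each chunk, group intermediate pairs by key,
    # then invoke REDUCE once per common intermediate key.
    groups = defaultdict(list)
    for chunk in input_chunks:
        for key, value in map_fn(chunk):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# e.g., map_reduce([["foo", "bar", "foo"], ["foo", "baz"]])
# -> {'foo': 3, 'bar': 1, 'baz': 1}
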
How to handle failure? Re-execute. In particular, failure is presumably a single machine dropping dead (or some small number of machines). And since the computation on each such machine is independent of that on other machines, we can restart or re-execute the computation that occurred on the failed machine without having to restart the entire job.

MR is a widely applicable model:


MapReduce "is often used for generating and modifying data stored in BigTable[2], Google Reader,[3] Google Maps,[4] Google Book Search, "My Search History", Google Earth, Blogger.com, Google Code hosting, Orkut[4], and YouTube[5]."

Bigtable:
It's a database and its implementation is built on top of Google's File System (GFS), Scheduler, Lock Service, and MapReduce. Non-Googlers can use Google's Bigtable implementation by developing programs for Google's AppEngine which use the engine's datastore (which sits on top of Bigtable). "Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth."
Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).
Motivating example: Webtable, "a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched..." This retention of time-modified information enables versioning.

A table does NOT have a fixed number of columns. Each table has multiple dimensions; it might have a row for each URL and columns corresponding to: the content stored at that URL at some time, the language of that content, ... Each cell would have an associated time as well, making every two-dimensional table at least a three-dimensional table.

Each table is split, at row boundaries, into multiple tablets. That is, a range of rows is dynamically partitioned into a tablet. So a tablet might contain some number of rows; all of a row's contents can be found within a single tablet. Tables are divided into tablets in order to get tablets of size approximately 100 to 200MB. Each machine stores around 100 tablets (so 100 * 200MB == 20GB) [according to notes from a talk in October 2005, so that may have changed substantially as the cost of disk space continues to drop]. This sizing is done to optimize Bigtable for the underlying file system implementation and:
This setup allows fine grain load balancing (if one tablet is receiving lots of queries, it can shed other tablets or move the busy tablet to another machine) and fast rebuilding (when a machine goes down, other machines take one tablet from the downed machine, so 100 machines get new tablet, but the load on each machine to pick up the new tablet is fairly small).
Special tablets keep track of the file system locations of data tables (akin to how a directory structure keeps track of where particular file blocks are stored on disk). These "meta-tablets" keep a mapping between tablet ID and location within the GFS. If a tablet or segment exceeds the 200MB limit, they use compression to reduce the segment's size.
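
My mental model of that meta-tablet lookup, as a rough sketch (my own simplification, not the paper's METADATA scheme verbatim): keep tablet boundaries sorted by end row key and binary-search them to find which tablet, and hence which GFS location, serves a given row.

import bisect

# Each entry: (end row key of a tablet, its location in GFS), sorted by end row key.
META_TABLET = [
    ("com.cnn.www", "gfs://chunkserver17/tablet-0042"),
    ("com.google.maps", "gfs://chunkserver03/tablet-0108"),
    ("zzzzzzzz", "gfs://chunkserver22/tablet-0999"),  # final catch-all tablet
]

def locate_tablet(row_key):
    # Binary-search for the first tablet whose end row key is >= the requested row.
    end_keys = [end for end, _ in META_TABLET]
    i = bisect.bisect_left(end_keys, row_key)
    return META_TABLET[i][1]

# e.g., locate_tablet("com.example.www/index.html") -> "gfs://chunkserver03/tablet-0108"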

Bigtable Data Model
A table is really a multi-dimensional map which is indexed by row key, column key, and timestamp. Each value in the map (i.e., cell contents) is an uninterpreted array of bytes. The map is distributed and persistent, which means that different parts of it may be stored permanently on different machines in different locations. It's also sparse (though this seems to me more a function of the table's contents than of the table itself), which means that lots of cells have no values; for example, if each row corresponds to a URL then there are many more columns (each corresponding to a feature) than most URLs exhibit. As a contrived example, imagine that there was a column for each possible language. Then a row (URL) will only have an X in the column corresponding to the language in which this URL's content is written. Hence, all other language columns would be empty for this row.

The Bigtable paper represents a table as having three dimensions; hence, one can access a cell value by providing the string identifying the row, the string identifying the column, and the 64-bit integer representing the timestamp. Alternatively, by just providing the row and column, one obtains an array of values contained in that cell at different times (value v0 at time t0, v1 at t1, and so on).
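
Here's the data model sketched as a plain Python mapping, just to capture the indexing structure (obviously nothing like the real storage layer): a sparse map keyed by (row, column), where each cell holds timestamped versions.

# Sketch of the data model: (row key, column key) -> {timestamp: value}.
webtable = {
    ("com.google.maps/index.html", "contents:"): {
        1254355200: "<html>...version fetched at this time...</html>",
        1251676800: "<html>...an earlier version...</html>",
    },
    ("com.google.maps/index.html", "language:"): {
        1254355200: "EN",
    },
}

def read_cell(table, row, column, timestamp):
    # Access by (row, column, timestamp) yields a single uninterpreted value.
    return table[(row, column)][timestamp]

def read_versions(table, row, column):
    # Access by (row, column) yields all versions, newest first.
    cell = table[(row, column)]
    return [cell[ts] for ts in sorted(cell, reverse=True)]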

Dynamic Control over Data Layout and Format: There is a claim that clients can specify the locality of data (namely, Bigtable "provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage."). In particular, Bigtable stores data in lexicographic order by row key. Then the row keys that a client uses determine which data will be stored together and hence enable the client to group data that should be accessed together in adjacent rows. I'm not sure how frequently the "dynamic partitioning" of a table into tablets occurs (whether it's something that's done once up front and retained thereafter OR IF INSTEAD a table might be repartitioned as the contents of various rows grow/shrink).
For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
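
That row-key trick is easy to picture in code; here's a hypothetical helper of my own for building such keys:

def webtable_row_key(url):
    # Reverse the hostname components so pages from one domain sort together.
    # e.g., "maps.google.com/index.html" -> "com.google.maps/index.html"
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + ("/" + path if path else "")
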
Sample Application: Google Analytics, which:
provides aggregate statistics, such as the number of unique visitors per day and the page views per URL per day, as well as site-tracking reports, such as the percentage of users that made a purchase, given that they earlier viewed a specific page.
To enable the service, webmasters embed a small JavaScript program in their web pages. This program is invoked whenever a page is visited. It records various information about the request in Google Analytics, such as a user identifier and information about the page being fetched. Google Analytics summarizes this data and makes it available to webmasters.
The Bigtable for this application consists of one row per end-user session, where the row is named by a tuple consisting of the website name followed by the time the session was created. Then sessions that visit the same site are contiguous in the Bigtable (which is kept sorted lexicographically by row name, recall) and are sorted chronologically. A summary table (approximately 20TB) is generated periodically for each such "raw click table" by running various MapReduce jobs which extract info from the Bigtable (approximately 200TB).

Other applications: Google Earth and Personalized Search.

A lesson they learned in building Bigtable was not to add new features until there's a use case for them.

Tangential items that came up in writing the above:
  1. Inverted indices: a non-inverted (or forward) index contains a list of documents and, for each document, identifies the words that appear in that document. An inverted index, then, is indexed by word and identifies which documents each word appears in. So an inverted index maps from content to location. There are two types of inverted indices: (a) a record-level inverted index (or inverted file) provides, for each word, references to all documents containing that word; (b) a word-level inverted index (a.k.a. inverted list or full inverted index) lists, for each word, the documents in which that word appears and, for each document, the position(s) within that document at which the word appears. (A toy sketch of both appears after this list.)
    http://en.wikipedia.org/wiki/Index_(search_engine)
    http://en.wikipedia.org/wiki/Inverted_index
    http://en.wikipedia.org/wiki/Full_text_search

  2. Functional (programming) languages:
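
Here's the sketch promised in item 1: a toy Python construction of both index types (whitespace tokenization only; the names are my own).

from collections import defaultdict

def build_inverted_indices(documents):
    # documents: {doc_id: text}. Returns (record-level index, word-level index).
    record_level = defaultdict(set)   # word -> set of doc_ids containing it
    word_level = defaultdict(dict)    # word -> {doc_id: [positions of the word]}
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            record_level[word].add(doc_id)
            word_level[word].setdefault(doc_id, []).append(position)
    return record_level, word_level

# e.g., build_inverted_indices({1: "the quick fox", 2: "the lazy dog"})
# record_level["the"] -> {1, 2}; word_level["the"] -> {1: [0], 2: [0]}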




Saturday, August 15, 2009

Notes on The Long Tail by Chris Anderson

The Long Tail, by Chris Anderson

Introduction
Our culture has been fixated on best seller lists for about a century. The reason is that the distribution costs for movies, television shows, music groups, even books were such that only a relatively few could be distributed. So the goal was to maximize the audience for the relatively few TV shows, movies, and music groups that are distributed so as to recoup costs. As a practical example, consider your local bookstore. It only has so much shelf space. So naturally it wants to choose the books to populate that shelf space that have the greatest chance of selling. And so the focus on best sellers. Anderson refers to the condition wherein only a small fraction of possible products is provided to an audience as economic scarcity.

For some people, a blockbuster movie really is the best movie they can imagine; whereas for other people (who also see that blockbuster movie), the movie is merely good enough. So there is a sheepherding effect of broadcast distribution where audiences are cajoled into seeing a movie, for example, because it's the best available.

With broadband, every audience member can access different content (as opposed to broadcast, which entails sending a single signal to all audience members). Consequently, the old paradigm was a single mass market (a state that is optimized for broadcast economics) and the new paradigm is millions of niche markets. When users are given choice, they exercise it. Furthermore, the best sellers are selling comparatively fewer copies (than they sold before); so these niche markets are taking from the single mass market.

These niche markets always existed; the difference is that the cost of reaching their natural customers has fallen. Anderson presents an analogy for these falling distribution costs; in particular, he likens them to a receding tide. As the tide falls, an underlying landscape is revealed that had been there all along, just underwater. "Think of the waterline as being the economic threshold... the amount of sales necessary to satisfy the distribution channels."
These niches are a great uncharted expanse of products that were previously uneconomic to offer. Many of these kinds of products have always been there, just not visible or easy to find.
Now the niche products are available via Netflix, iTunes, Amazon, ...

Quiz: given a digital jukebox which contains 10,000 albums, what percent of albums sell at least one track per quarter? The answer: 98%. "What if non-hits — from healthy niche products to outright misses — all together added up to a market as big as, if not bigger than, the hits themselves?" This 98-percent rule held for Amazon, Apple, Netflix. In looking at the distribution of sales across this vast expanse of inventory, you still have lots of sales for a relatively few hits. Then it falls off steeply but never to zero. This is a long-tail curve because the tail is very long relative to the head. Anderson reflects: "I realized that, for the first time, I was looking at the true shape of demand in our culture, unfiltered by the economics of scarcity."

His claims: the tail is longer than we realize (i.e., there's more variety), people can economically access products at the tail (in contrast to pre-broadband times), the aggregation of all those niche markets is itself a significant market. New efficiencies in distribution, manufacturing, and marketing — changing definition of what was commercially viable. The Long Tail "is really about the economics of abundance — what happens when the bottlenecks that stand between supply and demand in our culture start to disappear and everything becomes available to everyone."

Chapter 1: The Long Tail
The tyranny of locality: need to find a sufficiently large local audience for a product in order for it to be economically viable to carry that product. E.g., a movie theatre will not show a film unless it can attract an audience of at least 1500 people over two weeks. The same applies to book or music stores, movie rental joints, and so on. So lots of movies might have a great national audience but can't crack into the various local markets because of this hurdle.

Another limitation of the physical world is that there are limited distribution channels for many goods. A book store shelf can only hold so many books; a radio spectrum can only be divided into a finite number of bands; a coaxial cable can only carry so many TV channels; and so on.

These conditions have meant that one needs to aggregate large audiences in one geographic area. And this has led to the focus on releasing hits (as a way to ensure that the audience-size hurdle is cleared and a profit can be generated). He refers to this as hit-driven economics.

The long tail is long. Consider Rhapsody (an online music retailer) which has (at the time of printing) 4M tracks. Every one of Rhapsody's top 1M tracks is streamed at least once a month. Also, if you add up all the non-hits, you get a market that rivals the hit market. For Amazon, 25% of its book sales (by unit? or by aggregate value or dollar amount?) come from outside of the top 100,000 titles. And this percentage is the fastest-growing part of their business. "The act of vastly increasing choice seemed to unlock demand for that choice." "More than 99% of music albums on the market today are not available in Wal-Mart."

Also, these fringe sales are incredibly cost efficient since it costs very little for Amazon to carry another title (much less than it would cost a traditional brick-and-mortar retailer, which would need to allocate shelf space for the title and so on). And for purely digital services (where you don't even need a warehouse to carry the physical product because there isn't one) such as iTunes, the cost efficiency is even greater.
When you can dramatically lower the costs of connecting supply and demand, it changes not just the numbers, but the entire nature of the market. This is not just a quantitative change, but a qualitative one too. Bringing niches within reach reveals latent demand for non-commercial content. Then, as demand shifts toward the niches, the economics of providing them improve further, and so on, creating a positive feedback loop that will transform entire industries — and the culture — for decades to come.
Chapter 2: The Rise and Fall of the Hit
America was a rural society where people were distributed across the land — creating a very fragmented culture. And so developed regional accents and traditions. Then the industrial revolution brought people to the cities. Then media were developed which created a common culture: commercial printing, photography, the phonograph. Pop culture was transmitted using these technologies, which linked people together and synchronized them. All were reading the same newspapers.

Then broadcast mediums were developed: radio and TV. But initially these were just able to generate signals strong enough to reach local and regional (rather than national) audiences. Then AT&T developed long-distance networks, over which voice and audio could be transmitted nationally. The first broadcasts were carried nationally by going over long-distance phone lines. 1935 to the 1950s was the Golden Age of Radio: Edward Murrow, Bing Crosby. Then TV took over as the principal inculcator of shared culture.

The music industry, though, peaked in 2000, and by 2005 music sales had dwindled more than 25% from their peak. And the number of hit albums dropped more than 60%. "In other words, although the music industry is hurting, the hit-making side of it is hurting more. Customers have shifted to less mainstream fare..."

Where did the customers go? One hypothesis is that people continued to acquire new music but did so from free providers such as Napster. So it isn't that demand dropped, it's just that this demand could be satisfied without a commercial transaction or sale. Anderson points out that services such as Napster not only provided music free of cost but they also substantially increased the available variety of music. BigChampagne tracks all files shared on the major P2P networks. "What it's seeing in the data is nothing less than a culture shift from hits to niche artists." He also mentions the existence of mashups which are combinations of tracks: one track from one artist played (or layered) over another. Or similar transformations of original content that results in new content. A micro-hit is a hit within a narrow music niche, of which there might be hundreds of thousands.
The traditional model of marketing, selling, and distributing music has gone out of favor. The major label and retail distribution system that grew to titanic size on the back of radio's hit making machine found itself with a business model dependent on huge, platinum hits — and today there are not nearly enough of those. We're witnessing the end of an era.
But current media industries are still oriented to the hit-making model; it's still all about finding and creating blockbusters. "Setting out to make a hit is not exactly the same thing as setting out to make a good movie." And "We are turning from a mass market back into a niche nation, defined now not by our geography but by our interests."

Chapter 3: A Short History of the Long Tail
He makes the point here that the Long Tail phenomenon is not different in kind from previous innovations, only in scale and digital particulars. For example, it was back in the late 1800s that Sears, Roebuck got its start, offering customers a much wider variety of things, and at much better prices, than those folks could get in their local "general stores." Sears was able to make this work by creating a mail-order catalog and buying goods in volume. It also listed in its catalog stuff that it sourced from other proprietors (a network of suppliers), who would then ship stuff directly to the customer — which sounds an awful lot like Amazon's Marketplace.

Sears and Roebuck also developed innovative supply chain efficiencies. In particular, they studied and improved the assembly line of preparing goods within the warehouses for shipping to the customer. Finally, they used "viral marketing" by having customers share catalogs with non-customers and giving a referral bonus (discount on goods) to the sharer.

With the development of the Internet, a lot of the constraints of bricks-and-mortar establishments (amount of shelf space, locations, staff, working hours, weather) fall away — leaving unlimited selection.

Chapter 4: The Three Forces of the Long Tail
The theory of the Long Tail can be boiled down to this: Our culture and economy are increasingly shifting away from a focus on a relatively small number of hits (mainstream products and markets) at the head of the demand curve, and moving toward a huge number of niches in the tail. In an era without the constraints of physical shelf space and other bottlenecks of distribution, narrowly targeted goods and services can be as economically attractive as mainstream fare.
"The true shape of demand is revealed only when customers are offered infinite choice."
  • Lots more non-hits than hits.
  • The cost of reaching people who would be interested in the non-hits is falling.
  • Need ways for such people to find or locate such goods. These ways (he refers to them as filters) "can drive demand down the Tail."
  • Once more niche goods generated and people learn about them, "the demand curve flattens. There are still hits and niches, but the hits are relatively less popular and the niches relatively more so."
  • There are so many niche products that if you add up demand for all of them, it's comparable to demand for the hit products.
  • Only when infinite choice has been made available to consumers in a way that they can grok it can we determine "the natural shape of demand" — "undistorted by distribution bottlenecks, scarcity of information, and limited choice of shelf space."
A Long Tail occurs when it becomes cheaper to reach niches (as with Amazon, which can offer random books to customers that would not be economical for local booksellers to offer — since the local booksellers wouldn't get enough sales of the random books to justify keeping them in stock). There are a few specific factors that together encourage TLT:

1. Democratize the tools of production.
Basically, make it so that just about anyone can generate stuff. The PC did this; now, webcams and software for blogging and for recording/editing music and video are ubiquitous. "The result is that the available universe of content is now growing faster than ever... the number of new albums released grew a phenomenal 36% in 2005, to 60,000 titles... largely due to the ease with which artists can now record and release their own music."

2. Cut the costs of consumption by democratizing distribution.
For content that consists of bits (e.g., MP3s, videos, ...), everyone who creates can also be a distributor — for example via uploading their video production to YouTube.
Over decades and billions of dollars, Wal-Mart set up the world's most sophisticated supply chain to offer massive variety at low prices to tens of millions of customers around the world. Today anybody can reach a market every bit as big with a listing on eBay.
3. Connect supply and demand.
This consists of the way by which customers become aware of and are able to efficiently search the variety of niche goods. For example, iTunes recommendations, Netflix recommendations, Amazon recommendations, blogs, customer reviews, and so on. All of these things drive demand down the Tail. "Because it's now so easy to tap this grassroots information when you're looking for something new, you're more likely to find what you want faster than ever. That has the economic effect of encouraging you to search farther outside the world you already know..."

Toolmakers, producers: how we make stuff
  • Populates the Tail.
Aggregators: how we distribute stuff (e.g., eBay, Amazon, ...)
  • Makes everything in the Tail available.
Filters: how we find stuff (e.g., Google, blogs, recommendations, best-seller lists)
  • Helps people find what they want.
Chapter 5: The New Producers
The open-source movement has been a software phenomenon where lots of people voluntarily pitch in. But this same phenomenon is occurring outside of software with things like spotting stars: since an astronomical sighting requires a viewer to have his telescope focused on a particular part of the sky, the work is easily parallelized, with different folks training their telescopes on different parts of the sky. No expertise required.

The pushing down of the tools of production from the few elite to the masses has meant that there are now many more creators. "Don't be surprised if some of the most creative and influential work in the next few decades comes from this Pro-Am class of inspired hobbyists, not from the traditional sources in the commercial world. The effect of this shift means that the Long Tail will be populated at a pace never before seen." (Pro-Am is a term which describes Pros working alongside Amateurs.)

He also talks about the shift from elites generating authoritative content within a controlled framework to the current model where systems are self-governing, as with Wikipedia: "now we're depending more and more on systems where nobody's in charge; the intelligence is simply 'emergent,' which is to say that it appears to arise spontaneously from the number-crunching." The more people who participate, the more the law of large numbers is at play, which means the better the quality overall. In such probability-based systems, "order arises from what appears to be chaos, seemingly reversing entropy's arrow." Such systems also improve over time.

With these statistical systems, there is only a statistical level of quality: some things will be great, some will be mediocre, some will be very bad. Thus, the quality range is broader in these systems than it would be in traditional governed systems, in which all output will exceed at least a minimum quality threshold. With traditional systems, the range of material covered is much less broad, as is the depth at which each topic is covered. This is due to "bandwidth" issues: only so much space can be used and only a finite number of creators can participate. These constraints are not in play with a Wikipedia-like system.
This is the world of "peer production," the extraordinary Internet-enabled phenomenon of mass volunteerism and amateurism. We are at the dawn of an age where most producers in any domain are unpaid, and the main difference between them and their professional counterparts is simply the (shrinking) gap in the resources available to them to extend the ambition of their work. When the tools of production are available to everyone, everyone becomes a producer.
Why do people create content for free? Expression, Fun, Experimentation, Reputation. Some artists are using free content as a way to build an audience that will eventually (hopefully) become paying customers for the wares on display (or for future, related wares). See Lulu.com, which is a do-it-yourself publisher... just about anyone can get published now. Even traditional booksellers are using print-on-demand, which entails printing batches of a book title when sufficient demand warrants it (rather than at the outset printing a humongous batch). With print-on-demand, there is a smaller gamble taken. Print-on-demand converts a physical market into a digital one, freeing booksellers from occupying warehouse shelves with a lot of physical goods.

He also talks about the "increasingly frictionless mobility in the Long Tail," which lets content at the bottom of the Tail move to the top of the Tail quickly and easily. Then he talks about Andy Samberg and his crew and how they broke into SNL. "Think about how many of those potential talents now have a chance to find a real audience, thanks to the democratized distribution of the Internet."

Chapter 6: The New Markets
Consider the used books market. Previously, it was a very difficult market to search the full breadth of. One could go to his local used bookstore which had, more or less, a random collection of used books. One couldn't find a particular used book very easily, though surely somewhere that book was available for resale (or would be if the owner knew there was a buyer). Richard Weatherford developed Alibris to make this market more efficient. Each used-book seller entered his inventory into the database, enabling a person to easily search across the inventory of thousands of used-book stores.
By bringing millions of customers to the used-book market, this gave used-book stores even more incentive to computerize their inventories, which, in turn, gave Alibris... even more inventory to sell. It was a classic virtuous circle, and the effect supercharged used-book sales...
Alibris is a Long Tail 'aggregator' — a company or service that collects a huge variety of goods and makes them available and easy to find, typically in a single place. What it did by connecting the distributed inventories of thousands of used-book stores was to use information to create a liquid market where there was an illiquid market before... it tapped the latent value in the used-book market. And it did it at a tiny fraction of the cost that it would have required to assemble that much inventory from scratch, by outsourcing most of the work of assembling the catalog to the individual booksellers, who type in and submit the product listings themselves.
Other aggregators (or entities that provide access to something that the entity itself does not own): Google, iTunes, Netflix, eBay, ...

There are a couple of categories of online aggregators: those that sell physical goods online and those that sell digital goods online. The aggregators that sell physical goods can offer incredible variety but at some point they hit a space limitation (for storing the goods). (But what if you outsource storage to individual micro-proprietors, as Amazon's Marketplace effectively does? Then you don't run into the physical storage limitations, right?) By contrast, the aggregators that sell digital goods can sell everything all the way down the Tail. Distribution costs in the physical-goods case are shipping costs; in the purely digital model, they're merely broadband megabytes. Pure digital retailers can choose between selling goods as standalone products or as a service. So in the digital case, we have "the holy grail of retail — near-zero marginal costs of manufacturing and distribution."
As this program [Amazon Marketplace] continues to grow, Amazon gets closer and closer to breaking the tyranny of the shelf entirely. It doesn't have to guess ahead of time where the demand is going to be, and it doesn't have to guess at how big that demand will be. All the risk within the Marketplace program is outsourced to a network of small merchants who make their own decisions, based on economics, on what to carry.
On Amazon getting closer to economic nirvana: Amazon is using print-on-demand, as described above, which stores books as digital files until they're purchased, at which point they're printed on laser printers. "Now Amazon can retain an inventory that takes up no space and has no cost at all: These books and movies remain files in a database somewhere until they're ordered." The biggest cost to publishers is the cost of returns from booksellers. With print-on-demand, this goes away — "potentially reducing returns radically."

"The ultimate cost reduction is eliminating atoms entirely and dealing only in bits. Pure digital aggregators store their inventory on hard drives and deliver it via broadband pipes... Because the goods are digital, they can be cloned and delivered as many times as needed, from zero to billions. A best-seller and a never-seller are just two entries in a database; equal in the eyes of technology and the economics of storage."

Chapter 7: The New Tastemakers
A few stories of how tech is being used in music and how this is changing the music industry.

First, Bonnie McKee was a new singer. And there were questions about how to market her, what audience (demographic) to target. The label thought she was a natural fit for adult contemporary, women ages 25 - 35. To validate this hypothesis, they did a limited release of her songs using LAUNCHcast, which is an online radio station. A LAUNCHcast user has a playlist created specifically for him (or her) based on his/her ratings of previously listened-to songs — i.e., an adaptive recommendation system. The label paid for placement and promotion (on LAUNCHcast) — essentially buying an initial audience but also buying that audience's feedback on McKee's songs. Also, LAUNCHcast is part of Yahoo!, which has lots of demographic info about its users, which could be provided back to the label along with the ratings. Through this experiment, the label learned that in fact the audience which responded best to McKee's songs was girls ages 13 - 17.

The second story is about a band called My Chemical Romance whose label established a presence for the band on two music-heavy social networking sites. The label released a couple of singles to limited audiences, then used file-trading data (which tracks were the most searched? the most downloaded?) to figure out which singles to release in what order (release the hottest one now). The band also leveraged its online audience to real-world effect, namely by encouraging audience members to contact radio stations and ask them to play the band's songs. The band's fans were given access to exclusive tracks, access that they repaid via heavy word-of-mouth on behalf of the band.

The third story is about a band that was able to displace the role of the music label entirely by using resources online (and because of technological advances which drove down the cost of recording-studio time). Birdmonster used the Internet to identify possible gigs and to build a base of support (via MySpace). This base was leveraged by the band in its attempts to secure gigs — to demonstrate that the band would be able to draw a crowd. Birdmonster recorded three tracks at a local independent studio, then self-published this mini-album via CDBaby. CDBaby takes albums on consignment and sells them online. CDBaby transferred the tracks to iTunes to sell. The band drummed up free marketing by sending tracks to MP3 blogs. As a consequence of this attention, buzz, and sales, Birdmonster started getting offers from established recording labels — offers which Birdmonster turned down.

The traditional role of a music label is to scout talent and provide financing, distribution, and marketing. Because of advances in digital recording technology, the price of studio time was lower than ever — about $15k for an entire album. So financing wasn't a problem. And the band was able to get distribution via CDBaby and Cinderblock which released the tracks via iTunes, Rhapsody, etc. Finally, the band got free high-credibility marketing from various music blogs. Technology has shifted the balance of power from the label to the band.
Yahoo! music ratings, Google PageRank, MySpace friends, Netflix user reviews — these are all manifestations of the wisdom of the crowd. Millions of regular people are the new tastemakers. Some of them act as individuals, others are parts of groups organized around shared interests, and still others are simply herds of consumers automatically tracked by software watching their every behavior.
As the culture fragments from a single mass market to millions of niche markets, we see the rise of microcelebrities — that is, celebrities for each such niche; people whose opinions are respected when it comes to fashion, music, and so on. The slice of the world over which such people have authority can be as narrow as a niche is.

The resources that help a person locate stuff that person might like are filters, which "sift through a vast array of choices to present you with the ones that are most right for you. That's what Google does when it ranks results..." This is not dissimilar from the role that the Yellow Pages played in years of old. Filters have the effect of driving demand down the tail as with Netflix's movie recommendation system which results in most Netflix rentals being "back catalog" (rather than new releases). As Reed Hastings says by way of explanation: "it's not because we have a different subscriber. It's because we create demand for content and we help you find great movies that you'll really like."

As an example, consider the types of filters that people use in perusing an online music store. You can choose a category (niche) and will be presented with a best-seller list for that category. If you click on an artist or song within that category, you might be presented with related artists and music. You can also see an artist's followers, who have perhaps written reviews and created playlists. Finally, you can try a custom radio station tailored around a particular artist or micro-genre. These are all ways of weeding through the vast database of music in the Tail.
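
To make the "filter" idea concrete, here's a tiny Python sketch of one simple way a "people who liked this also liked..." recommendation can work: plain co-occurrence counting over made-up rental histories. This is not Netflix's (or anyone's) actual algorithm, just the general flavor of a post-filter.

    from collections import Counter, defaultdict

    # Hypothetical rental histories (made-up data), one set of titles per user.
    histories = [
        {"Amelie", "City of God", "Oldboy"},
        {"Amelie", "City of God", "Shrek"},
        {"Shrek", "Finding Nemo"},
        {"Oldboy", "City of God"},
    ]

    # Count how often each title is consumed alongside every other title.
    co_counts = defaultdict(Counter)
    for basket in histories:
        for title in basket:
            for other in basket - {title}:
                co_counts[title][other] += 1

    def recommend(title, k=3):
        # The k titles most often consumed together with `title`.
        return [t for t, _ in co_counts[title].most_common(k)]

    print(recommend("Amelie"))   # 'City of God' first, then the less frequent co-rentals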

There are some limitations to current music filtering. In particular, the same sample format is presented for all genres of music (the 30-second portion of a track) even though this sample may not be equally helpful in evaluating music from all genres. Also, even just the listing format for a song: artist, album. This is specialized for the pop-music domain and may be less appropriate for classical music which has a composer and separately a performer. Ideally, the way things are presented for each niche is specialized to that niche. Bestseller lists across genres are not helpful in the way that rankings within a genre are.

Since the Long Tail contains so much stuff (the vast majority of stuff), the filters needed to navigate it must be better than those needed to navigate the head, which contains comparatively few items. Noise is a big problem in the Tail. Because hit products are tailored to reach the greatest audience, they are lowest common denominator. What this means is that the products in the head won't necessarily be the best products for you, because they have specifically been designed to be the least offensive, the most vanilla. Hence, the best products for you (the ones that most closely match your specific tastes) are likely to be in the Tail somewhere. And there is a wider range of quality in the Tail generally than there is in the head.

"As the Tail gets longer, the signal-to-noise ratio gets worse. Thus, the only way a consumer can maintain a consistently good enough signal to find what he or she wants is if the filters get increasingly powerful."

There are pre-filters, which determine what gets made and how (i.e., shape the development of a product), and post-filters, which determine how well-liked the made things are. A pre-filter attempts to predict what will best appeal to audiences whereas a post-filter merely amplifies how audiences feel about a particular thing; "in Long Tail markets, the role of filter shifts from gatekeeper to advisor. Rather than predicting taste, post-filters such as Google measure it."

Chapter 8: Long Tail Economics
80-20 rule: demonstrated variously; first witnessed in wealth distribution (20% of the people had 80% of the wealth). Zipf noted that the frequency with which a word appeared was proportional to one divided by that word's rank (the 2nd item occurred half as much as the first; the 3rd item occurred one third as much as the first; and so on). These are both examples of power-law distributions. The conditions that lead to power laws in consumer markets are: (1) variety of products, (2) unequal quality of products, and (3) network effects that amplify the differences in quality. These result in a "predictable imbalance" of markets and culture... Success breeds success.
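
A quick Python sketch of Zipf's rank-frequency relationship; the top frequency below is an invented number, used only to show the shape.

    # Zipf: the frequency of the item at rank r is roughly proportional to 1/r,
    # so rank * frequency stays roughly constant down the list.
    top_frequency = 12000.0          # frequency of the rank-1 item (made up)
    for rank in range(1, 6):
        freq = top_frequency / rank  # idealized Zipf frequency
        print(rank, round(freq), round(rank * freq))
    # rank 2 occurs half as often as rank 1, rank 3 a third as often, and so on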

If you plot a power-law distribution on a log-log scale (both the x and y axes are logarithmic) then you get a straight line with negative slope (the steepness of which depends on the particular distribution). When we look at box office receipts, however, instead of a straight line all the way across, we see a line that falls off a cliff partway down the ranks. So the curve is in effect truncated. What happened? Do box office receipts NOT follow a power-law distribution?

They do. What we're seeing is the effect of the constrained carrying capacity of the US theatrical industry. That is, the movie screens across the country can only carry so many movies; in particular, around 100. There might be some variation in which movies are carried by which theatres (e.g., contrast a megaplex in a Midwestern city with an art-house theatre in the East Village), so we see movies ranked 100 - 500 getting box office receipts that obey the power law. But movies ranked 500 to 13,000 have no theatrical distribution. Hence the steep drop-off in box office receipts. "The lesson is that what we thought was a naturally sharp drop-off in demand for movies after a certain point [rank] was actually just an artifact of the traditional costs of offering them." What we're starting to see is that non-theatrical distribution channels (e.g., the Internet and direct-to-DVD) are becoming major markets.
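
Here's a toy Python illustration of that truncation: idealized 1/rank demand over 13,000 titles, with only the top 100 or so getting screens. The demand numbers and the exponent are assumptions; only the shape matters.

    # Idealized power-law demand over movie ranks; only the top-ranked titles
    # get theatrical distribution, which is what produces the cliff on the
    # log-log plot. Numbers are illustrative, not real box office data.
    SCREEN_CAPACITY = 100                      # assumed theatrical carrying capacity
    TOTAL_TITLES = 13000                       # ranks 1..13,000, per the text
    demand = [1000000.0 / rank for rank in range(1, TOTAL_TITLES + 1)]
    realized = sum(demand[:SCREEN_CAPACITY])   # titles that actually get screens
    latent = sum(demand)                       # all demand, screened or not
    print("theatres capture about %.0f%% of latent demand"
          % (100.0 * realized / latent))       # roughly half, under these assumptions
    # the rest is stranded in the tail, waiting for non-theatrical channels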

The 80-20 rule is an example of a Zipf distribution and has classically been applied to products and revenues (20% of products account for 80% of revenues). Since carrying costs for hits and misses were the same (and substantial), this meant that a substantially greater share of profits came from hits (whereas the sales of misses merely covered their costs but yielded no margin). With lower carrying costs of inventory, the other 80% of products become more profitable, and so proprietors would do well to offer them. A Long Tail retailer, by contrast, carries much more inventory and at lower cost. The top 20% of a traditional retailer's inventory (e.g., 1,000 albums) represents just 2% of an LT retailer's inventory (since LT retailers carry much more variety — many more different products). That 2% accounts for 50% of sales, the next 8% accounts for 25% of sales, and the bottom 90% accounts for the remaining 25%.
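
A back-of-envelope Python version of that bookkeeping, using the catalog sizes implied by the text (1,000 albums is 20% of a 5,000-title store but only 2% of a 50,000-title one):

    traditional_catalog = 5000     # implied: 1,000 albums is its top 20%
    long_tail_catalog = 50000      # implied: the same 1,000 albums is just 2%
    head_titles = 1000
    print(float(head_titles) / traditional_catalog)   # 0.2  -> 20% of the small catalog
    print(float(head_titles) / long_tail_catalog)     # 0.02 -> 2% of the big catalog
    # Reported sales split for the Long Tail retailer:
    shares = {"top 2%": 0.50, "next 8%": 0.25, "bottom 90%": 0.25}
    assert abs(sum(shares.values()) - 1.0) < 1e-9     # the three slices cover all sales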

Consider new-release DVDs which big-box retailers (e.g., Best Buy) sell at a loss when they first come out. Why? Because they are a loss leader, a product sold at or below cost to stimulate other profitable sales (i.e., they bring people into the store where those people buy more/other stuff as well). Also DVD manufacturers allow unsold new releases to be returned, which lowers the risk to retailers. An effect of this market is that Best Buy and company set the prices for these new releases, even for other businesses which have different models — e.g., Blockbuster which doesn't sell many other products (and hence for which the loss leader advantage is lost).

Blockbuster and Co. therefore want to create a market that isn't so dependent on new releases, because new releases are not profitable for them. The way to do this is to drive demand down the curve. Long Tail products are cheaper to acquire (DVD manufacturers charge more for new releases and less for older titles) and hence can be very profitable — as long as inventory costs are kept close to zero. So, LT retailers: (1) offer many more products, (2) sales are spread more evenly between hits and niches, and (3) there are profits at all levels of popularity.

Does a LT mean that the hits at the head sell less? More? The same?

What shifts demand down the Tail? (1) Variety, (2) Lower search costs — easier to navigate this increased variety in order to find what you want, and (3) sampling — able to "test drive" a purchase (read a chapter of the book, listen to 30 seconds of the song) — which reduces the purchase risk.

Quantifying the effect of lowered search costs for the same inventory: compare online to catalog sales. Both channels have the same inventory; the online inventory is just easier to navigate. The effect was that customers tended to buy further down the Tail online. In the catalog, 20% of the products accounted for 84% of sales. Online, 20% accounted for 71% of sales.
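
Put another way (a tiny Python check using the figures above):

    # Same top 20% of products, two channels (figures from the study above).
    catalog_head_share = 0.84                     # print catalog: head's share of sales
    online_head_share = 0.71                      # online store: head's share of sales
    catalog_tail_share = 1 - catalog_head_share   # 16% of sales from the other 80%
    online_tail_share = 1 - online_head_share     # 29% of sales from the other 80%
    print(online_tail_share / catalog_tail_share) # ~1.8x as much tail demand online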

Measuring the effect of having more selection available: compare a retailer with unlimited shelf space to one with limited shelf space. The study used industry-wide data for bricks-and-mortar entertainment purveyors and compared it to Rhapsody and Netflix. The result was that the online demand curve is much flatter: the average niche music album sold twice as many copies online as offline; the average niche DVD sold three times as many copies online. The top 1,000 albums accounted for 80% of the offline market and less than 1/3 of the online market.

Does TLT grow the pie or just slice it differently?
Some forms of entertainment are non-rivalrous with respect to your time, i.e., you can do something else while you consume them. You're more likely to consume more if it doesn't cost you more to do so. This means that subscription services may be more likely to drive demand down the tail, since consumers can explore niche markets without paying more. It may be that users will be more satisfied with their niche purchases than they were with hit products (because the niche product more closely matches the consumer's preferences than one-size-fits-all blockbusters) and, furthermore, that this increased satisfaction will yield more sales (i.e., grow the pie).

The effects of TLT on pricing? See reading.

Microstructure in TLT
Power laws are fractal — no matter how far you zoom in, they still look like power laws; also described as "self-similarity at multiple scales." The music market is made up of thousands of niche micromarkets or miniature ecosystems, each of which has its own head and tail.
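
A quick Python sketch of that scale invariance (the exponent and scale are arbitrary assumptions): the drop-off from rank r to rank 10r is the same no matter where you are on the curve, which is why any zoomed-in region of a log-log plot looks like the whole.

    def sales(rank, exponent=1.0, scale=1000000.0):
        # idealized power-law sales curve (exponent is an assumption)
        return scale / rank ** exponent

    for r in (1, 10, 100, 1000):
        print(r, sales(r) / sales(10 * r))   # prints 10.0 at every scale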

How come the powerful feedback loops that yield 80-20 distributions do NOT have the same effect in TLT — i.e., do not make hits ever more popular and niches ever less so?
We've noted that what we see in TLT is a flattened power law (rather than an exaggerated one), that is, there is less difference between hits and niches. The answer is that recommendation networks have more muted effects between or across genres than they have within a genre. "Popularity exists at multiple scales and ruling a clique doesn't necessarily make you the homecoming queen."

TLT of time: new things sell better than old things.

Since TLT is about abundance and economics is based on the premise of scarcity, does that mean that everything we know about econ no longer applies? That is, two of the main scarcity functions of traditional economics are the marginal costs of manufacturing and distribution; these costs are trending to zero in LT markets of digital goods. Anderson claims that since the overall system is still constrained by scarcity, the traditional laws of economics are still in play. Gilder: "In every industrial revolution, some key factor is drastically reduced in cost. Relative to the previous cost to achieve that function... physical force in the industrial revolution."
That suggests a way to put this in an economic context. If the abundant resources are just one factor in a system otherwise constrained by scarcity, they may not challenge the economic orthodoxy.

Chapter 9: The Short Head
Online shopping still accounts for less than 10% of American retail. There are tactile and instant-gratification advantages to bricks-and-mortar establishments. Also, there will always be blockbusters; "For each way that we differ from one another, there are more ways that we are alike." Hits still have unmatched impact and are a source of common culture. LT aggregators need to provide the full range of variety, "from the broadest appeal to the narrowest, to be able to make the connections that can illuminate a path down the Long Tail that makes sense for everyone" (i.e., familiar head products give customers a toe-hold into the market, which then lets them explore products further down the tail). And because "Customers love one-stop shopping." iTunes succeeded in part because it provided a critical mass of mainstream music.

How come MySpace — which focuses mostly on independent music — still works, despite not offering much mainstream music? Because of its combination of social networking with music, which has the effect of keeping both fresh (avoiding the burnout of pure social networking, where folks connect just for connection's sake).

Population clusters also exhibit power laws.

138 million Americans shop at Wal-Mart every week.

The Short Head: Wal-Mart carries only 4500 unique CD titles (Amazon carries 800k). Wal-Mart accounts for 20% of all music sales in America (presumably this includes both their online and physical sites). Wal-Mart has something in about every category of product but not much (broad and thin as opposed to deep and narrow); they provide "a veneer of variety."

Challenge of physical goods: they force us into crude categorization. A can of tuna cannot simultaneously be located in all the different categories that it might belong to; e.g., health food, canned goods, fish, protein, low-fat, casserole components, and so on. You have to guess at what category most people would use for tuna fish (and for every other product). "With the evolution of online retail, however, has come the revelation that being able to recategorize and rearrange products on the fly unlocks their real value." "The efficiency and success of online retail have illuminated the cost of traditional retail's inflexibility and taxonomical oversimplifications." The tricky question of where to put things: the ontology problem.
On the other hand, think about a world of ad-hoc organization, determined by whatever makes sense at the time. That's more like a big pile of stuff on a desk instead of rows of items stringently arranged on shelves. Sure it may seem messy, but that's just because it's a different kind of organization: spontaneous, contextual order, easily reordered into a different context as need be... a world of infinite variety and little predetermined order; a world of dynamic structure, shaped differently for each observer.
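
Here's a small Python sketch of the difference (product names and tags are made up): a physical store must commit each item to exactly one shelf, while an online catalog can index every item under every tag and rebuild the "aisles" per query.

    from collections import defaultdict

    products = {
        "canned tuna":   ["canned goods", "fish", "protein", "low-fat", "casserole"],
        "black beans":   ["canned goods", "protein", "vegetarian"],
        "salmon fillet": ["fish", "protein", "fresh"],
    }

    # Physical retail: pick exactly one shelf per product (a guess).
    shelf = dict((name, tags[0]) for name, tags in products.items())

    # Online retail: index every product under every tag; rearrange at will.
    by_tag = defaultdict(list)
    for name, tags in products.items():
        for tag in tags:
            by_tag[tag].append(name)

    print(shelf["canned tuna"])       # 'canned goods' -- the one guessed category
    print(sorted(by_tag["protein"]))  # all three products, assembled on demand
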
The economics of broadcast are such that one can reach a million people as easily as one (for the same cost), and revenues (e.g., from advertising) are variable. There are constraints: only 24 hours in a day, only so many channels per cable. So the way to make money is to get a big enough audience to make the most from that slot. "The general principle of the last hundred years of entertainment economics was that content and distribution were scarce and consumer attention was abundant... It was a sellers' market and they could afford to waste attention." This led to ad clutter: ad time per hour went from 6 minutes 48 seconds in the mid-1980s to 12 minutes 4 seconds.

More recently, people have started to watch less television. Or rather, people brought up on the Internet watch less television than their predecessors did at their age. "The audience is migrating away from broadcast to the Internet, where niche economics rule."

Chapter 10: The Paradise of Choice
Huge explosion in variety available. Why? Globalization and hyperefficient supply chains. The variety of goods imported increased by a factor of 3 from 1972 to 2001. Another answer: demographics. We are more diverse and people want to be special rather than wanting to achieve some singular common ideal (e.g., being like the Joneses).

But growing chorus that too much choice is no good; it's oppressive because we can't navigate it and become stressed by the fear of making the wrong choice (something that increases in likelihood as more options are made available). The solution is to help users navigate the array of choices so that they can be confident in their particular selection. The navigational support can come in the form of different orderings of stuff: by price, ratings, date, genre. Can determine what "people like you" have preferred (bought). The limitation of the supermarket shelf is the absence of this extra ordering information, which is simply physically more difficult to provide to the customer in this environment vs. in the online environment. This is the resolution of the paradox of choice, which held that consumers loved choice but that having more of it didn't actually seem to make them happier: we not only need choice, we also need information about the various options, so as to increase the likelihood that we'll be happy with our ultimate selection.
The more choice we have, the more we have to decide what we really want. The more we reflect on what we really want, the more involved we get in the creation of the goods we buy and use [via customization]. The more we participate in the creation of products and services, the more choices we end up creating for ourselves.
Does more choice make people buy more, though? Results are inconclusive.

Chapter 11: Niche Culture
House music (the Rave scene) was a reaction against the bankruptcy of blockbuster culture. The spread of affordable technology — mixing decks and multi-track recorders — let DJs produce house music. Clubs and warehouses served as democratized distribution channels. The market fragmented into hyperspecialized genres: deep house, funky house, dub house. A mechanism was needed to navigate this new landscape (i.e., to serve as a filter) — the label was used to fill this need. House music producers opened up their goods to be remixed and tweaked: "A house record that does well often attracts remixes from other producers; it becomes a kind of platform... As the number of complements increases, the value of the platform track snowballs."

We're seeing a shift from mass culture to massively parallel culture, made up of lots of little tribes or microcultures, and each of us belongs to many different ones. Almost every one of us, though we may be pretty mainstream, "goes super-niche in some part of our lives." That is, Postrel: "Most of us cluster somewhere in the middle of most statistical distributions. But there are lots of bell curves, and pretty much everyone is on a tail of at least one of them." Anderson: "This has always been true, but it's only now something we can act on. The resulting rise of niche culture will reshape the social landscape. People are re-forming into thousands of cultural tribes of interest, connected less by geographic proximity and workplace chatter than by shared interests."

We've seen this in the news industry; "In effect, blogs pick off the mainstream media's customers one by one by being niche where their old-media precursors are mass." A concern about whether this will fracture our relationships to one another — since we are no longer consuming the same mass culture. Anderson claims that "What we've lost in common culture, we've made up in our increased exposure to other people [who populate the niches we explore]... Rather than being loosely connected with people thanks to superficial mass-cultural overlaps, we have the ability to be more strongly tied to just as many if not more people with a shared affinity for niche culture."

Chapter 12: The Infinite Screen

Chapter 13: Beyond Entertainment
  1. eBay: the Long Tail of merchants and of products. Has sales volume equivalent to Wal-Mart's. Distributed inventory: it's just a web site where buyers and sellers meet and agree upon a price. Inventory costs: $0. Self-service model: merchants create their own product listings and handle their own packaging and mailing. So eBay has very few employees relative to its revenues; it has around $5M in revenues per employee (Wal-Mart has around $170,000 in revenues per employee). Also, eBay provides filters (search, multi-level category structure) to help buyers navigate.

    eBay is the largest used-car dealer and the largest seller of automotive parts. It extends from the head to the tail. But eBay is not perfect. It doesn't have a standardized product-description framework and so it doesn't have knowledge of what's being sold on its site — knowledge that could be navigated by a computer program. For this reason, eBay isn't able to build product recommendation systems and the like, which would presumably drive demand even further down the Tail.

  2. KitchenAid: mixers come in various colors. In making all of them available to consumers, a Long Tail emerged.

  3. LEGO: has a mail-order business that caters to enthusiasts. At least 90% of its products are not available in traditional retail outlets; these direct sales account for about 10 - 15% of LEGO's sales. These sales have higher margins, though (no retailer to share profits with).

  4. Salesforce.com: rather than ship software that was installed in the usual way and used to manage sales contacts, Salesforce offers software as a service. Customers navigate their browsers to the site and effectively run the software on Salesforce's servers. Meant that businesses could save on tech expenses (relating to administering and maintaining software installations). Benioff responded to the threat of Oracle and Co me-too'ing him to death by opening up his platform to other independent developers to develop their own software as a service and make it available to customers.

    The head of the software business is Microsoft. The cost of writing software has plummeted, and the cost of delivering software decreased with the move from a CD-ROM to a download model. Now one can also navigate the world of software products more easily too.

  5. Google: the traditional advertising market — classic, hit-centric industry. The ad-business historically entailed a lot of expensive schmoozing because: "a lack of trusted performance metrics makes salesmanship and personal relationships key to winning business" (I quote this because it precisely describes the type of field I wanted completely to avoid in my own career.) Ad sales people focus on the largest and most lucrative of potential advertisers: "In other words, the system is biased toward the head of the advertising curve."

    "As with every other makret we've looked at, that head is just a tiny fraction of the potential market. But because it's so expensive to sell advertising the traditional way, the smaller potential advertisers have been left to their own devices, mostly picking up a phone and placing a classified ad or sending some homemade display copy to the local newspaper. That's pretty much how advertising has worked for most of the past century."

    "What Google realized is that if it could take most of the cost out of both selling and buying advertising, it could dramatically increase the pool of potential ad buyers and sellers. Software could do almost all of the work, therby lowering the economic barrier to entry and reaching a much larger market."

    Google's ad model: (1) is based on keywords, which is a Long Tail; (2) dramatically lowered the cost of reaching the market — use a simple and cheap self-service model; (3) provided the same benefits to publishers — allowing them to be a platform on which ads could be served the same way that traditional big newspapers were in the past. The pages on which ads are served (i.e., the publishers' sites) are cost-free inventory for Google. So the opportunity costs are borne by the publisher, not Google.

    Google expanded its search options from keyword and text to other vertical categories: Maps, Scholar, Products, News, Books, Video, and so on. These are like different niches. Search results are customized in a way that makes sense for the category; e.g., video search results would include a clip, for books — a snippet, for images — a thumbnail, and so on.