Tuesday, September 15, 2009

(some) Google Technology, briefly: PageRank, MapReduce, Bigtable

Below are some brief notes, meant only to capture the Big Picture. This simplification may come at the cost of precision and even accuracy; my apologies to the creators. To summarize, Google's philosophy is "in-house development designed to run on commodity hardware."

PageRank:
One signal of many. Examine the entire link structure of the Web and identify which pages are most important. Note that this is not the same thing as identifying which pages are most important relative to some query or within a given vertical. I think it's interesting that they use this overall importance measure rather than identifying importance within a given domain (and then, if an overall metric is still needed, assigning a relative importance to each domain and calculating each page's resulting overall importance from that). In any case, from my incredibly cursory reading, it appears this is not what they do: they have an overall importance metric. Separately, they perform hypertext matching to compare Web documents to the query string.

The rationale behind my computing "overall importance" within a domain is that what seems unimportant one day can be quite important the next; e.g., consider pages dealing with "anthrax vaccines" pre-anthrax-letter-mailing and post-. The relative importance of such pages increases dramatically. And so by tweaking the relative importance of various domains, we could automatically adjust the relative importance of all documents within those domains. But certainly this approach would have all sorts of challenges of its own, such as: what the hell are the domains? How to handle documents that cross domains? Can domain identification be automated? And so on.

In any case, your search results are obtained by considering the relative importance of all pages on the Web and combining that information with which pages best match your particular search query. So it seems that hypertext analysis is used to determine which documents match, and the "overall importance" metric is used to rank the matching documents. Is the idea that all documents match equally? Probably not; probably they use buckets: all documents in a given bucket match approximately equally, so "overall importance" is used to rank the documents within that bucket, and the same is done for the next bucket and so on. Ranking of results across buckets could be done by listing all results from the first bucket (which matches the most closely), followed by all results from the second bucket (which matches the second most closely), and so on, or by something more sophisticated where the results from various buckets are interleaved.
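
Setting that bucket speculation aside, the core of an "overall importance" computation can be pictured as a power iteration over the Web's link graph. The toy sketch below (Python) is only meant to illustrate the general PageRank idea; the link graph, damping factor, and iteration count are invented, and Google's production computation is certainly far more elaborate.

    # Toy PageRank via power iteration. The link graph, damping factor, and
    # iteration count are invented for illustration.
    links = {
        "a.com": ["b.com", "c.com"],
        "b.com": ["c.com"],
        "c.com": ["a.com"],
        "d.com": ["c.com"],
    }

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}           # start with uniform importance
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:                      # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:                                 # share rank equally among outlinks
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
            rank = new_rank
        return rank

    print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))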

Note that the parsing they do of Web pages (in order to build inverted indices which will be used to perform hypertext matching) does not just scan the page's content. It's more sophisticated, taking into consideration fonts, the way a page is organized or divided, and where a word occurs within a page (early on or later). This is meant, in part, to frustrate those who would manipulate results by inserting meta tags which suggest that a page contains content it does not actually visibly present.

MapReduce:
The problem that MapReduce was designed to solve was that Google (and others) had large sets of input data that they needed to perform computation over. If they ran that computation on a single computer, it would take a very long time. So they needed a way to break up the computation into smaller chunks (i.e., parallelize it), distribute those chunks to multiple computers, then merge the results. The raw data in this case might be, for example, crawled documents or web request logs. And the computation or processing might be to: (1) identify the most frequent queries, (2) build representations of the graph structure of the crawled documents (i.e., their link structure), (3) summarize the number of pages crawled by each host, or other tasks. The processing or computation is straightforward, but the input data set is HUGE and so it has to be split across multiple machines; otherwise it would take forever.

There are a number of questions: How to parallelize this computation? How to distribute the (input) data (to the various machines)? How to handle failures?

So MapReduce is an abstraction that was introduced so that common problems (such as how to handle failure) could be solved once, in the MapReduce implementation, rather than having this machinery be reinvented for each different problem instance. MR "hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library."

The various tasks that they needed to solve in a distributed manner shared a couple of common components. That is, the meat of those tasks could be broken up into two phases: (1) MAP: apply a function to each input item in order to get a set of intermediate key/value pairs; this function is provided by the user and essentially decomposes the input data in a way that makes sense given the target computation; (2) REDUCE: given several sets of intermediate key-value pairs (i.e., one set generated by each machine participating in the computation and acting on some of the input data), merge the key-value pairs that have the same key.

As a practical example, consider the problem of identifying the most frequent queries given a set of web request logs. Maybe we'd give one web request log to each participating machine. Then each machine would parse that log using the MAP function, leaving the machine with a set of key-value pairs where the key is a query string and the value is the frequency with which that string appears in this web-request log. In this case, REDUCTION is trivial; in particular, reduction involves merging all machines' results, which means that, for each key that multiple machines have in common, we add those machines' values for that key to obtain the key's ultimate value (i.e., total frequency). Remember that the "value" in this example is the frequency with which a query string (the key) appeared in the given machine's input data (web request log). Hence, REDUCE here is arithmetic addition, but you can imagine a case where the combination of intermediate computation results is more complicated.
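
As a minimal sketch of the user-supplied pieces of this query-counting job (written in Python for brevity, whereas the paper describes a C++ library; the log-line format here is invented):

    from collections import Counter

    def map_fn(log_name, log_contents):
        """MAP: parse one web-request log and emit (query, frequency-in-this-log)
        pairs. The log-line format ("GET /search?q=...") is invented."""
        counts = Counter()
        for line in log_contents.splitlines():
            if "/search?q=" in line:
                query = line.split("/search?q=", 1)[1].split()[0]
                counts[query] += 1
        for query, freq in counts.items():
            yield (query, freq)

    def reduce_fn(query, freqs):
        """REDUCE: combine the per-log frequencies for one query; here this is
        just arithmetic addition."""
        yield (query, sum(freqs))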

Note then that the steps are as follows:
  1. Invoke MAP on an input pair; MAP returns a set of intermediate key-value pairs.

  2. The MR library obtains these key-value pairs from all participating machines. It then looks at this aggregate set of intermediate key-value pairs (those contributed by each machine) and identifies common intermediate keys.

  3. For each common intermediate key identified in the previous step, the library invokes REDUCE, supplying the common intermediate key I and all (intermediate) values for I. This function combines those values to obtain a single output value (corresponding to the given intermediate key). Note that in principle, REDUCE could produce merely a different set of values (rather than being required to produce a single aggregate output value); that said, "typically just zero or one output value is produced per Reduce invocation."

    Note that the way that the values (for the given intermediate key) are supplied to Reduce is via an iterator. This makes it OK for an intermediate key to have a set of associated values that is too large to fit into memory.
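
Here is a toy, single-process stand-in for what the MR library itself does across those steps, reusing map_fn and reduce_fn from the sketch above: invoke MAP on each input, group the intermediate pairs by key, and hand each key's values to REDUCE through an iterator. The real library distributes this across many machines and adds partitioning, sorting, and fault tolerance.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        """Single-process stand-in for the MR library: map, group by
        intermediate key, then reduce."""
        # Step 1: invoke MAP on each input pair.
        intermediate = defaultdict(list)
        for name, contents in inputs:
            for key, value in map_fn(name, contents):
                intermediate[key].append(value)
        # Steps 2-3: for each common intermediate key, hand its values to
        # REDUCE via an iterator, mirroring the interface described above.
        output = []
        for key, values in intermediate.items():
            output.extend(reduce_fn(key, iter(values)))
        return output

    logs = [("log0", "GET /search?q=mapreduce\nGET /search?q=bigtable"),
            ("log1", "GET /search?q=mapreduce")]
    print(run_mapreduce(logs, map_fn, reduce_fn))   # [('mapreduce', 2), ('bigtable', 1)]
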
How to handle failure? Re-execute. In particular, failure is presumably a single machine dropping dead (or some small number of machines). And since the computation on each such machine is independent of that on other machines, we can restart or re-execute the computation that occurred on the failed machine without having to restart the entire job.

MR is a widely applicable model. MapReduce "is often used for generating and modifying data stored in BigTable, Google Reader, Google Maps, Google Book Search, "My Search History", Google Earth, Blogger.com, Google Code hosting, Orkut, and YouTube."

Bigtable:
It's a database and its implementation is built on top of Google's File System (GFS), Scheduler, Lock Service, and MapReduce. Non-Googlers can use Google's Bigtable implementation by developing programs for Google's App Engine that use the engine's datastore (which sits on top of Bigtable). "Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth."
Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).
Motivating example: Webtable, "a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched..." This retention of time-modified information enables versioning.

A table does NOT have a fixed number of columns. Each table has multiple dimensions; it might have a row for each URL and columns corresponding to the content stored at that URL at some time, the language of that content, and so on. Each cell has an associated time as well, making every two-dimensional table at least three-dimensional.

Each table is split, at row boundaries, into multiple tablets. That is, a range of rows is dynamically partitioned into a tablet. So a tablet might contain some number of rows; all of a row's contents can be found within a single tablet. Tables are divided into tablets in order to get tablets of approximately 100-200MB each. Each machine stores around 100 tablets (so 100 * 200MB == 20GB) [according to notes from a talk in October 2005, so that may have changed substantially as the cost of disk space continues to drop]. This sizing is done to optimize Bigtable for the underlying file system implementation and:
This setup allows fine-grained load balancing (if one tablet is receiving lots of queries, the machine can shed other tablets or move the busy tablet to another machine) and fast rebuilding (when a machine goes down, other machines each take one tablet from the downed machine, so ~100 machines each pick up a new tablet, but the load on each machine to pick up the new tablet is fairly small).
Special tablets keep track of the file system locations of data tablets (akin to how a directory structure keeps track of where particular file blocks are stored on disk). These "meta-tablets" keep a mapping between tablet ID and location within GFS. If a tablet or segment exceeds the 200MB limit, they use compression to reduce the segment's size.
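
As a crude sketch of the meta-tablet idea, imagine a sorted list of (last-row-key-in-tablet, GFS location) entries and a binary search over it to find which tablet holds a given row. The entries and GFS paths below are invented; the real system keeps this mapping in special metadata tablets.

    import bisect

    # Invented metadata: (last row key in tablet, location of that tablet).
    meta = [
        ("com.cnn.www", "gfs://cell-a/tablet-0001"),
        ("com.google.maps/index.html", "gfs://cell-b/tablet-0002"),
        ("org.wikipedia.en", "gfs://cell-a/tablet-0003"),
        ("\xff", "gfs://cell-c/tablet-0004"),   # sentinel: last tablet covers the rest
    ]

    def locate_tablet(row_key):
        """Binary-search the sorted last-row keys for the tablet whose row
        range contains row_key."""
        keys = [last_key for last_key, _ in meta]
        return meta[bisect.bisect_left(keys, row_key)][1]

    print(locate_tablet("com.google.maps/about.html"))   # gfs://cell-b/tablet-0002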

Bigtable Data Model
A table is really a multi-dimensional map which is indexed by row key, column key, and timestamp. Each value in the map (i.e., each cell's contents) is an uninterpreted array of bytes. The map is distributed and persistent, which means that different parts of it may be stored permanently on different machines in different locations. It's also sparse (though this seems to me more a function of the table's contents than of the table itself), which means that lots of cells have no values; for example, if each row corresponds to a URL and each column to a feature, most URLs exhibit only a few of the many possible features, so most cells in a row are empty. As a contrived example, imagine that there were a column for each possible language. Then a row (URL) would only have an X in the column corresponding to the language in which that URL's content is written; all other language columns would be empty for this row.

The Bigtable paper represents a table as having three dimensions; hence, one can access a cell value by providing the string identifying the row, the string identifying the column, and the 64-bit integer representing the timestamp. Alternatively, by just providing the row and column, one obtains an array of values contained in that cell at different times (value v0 at time t0, v1 at t1, and so on).
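
One way to picture this data model, purely as an illustration (real Bigtable cells are uninterpreted byte arrays behind a C++ API, not Python dictionaries; the row, columns, and timestamps below are invented):

    # Toy picture of the Bigtable model: (row key, column key, timestamp) -> bytes.
    webtable = {
        "com.cnn.www": {
            "contents:": {
                1257000000: b"<html>...older snapshot...</html>",
                1257086400: b"<html>...newer snapshot...</html>",
            },
            "anchor:cnnsi.com": {
                1257086400: b"CNN",
            },
        },
    }

    # Full three-part lookup: a single cell value.
    cell = webtable["com.cnn.www"]["contents:"][1257086400]

    # Row and column only: all versions of that cell, newest first.
    versions = sorted(webtable["com.cnn.www"]["contents:"].items(), reverse=True)
    print(cell)
    print([ts for ts, _ in versions])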

Dynamic Control over Data Layout and Format: There is a claim that clients can specify the locality of data (namely, Bigtable "provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage."). In particular, Bigtable stores data in lexicographic order by row key. Then the row keys that a client uses determine which data will be stored together and hence enable the client to group data that should be accessed together in adjacent rows. I'm not sure how frequently the "dynamic partitioning" of a table into tablets occurs (whether it's something that's done once up front and retained thereafter OR IF INSTEAD a table might be repartitioned as the contents of various rows grow/shrink).
For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
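
A tiny sketch of that row-key trick; the helper name is mine, and it ignores URL schemes, ports, and other real-world details:

    def to_row_key(url):
        """Reverse the hostname components so pages from the same domain
        sort next to each other."""
        host, sep, path = url.partition("/")
        return ".".join(reversed(host.split("."))) + sep + path

    urls = ["maps.google.com/index.html", "news.google.com/", "www.cnn.com/world"]
    for key in sorted(to_row_key(u) for u in urls):
        print(key)
    # com.cnn.www/world
    # com.google.maps/index.html
    # com.google.news/
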
Sample Application: Google Analytics, which:
provides aggregate statistics, such as the number of unique visitors per day and the page views per URL per day, as well as site-tracking reports, such as the percentage of users that made a purchase, given that they earlier viewed a specific page.
To enable the service, webmasters embed a small JavaScript program in their web pages. This program is invoked whenever a page is visited. It records various information about the request in Google Analytics, such as a user identifier and information about the page being fetched. Google Analytics summarizes this data and makes it available to webmasters.
The Bigtable for this application consists of one row per end-user session, where the row is named by a tuple consisting of the website's name followed by the time at which the session was created. Sessions that visit the same site are therefore contiguous in the Bigtable (which, recall, is kept sorted lexicographically by row name) and are sorted chronologically. A summary table (approximately 20TB) is generated periodically from this "raw click table" (approximately 200TB) by running various MapReduce jobs which extract information from the Bigtable.
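
To see why that row naming gives both per-site contiguity and chronological order, here is a tiny illustration; the site names and timestamps are invented:

    # Row names: (website, session-creation time). Lexicographic order groups
    # all sessions for a site together and sorts them by time.
    rows = [
        ("shop.example.net", 1252950000),
        ("blog.example.com", 1252951000),
        ("shop.example.net", 1252940000),
        ("blog.example.com", 1252940500),
    ]
    for row in sorted(rows):
        print(row)
    # ('blog.example.com', 1252940500) ... ('shop.example.net', 1252950000)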

Other applications: Google Earth and Personalized Search.

A lesson they learned in building Bigtable was not to add new features until there's a use case for them.

Tangential items that came up in writing the above:
  1. Inverted indices: a non-inverted (or forward) index contains a list of documents and, for each document, identifies the words that appear in that document. An inverted index is instead indexed by word and identifies which documents each word appears in; that is, an inverted index maps from content to location. There are two types of inverted indices: (a) a record-level inverted index (or inverted file) provides, for each word, references to all documents containing that word; (b) a word-level inverted index (a.k.a. inverted list or full inverted index) lists, for each word, the documents in which that word appears and, for each document, the position(s) within that document at which the word appears. (A small sketch of both types follows this list.)
    http://en.wikipedia.org/wiki/Index_(search_engine)
    http://en.wikipedia.org/wiki/Inverted_index
    http://en.wikipedia.org/wiki/Full_text_search

  2. Functional (programming) languages: the MAP and REDUCE operations above are inspired by the map and reduce (fold) primitives found in Lisp and many other functional languages, which is where MapReduce gets its name.
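
Here is a small sketch of the two inverted-index types from item 1, built over a handful of invented documents:

    from collections import defaultdict

    docs = {
        "doc1": "google builds bigtable on gfs",
        "doc2": "mapreduce builds inverted indices",
        "doc3": "bigtable uses gfs and mapreduce",
    }

    # Record-level inverted index: word -> set of documents containing it.
    record_level = defaultdict(set)
    # Word-level (full) inverted index: word -> {document -> [positions]}.
    word_level = defaultdict(lambda: defaultdict(list))

    for doc_id, text in docs.items():
        for position, word in enumerate(text.split()):
            record_level[word].add(doc_id)
            word_level[word][doc_id].append(position)

    print(sorted(record_level["gfs"]))    # ['doc1', 'doc3']
    print(dict(word_level["builds"]))     # {'doc1': [1], 'doc2': [1]}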



