#1
Junior Member (25+)
Guys... this article describes how Google efficiently manages such a huge amount of data and still returns fast search results.
The key to the speed and reliability of Google search is cutting up data into chunks.

Google Machinery: To deal with the more than 10 billion Web pages and tens of terabytes of information on its servers, Google combines cheap machines with plenty of redundancy. Its commodity servers cost around $1,000 apiece, and Google's architecture places them into interconnected nodes. All machines run a stripped-down Linux kernel. The distribution is Red Hat, but Google doesn't use much of it. Moreover, Google has created its own patches for things that haven't been fixed in the original kernel.

The downside of cheap machines is that they must be made to work together reliably. They are cheap and easy to put together, but they break. In fact, at Google, many fail every day. So Google has automated methods of dealing with machine failures, allowing it to build a fast, highly reliable service on cheap hardware.

The Search: Google replicates the Web pages it caches by splitting them up into pieces it calls "shards". The shards are small enough that several fit on one machine, and they are replicated on several machines, so that if one breaks, another can serve up the information. The master index is also split up among several servers, and that set is also replicated several times. These servers are called chunk servers.

As a search query comes into the system, it hits a Web server and is then split into chunks of service. One set of index servers contains one full copy of the index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, the replication also increases throughput: if one set is busy, a new query can be routed to the next set, which drives down search time per box. In parallel, clusters of document servers contain copies of the Web pages that Google has cached.
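To make the "one complete set of index servers per query, several replicated sets" idea concrete, here is a minimal Python sketch. It is only an illustration: the names `IndexShard`, `ReplicaSet`, `build_shards` and `route_query`, the `busy` flag, and the choice to shard by document id are all assumptions made for the example, not details from the article.

```python
from collections import defaultdict


class IndexShard:
    """One slice of the index: maps a word to the ids of documents in this shard."""

    def __init__(self, postings):
        self.postings = postings

    def lookup(self, word):
        return self.postings.get(word, set())


def build_shards(docs, n_shards):
    """Split documents across n_shards by document id (an assumed sharding rule)."""
    tables = [defaultdict(set) for _ in range(n_shards)]
    for doc_id, text in docs.items():
        for word in text.lower().split():
            tables[doc_id % n_shards][word].add(doc_id)
    return [IndexShard(dict(table)) for table in tables]


class ReplicaSet:
    """One complete copy of the index, spread over several shards."""

    def __init__(self, shards):
        self.shards = shards
        self.busy = False  # hypothetical load signal

    def search(self, words):
        hits = set()
        for shard in self.shards:
            # Every shard in the set must be asked; each returns the
            # documents it holds that contain all query words.
            per_word = [shard.lookup(w) for w in words]
            if per_word:
                hits |= set.intersection(*per_word)
        return hits


def route_query(replica_sets, words):
    """Send the query to the first replica set that is not busy."""
    for replica_set in replica_sets:
        if not replica_set.busy:
            return replica_set.search(words)
    raise RuntimeError("all replica sets are busy")


docs = {
    1: "google file system",
    2: "google search index",
    3: "linux kernel",
    4: "search index shards",
}
shards = build_shards(docs, n_shards=2)
set_a, set_b = ReplicaSet(shards), ReplicaSet(shards)
set_a.busy = True  # the first set is loaded, so the query goes to the next one
result = route_query([set_a, set_b], ["search", "index"])
```

In a real deployment each replica set would live on its own machines; here both sets share the same shard objects just to keep the sketch short.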
The refresh rate is from one to seven days, with an average of two days. That's mostly dependent on the needs of the Web publishers. Each set of document servers contains one copy of the Web. These machines are responsible for delivering the content snippets that show searchers relevant text from the page. When the top 10 results are available, they are sent to the document servers, which load the 10 result pages into memory. These pages are then parsed to find the best snippet that contains all the query words.

The Backbone of Google's Architecture: Google uses three software systems built in-house to route queries, balance server loads and make programming easier.

The Google File System was written specifically to deal with cheap machines that will fail. All files are broken into chunks and distributed randomly across different machines in such a way that each chunk has at least two copies that are not physically adjacent, i.e., not on the same power line or connected to the same switch. Chunks are typically 64 megabytes and are replicated three times. All this replication makes it easier to make changes: Google simply takes one replica at a time offline, updates it, then plugs the machines back in.

Because the chunks are randomly distributed, Google needs a master containing metadata to keep track of where the chunks are. When a query comes into the system, the file system master tells the client which chunk server has the data. From there on, the client talks only to the chunk servers. Client machines are responsible for dealing with fault tolerance. If a client requests a file from the specified chunk server and gets no response within the designated time period, it uses the metadata to locate another chunk server, while sending the master a hint that the first chunk server might have died.
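The client-side failover described above could look roughly like the following Python sketch. Everything here is a stand-in made up for illustration: the `Master`, `ChunkServer` and `read_chunk` names, the `ChunkServerDown` exception (used in place of a real network timeout), and the in-memory chunk tables are assumptions, not the actual Google File System interface.

```python
class ChunkServerDown(Exception):
    """Stands in for 'no response within the designated time period'."""


class ChunkServer:
    def __init__(self, name, chunks, alive=True):
        self.name = name
        self.chunks = dict(chunks)
        self.alive = alive

    def read(self, chunk_id):
        if not self.alive:
            raise ChunkServerDown(self.name)
        return self.chunks[chunk_id]


class Master:
    """Holds only metadata: which chunk servers store which chunk."""

    def __init__(self, locations):
        self.locations = locations  # chunk id -> list of ChunkServer replicas
        self.suspects = []          # servers that clients reported as possibly dead

    def locate(self, chunk_id):
        return list(self.locations[chunk_id])

    def hint_dead(self, server):
        self.suspects.append(server.name)


def read_chunk(master, chunk_id):
    """Ask the master where the chunk lives, then talk to chunk servers directly."""
    for server in master.locate(chunk_id):
        try:
            return server.read(chunk_id)
        except ChunkServerDown:
            # Tell the master this server might have died, then try the next replica.
            master.hint_dead(server)
    raise IOError(f"no live replica for chunk {chunk_id!r}")


cs_a = ChunkServer("cs-a", {"chunk-1": b"page data"}, alive=False)
cs_b = ChunkServer("cs-b", {"chunk-1": b"page data"})
master = Master({"chunk-1": [cs_a, cs_b]})
data = read_chunk(master, "chunk-1")
```

Note that the master never touches the chunk data itself in this sketch; it only answers "where" questions and collects hints, which matches the division of labour the article describes.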
If the master confirms that the chunk server went down, it replicates the chunks that were on it to another server, making sure that the information is replicated at least the minimum number of times.

To enable Google programmers to write applications that run in parallel on 1,000 machines, engineers created the Map/Reduce Framework in 2004. This framework provides automatic and efficient parallelization and distribution. It is fault tolerant and it does the I/O scheduling, being a little bit smart about where the data lives. Programmers write two simple functions, map and reduce. The process starts from a long list of key/value pairs; the mapping function then produces other key/value pairs. For example, if an application is needed to count URLs on one host, the programmer would take the URL and the contents and map them into the pair consisting of the hostname and 1. This produces an intermediate set of key/value pairs with different values. Next, a reduction operation takes all the outputs that have the same key and combines them to produce a single output. Map/Reduce is a very simple abstraction that makes it possible to write programs that run over these terabytes of data with little effort.

The third homegrown application is Google's Global Work Queue, which is for scheduling. Global Work Queue works like old-time batch processing. It schedules queries into batch jobs and places them on pools of machines. The setup is optimized for running random computations over tons of data. Mostly, huge tasks are split into lots of small chunks, which provides even load balancing across machines. The idea is to have more tasks than machines so machines are never idle.

Google uses its massive architecture to learn from data. It analyzes the most common misspellings of queries and uses that information to power the function that suggests alternate spellings. The company is also applying machine learning to its system to give better results.
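The URL-counting example from the Map/Reduce description can be sketched in a few lines of Python. This is a single-process toy, not Google's framework: `run_map_reduce`, `map_url` and `reduce_counts` are names invented for the example, and it assumes the mapper emits a count of 1 for every URL it sees.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_url(url, contents):
    """Map step: turn one (url, contents) record into a (hostname, 1) pair."""
    yield urlparse(url).hostname, 1


def reduce_counts(hostname, values):
    """Reduce step: combine all values that share a key into a single output."""
    return hostname, sum(values)


def run_map_reduce(records, mapper, reducer):
    """Toy driver: run the mapper, group intermediate pairs by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(key, values) for key, values in groups.items())


pages = [
    ("http://example.com/a", "<html>page a</html>"),
    ("http://example.com/b", "<html>page b</html>"),
    ("http://example.org/", "<html>home</html>"),
]
counts = run_map_reduce(pages, map_url, reduce_counts)
```

The programmer only writes `map_url` and `reduce_counts`; in the real system the driver part is what handles distribution, fault tolerance and I/O scheduling across machines.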
In theory, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cuisine" is a good match even though it contains none of the query words. To do this, the system tries to cluster concepts into "reasonably coherent" sub-clusters that seem related. These clusters, some tiny and some huge, are named automatically. Then, when a query comes in, the system produces a probability score for the various clusters. This kind of machine learning has had little success in academic trials, because researchers didn't have enough data. With enough data, reasonably good answers can be obtained.

Google's redundancy theory works on a meta level. One literal meltdown, a fire at a data center in an undisclosed location, brought out six fire trucks but didn't crash the system.

__________________________________________________
Source: KERNEL (CSA Annual Magazine, BITS, Pilani)
Article by: D Sriram
Editor: Sridatta Chegu (Me)
Thanked Users: Dark Star (02-06-2007)

#2
Elite Member (1000+)
Join Date: May 2006
Location: /dev/had0
Age: 19
Posts: 1,577
Thanks: 98
Thanked 170 Times in 146 Posts
Rep Power: 43
Re: HOW GOOGLE WORKS ??
Awesome never bothered to know that btw thanks a lot BITS gem
__________________
My GNU/ Tux Blog : ~TuxEnclave~

#3
Junior Member (25+)
Re: HOW GOOGLE WORKS ??
see how it works

#4
Junior Member (25+)
Re: HOW GOOGLE WORKS ??
@Ajaykumar.. Thanks for the image..

#5
Elite Member (1000+)
Re: HOW GOOGLE WORKS ??
nice, explained in Image...

#6
Senior Member (500+)
Re: HOW GOOGLE WORKS ??
gr8,good job

#7
Regular Member (100+)
Join Date: May 2006
Posts: 225
Thanks: 0
Thanked 27 Times in 20 Posts
Rep Power: 5
Re: HOW GOOGLE WORKS ??
good post!

#8
Network Dude
Join Date: Nov 2005
Location: In the heaven of Technologies..
Posts: 73
Thanks: 2
Thanked 7 Times in 5 Posts
Rep Power: 4
Re: HOW GOOGLE WORKS ??
Thanx for the post dark lord and thanx ajaykumar for the image...
__________________
http://networksolutions4u.blogspot.com