How Search Engines Process Links

Ever wondered how search engines crawl, analyze, index, and rank pages? Columnist Jenny Halasz has created a helpful primer on the link graph to answer these questions.

Have you ever wondered why 404s, rel=canonicals, noindex, nofollow, and robots.txt work the way they do? Or have you never been quite clear on how they all work? To help you understand, here is a very basic interpretation of how search engines crawl pages and add links to the link graph.

The Simple Crawl

The search engine crawler (let’s make it a spider for fun) visits a site. The first thing it collects is the robots.txt file.
Let’s assume that file either doesn’t exist or says it’s okay to crawl the whole site. The crawler collects information about every page it visits and feeds it back into a database. Strictly speaking, that database is a crawl scheduling system that de-duplicates pages and orders them by priority for indexing later.
[Figure: Basic crawl process]
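The scheduling step described above can be sketched as a small priority queue. This is only a toy model: the class name `CrawlFrontier`, the numeric priorities, and the example.com URLs are assumptions made for illustration, not a description of how any real search engine schedules its crawl.

```python
import heapq


class CrawlFrontier:
    """Toy crawl scheduler: de-duplicates URLs and pops them by priority."""

    def __init__(self):
        self._heap = []     # entries are (negated priority, insertion order, url)
        self._seen = set()  # every URL that has ever been scheduled
        self._counter = 0   # tie-breaker so equal priorities pop in insertion order

    def add(self, url, priority):
        """Schedule a URL once; repeat submissions are ignored."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return the highest-priority URL still waiting."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = CrawlFrontier()
frontier.add("https://example.com/", 1.0)
frontier.add("https://example.com/popular", 5.0)
frontier.add("https://example.com/", 9.0)  # duplicate, ignored
next_url = frontier.pop()                  # the higher-priority page comes out first
```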

While it’s there, it collects a list of all the pages each page links to. If they’re internal links, the crawler will probably follow them to other pages. If they’re external, they get put into a database for later.
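As a rough sketch of that link-collection step, the snippet below pulls the `href` of every anchor out of a page and sorts the results into internal and external lists. The names `LinkCollector` and `collect_links` are hypothetical, and comparing host names is a simplifying assumption about what counts as "internal".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects anchor targets and sorts them into internal and external."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.internal = []  # same host: candidates to follow during this crawl
        self.external = []  # other hosts: stored for later processing

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        absolute = urljoin(self.page_url, href)
        same_host = urlparse(absolute).netloc == urlparse(self.page_url).netloc
        (self.internal if same_host else self.external).append(absolute)


def collect_links(page_url, html):
    """Return (internal_links, external_links) for one page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.internal, collector.external
```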

Processing Links

Later on, when the link graph gets processed, the search engine pulls all those links out of the database and connects them, assigning relative values to them. The values may be positive, or they may be negative. Let’s imagine, for example, that one of the pages is spamming. If that page is linking to other pages, it may be passing some bad link value on to those pages. Let’s say S=Spammer, and G=Good:
[Figure: link graph in which each page receives links labeled S or G]

The page on the top right has more G’s than S’s. Therefore, it would earn a fairly good score. A page with only G’s would earn a better score. If the S’s outweighed the G’s, the page would earn a fairly poor score. Add the complication that some S’s and some G’s are worth more than others, and you have a very simplified view of how the link graph works.
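To make the S and G arithmetic concrete, here is a minimal sketch that adds the weight of each good link and subtracts the weight of each spammy one. The function name `link_score` and the weights are invented for illustration; real search engines do not publish their scoring, so treat this as a toy model only.

```python
def link_score(inbound_links):
    """Sum weighted inbound links: 'G' links add value, 'S' links subtract it.

    inbound_links is a list of (label, weight) pairs, where label is 'G' or 'S'.
    """
    total = 0.0
    for label, weight in inbound_links:
        if label == "G":
            total += weight
        elif label == "S":
            total -= weight
        else:
            raise ValueError(f"unknown link label: {label!r}")
    return total


mostly_good = link_score([("G", 1.0), ("G", 1.0), ("S", 1.0)])  # more G's than S's
all_good = link_score([("G", 1.0), ("G", 1.0), ("G", 1.0)])     # only G's
heavy_spam = link_score([("G", 1.0), ("S", 3.0)])               # one S outweighs the G
```

With these made-up weights, the all-G page scores highest, the mostly-G page stays positive, and the page whose single S link is worth more than its G link ends up negative.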

Blocking Pages With Robots.txt

Let’s go back to that original example. Suppose the robots.txt file had told the search engine not to access one of those pages.

That means that while the search engine was crawling through the pages and making lists of links, it wouldn’t collect any data about the page that robots.txt blocks.
Now, go back to that super simple link graph example. Let’s suppose that the page on the top right was the one blocked by robots.txt:
[Figure: the same link graph with the top-right page blocked by robots.txt]

The search engine is still going to take all of the links to that page and count them. It won’t be able to see what pages that page links to, but it will be able to add link value metrics for the page, which affects the domain as a whole.
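The point that a blocked page still accumulates inbound links can be sketched with Python’s standard `urllib.robotparser`. The helper name, the `ExampleBot` user agent, and the URLs are assumptions for illustration.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser


def inbound_counts_and_fetchable(robots_txt, links, user_agent="ExampleBot"):
    """Count inbound links per target and report which targets may be fetched.

    links is a list of (source_url, target_url) pairs found on crawlable pages.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Inbound links are counted whether or not the target itself can be crawled.
    inbound = Counter(target for _, target in links)
    fetchable = {url: parser.can_fetch(user_agent, url) for url in inbound}
    return inbound, fetchable


robots_txt = "User-agent: *\nDisallow: /blocked.html\n"
links = [
    ("https://example.com/a", "https://example.com/blocked.html"),
    ("https://example.com/b", "https://example.com/blocked.html"),
    ("https://example.com/a", "https://example.com/open.html"),
]
inbound, fetchable = inbound_counts_and_fetchable(robots_txt, links)
```

Here the blocked page still shows two inbound links, but because it cannot be fetched, the crawler never learns what it links out to.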

Using 404 Or 410 To Remove Pages

Next, let’s assume that instead of blocking that page with robots.txt, we simply removed it. So the search engine would try to access it, but get a clear message that it’s not there anymore.

This means that when the link graph is processed, links to that page no longer count. They are stored for later use in case that page comes back.
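A minimal sketch of that behavior, assuming the crawler has recorded an HTTP status for each target URL: links to pages that returned 404 or 410 are set aside rather than counted. The function name and the "parked" list are hypothetical.

```python
def split_links_by_status(links, status_by_url):
    """Separate links that count now from links parked because the target is gone.

    links is a list of (source_url, target_url) pairs; status_by_url maps a
    target URL to the last HTTP status code the crawler saw for it.
    """
    active, parked = [], []
    for source, target in links:
        if status_by_url.get(target) in (404, 410):
            parked.append((source, target))  # kept in case the page returns
        else:
            active.append((source, target))
    return active, parked


links = [
    ("https://example.com/a", "https://example.com/gone"),
    ("https://example.com/a", "https://example.com/live"),
]
statuses = {"https://example.com/gone": 410, "https://example.com/live": 200}
active, parked = split_links_by_status(links, statuses)
```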

At some other point (and likely by a different set of servers!), priority pages that are crawled get assigned to an index.

How The Index Works

The index identifies words and elements on a page that match with words and elements in the database. Do a search for “blue widgets.” The search engine uses the database to find pages that are related to blue, widgets, and blue widgets. If the search engine also considers widget (singular) and cornflower (a type of blue) to be synonyms, it may evaluate pages with those words on the page as well.
The search engine uses its algorithm to determine which pages in the index have those words assigned to them, evaluates links pointing to the page and the domain, and processes dozens of other known and unknown metrics to arrive at a value. If the site is being filtered for poor behavior by an algorithm such as Panda or Penguin, that is also taken into account. The overall value then determines where in the results the page will appear.
This is further complicated by things webmasters might do to manipulate values. For example, if two pages are very similar, a webmaster may decide to use rel=canonical to signal the search engine that only one of those pages has value. This is not definitive, though. If the “cornflower widget” page is rel=canonical-ed to the “blue widgets” page, but the cornflower widget page has more valuable links pointing to it, the search engine may choose to use the cornflower widget page instead. If the canonical is accepted, the values of both elements on the pages and links pointing to the pages are combined.
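The idea that rel=canonical is a hint rather than a command can be illustrated with a deliberately simple rule. Everything here is an assumption made for the example: the function name, the single "link value" number per page, and the `override_ratio` threshold. Search engines do not document how they actually make this decision.

```python
def resolve_canonical(link_value, source, hinted_target, override_ratio=2.0):
    """Decide whether to honor a rel=canonical hint from source to hinted_target.

    Returns (chosen_url, value). The override rule is invented for illustration:
    the hint is ignored when the source page has far more link value than the
    page it points to.
    """
    if link_value[source] > override_ratio * link_value[hinted_target]:
        # Hint rejected: the source page is used on its own.
        return source, link_value[source]
    # Hint accepted: the value of both pages is combined on the target.
    return hinted_target, link_value[source] + link_value[hinted_target]


values = {"/cornflower-widgets": 10, "/blue-widgets": 3}
chosen, value = resolve_canonical(values, "/cornflower-widgets", "/blue-widgets")
```

In this made-up case the cornflower page has much stronger links, so the hint is ignored and the cornflower page is chosen.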

Removing Pages With Noindex

Noindex is more definitive. It works similarly to robots.txt except that instead of being prevented from crawling that page, the search engine is able to access it, but then is told to go away. The search engine will still collect links on the page to add to the database (unless a directive on the page also indicates not to follow them, i.e. nofollow), and it will still assign value to links pointing to that page.

However, it will not consolidate value with any other pages, and it will not stop value from flowing through the page. All noindex does is request the search engine not assign the page to its index.
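Here is a small sketch of how a crawler might read those directives from a page’s robots meta tag. The names `RobotsMetaParser` and `page_policy` are hypothetical, and the sketch only looks at `<meta name="robots">`, ignoring other ways the same directives can be delivered.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gathers the directives listed in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def page_policy(html):
    """Report whether the page may be indexed and whether its links are collected."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index_page": "noindex" not in parser.directives,
        "collect_outlinks": "nofollow" not in parser.directives,
    }


policy = page_policy('<head><meta name="robots" content="noindex"></head>')
```

With noindex alone, the page stays out of the index while its outgoing links are still collected, which matches the behavior described above.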
Therefore, there is only one definitive way to stop the flow of link value at the destination: taking the page away completely (404 or 410 status). 410 is more definitive than 404, but both will cause the page to be dropped out of the index eventually. There are multiple other ways to stop link flow at the origin of the link, but webmasters seldom have control over other sites, only their own.

Google Now Supports Crawling & Indexing Locale-Adaptive Web Pages

Google adds support for web pages that dynamically change their content based on IP origin or language settings.

Google announced that its crawlers now support locale-adaptive web pages: pages that change their content based on the user’s country of origin and/or language settings.
For example, some web pages automatically change their content depending on whether you are visiting from France or the United States, or on whether your language settings are set to French or U.S. English. Previously, Google didn’t handle that well, or actually at all; it would just see the U.S.-English version of the web page.
Now Google will be able to handle such content by sending Googlebot from different IPs across the world, as well as letting it set language settings. Here are the two methods:

Geo-distributed crawling where Googlebot would start to use IP addresses that appear to be coming from outside the USA, in addition to the current IP addresses that appear to be from the USA that Googlebot currently uses.
Language-dependent crawling where Googlebot would start to crawl with an Accept-Language HTTP header in the request.
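To show what a locale-adaptive page does with the Accept-Language header, here is a simplified sketch of server-side language selection. The function name `pick_language` and the set of available languages are assumptions, and the parsing is deliberately minimal (it ignores the `*` wildcard, for instance).

```python
def pick_language(accept_language, available, default="en"):
    """Choose the best available language for an Accept-Language header value."""
    candidates = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        candidates.append((-quality, position, tag))
    # Highest quality first; ties keep the order given in the header.
    for neg_quality, _, tag in sorted(candidates):
        if neg_quality == 0:  # q=0 marks a language as not acceptable
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # fall back from "fr-fr" to "fr"
        if primary in available:
            return primary
    return default


language = pick_language("fr-FR,fr;q=0.9,en;q=0.5", {"en", "fr"})
```

A crawler that sends no Accept-Language header, or one the site does not serve, would get the default version in this sketch, which is consistent with Google previously seeing only the U.S.-English page.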

Google still strongly recommends using different URLs or TLDs for content that is specific to different countries or languages. Google said:

These new configurations do not alter our recommendation to use separate URLs with rel=alternate hreflang annotations for each locale. We continue to support and recommend using separate URLs as they are still the best way for users to interact and share your content, and also to maximize indexing and better ranking of all variants of your content.

So if you are using the rel=alternate hreflang annotations with separate URLs, keep doing so.
This new support is meant to solve a problem Google is having with understanding sites that deploy locale-adaptive techniques. Google said, “new crawling configurations are enabled automatically for pages we detect to be locale-adaptive, you may notice changes in how we crawl and show your site in Google search results without you altering your CMS or server settings.”

Wednesday, 27 May 2015
