How the Google search engine works

  • Crawling.
    Crawling is the process by which Googlebot discovers new and
    updated pages to be added to the Google index. Google uses a huge
    set of computers to fetch (or “crawl”) billions of pages on the
    web. The program that does the fetching is called Googlebot (also
    known as a robot, bot, or spider).
  • Indexing.
    Googlebot processes each of the pages it crawls in order to compile a massive index of all the words it sees and their location on each page. In addition, Googlebot processes information included in key content tags and attributes, such as Title tags and ALT attributes.
  • Serving results/Ranking.
    When you enter a query, machines search the index for matching pages. Relevancy is determined by over 200 factors, one of which is PageRank.

What can you do to help Google crawl and index your site?


Title and Meta
Titles are important and very visible to users. A descriptive and accurate title helps people choose your page over others. Keep it to a maximum of about 68 characters.

<title>Your title goes here</title>

The meta description is an HTML tag containing a roughly 150-character snippet that summarizes a page’s content. Search engines show the meta description in search results mostly when the searched-for phrase is contained in the description.
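
For example, a meta description is declared with a <meta> tag in the page’s <head> (the wording of the content attribute below is just a placeholder):

<meta name="description" content="A short summary of what this page is about.">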

Alt tags
ALT tags provide a text alternative to an image.
They are a way to “describe” an image to those who cannot see it. The most important function of an ALT tag is to explain to a blind user, or to Googlebot, what an image is displaying.

  • Titles are page specific and should be accurate and descriptive
  • No two titles on your website should be the same
  • ALT tags should be used to accurately describe the images on your page

    <img src="CampLogo.gif" alt="Camp 2011 logo">

robots.txt file

What is a robots.txt file?
The first thing Googlebot looks at when it visits your site is the robots.txt file. The robots.txt file is a simple text file placed on your web server which tells web crawlers like Googlebot whether they should access your site’s directories and files or not.

Where should I place the robots.txt file?

www.yourwebsite.com/robots.txt

Creating a robots.txt file

The User-agent line can target all crawlers (*) or a specific crawler, such as Googlebot:

User-agent: *
Disallow: /photos
Allow: /photos/mycar.jpg

How a robots.txt file can help

  • You have content you want blocked from search engines
  • You are using paid links or advertisements that need special instructions for robots
  • You want to fine-tune access to your site from reputable robots
  • You are developing a site that is live, but you do not want
    search engines to index it yet (see the example below)
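
For the last case, a minimal robots.txt that keeps all crawlers out of a site that is still in development looks like this:

User-agent: *
Disallow: /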

Three ways to add a robots.txt file to your Django project

1. The one-liner
Add this to your urls.py:

from django.conf.urls import url
from django.http import HttpResponse

urlpatterns = [
    url(r'^robots\.txt$', lambda r: HttpResponse(
        "User-agent: *\nDisallow: /", content_type="text/plain")),
]

Advantages: it is a simple one-liner that disallows all bots, with no extra files to create; it is as simple as it gets.
Disadvantages: it does not scale, and urls.py is not the right place for content of any kind.

2. Direct to template
Just drop a robots.txt file into your main templates directory:

from django.conf.urls import url
from django.views.generic import TemplateView

urlpatterns = [
    url(r'^robots\.txt$', TemplateView.as_view(
        template_name='staticpages/robots.txt', content_type='text/plain')),
]

Advantages: simplicity, and if you already have a robots.txt file you want to reuse, there’s no overhead for that.
Disadvantages: If your robots file changes somewhat frequently, you need to push changes to your web server every time.

3. The django-robots app
You can install the third-party django-robots app and add it to your INSTALLED_APPS (see the sketch below).
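
A minimal setup sketch, assuming the django-robots package is installed (details such as required apps and URL wiring may vary between versions):

# settings.py
INSTALLED_APPS = [
    # ...
    'django.contrib.sites',  # django-robots relies on the sites framework
    'robots',
]

# urls.py
from django.conf.urls import include, url

urlpatterns = [
    url(r'^robots\.txt', include('robots.urls')),
]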

Advantages: useful if you have a lot of rules, or if you need a site admin to be able to change them without pushing changes to the web server.
Disadvantages: For small projects, this would be overkill.

Sitemap file

Why do I need a sitemap?
A sitemap is a file where you can list the web pages of your site to tell Google and other search engines about the organization of your site content. It makes it easier for Google to discover the pages on your site. Search engine web crawlers like Googlebot read this file to crawl your site more intelligently.

How to build a sitemap:

  • Decide which pages on your site should be crawled by Google, and
    determine the canonical version of each page.

  • Decide which sitemap format you want to use: XML, RSS, mRSS,
    Atom 1.0, or Text.

  • Test your sitemap using the Search Console Sitemaps testing tool.

  • Make your sitemap available to Google by adding it to your
    robots.txt file and submitting it to Search Console, for example:

    Sitemap: http://example.com/sitemap.xml

How to add a sitemap to Django?

URL

from django.conf.urls import url
from django.contrib.sitemaps.views import sitemap

# 'sitemaps' is the dictionary described below.
urlpatterns = [
    url(r'^sitemap\.xml$', sitemap, {'sitemaps': sitemaps},
        name='django.contrib.sitemaps.views.sitemap'),
]

This tells Django to build a sitemap when a client accesses /sitemap.xml.
The name of the sitemap file is not important, but the location is. Search engines will only index links in your sitemap for the current URL level and below.

The sitemap view takes an extra, required argument: {'sitemaps': sitemaps}. sitemaps should be a dictionary that maps a short section label (e.g., blog or news) to its Sitemap class (e.g., BlogSitemap or NewsSitemap).
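
For example, with the TodoSitemap class shown in the next snippet, the dictionary could look like this (the 'todo' label is just an illustrative choice):

sitemaps = {
    'todo': TodoSitemap,
}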

View

from django.contrib.sitemaps import Sitemap
from main.models import Item

class TodoSitemap(Sitemap):
    changefreq = "weekly"
    priority = 0.5

    def items(self):
        return Item.objects.all()
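
By default the sitemap framework builds each entry’s URL by calling get_absolute_url() on the objects returned by items(), so the Item model must define it. A minimal sketch (the title field and the URL pattern are assumptions for illustration):

from django.db import models

class Item(models.Model):
    title = models.CharField(max_length=200)

    def get_absolute_url(self):
        # Called by Sitemap.location() for every object returned by items().
        return '/items/%d/' % self.pk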

What is If-Modified-Since?

  • The If-Modified-Since header is an HTTP header that is sent to a
    server as part of a conditional request.
  • If the content has not changed, the server responds with only the
    headers and a 304 status code.
  • If the content has changed, the server responds with a 200 status
    code and the entire requested document/resource.

The server’s response to the If-Modified-Since header essentially tells Googlebot one of two things about a page on your website:

  • This webpage has not changed, so there is no need to download it
    again.
    OR
  • This webpage has changed, so download it again because there is
    new information.

How to add If-Modified-Since to Django?

For each page (response) that Django sends back from a view, it can provide two HTTP headers: the ETag header and the Last-Modified header.
You can create functions that rapidly compute the last-modified time for a page, without having to run the full view, by using the django.views.decorators.http.condition decorator.

Model:

import datetime
from django.db import models

class Blog(models.Model):
    ...

class Entry(models.Model):
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
    published = models.DateTimeField(default=datetime.datetime.now)

If the front page, displaying the latest blog entries, only changes when you add a new blog entry, you can compute the last-modified time very quickly. You need the latest published date among all the entries associated with that blog. One way to do this would be:

HELPER FUNCTION:

def latest_entry(request, blog_id):
    # Return the publication date of the most recent entry for this blog.
    return Entry.objects.filter(blog=blog_id).latest("published").published

You can then use this function to provide early detection of an unchanged page for your front page view:

FRONTPAGE VIEW:

from django.views.decorators.http import condition

@condition(last_modified_func=latest_entry)
def front_page(request, blog_id):
    ...

Let Google know your site is updated

Pinging Google
You may want to “ping” Google when your sitemap changes, to let it know to reindex your site.
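
One common pattern, sketched under the assumption that the sitemaps framework is installed, is to ping Google from the model’s save() method (the Entry model is the one used above):

from django.contrib.sitemaps import ping_google
from django.db import models

class Entry(models.Model):
    # ... fields as in the model above ...

    def save(self, *args, **kwargs):
        super(Entry, self).save(*args, **kwargs)
        try:
            # Tell Google the sitemap has changed; ignore failures so a
            # broken network connection never prevents saving the entry.
            ping_google()
        except Exception:
            pass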

Pinging Google via manage.py

Once the sitemaps application is added to your project, you may also ping Google using the ping_google management command:

django-admin ping_google [sitemap_url]
python manage.py ping_google /sitemap.xml

This tutorial is provided by ClickAces; you can find more info at ClickAces.com.