Google indexing server file path of website?

I recently finished building a website, and while trying to get the site indexed by Google I'm seeing some odd behaviour. I was hoping someone could shed some light on this, as my Google-fu has revealed nothing.

The server stack I'm running is made up of:

Debian 7 / Apache 2.2.22 / MySQL 5.5.31 / PHP 5.4.4-14

The problem I'm having is Google seems to want to index some odd URLs and is currently ranking them higher than actual legitimate pages. I will list the odd ones here:

www.mydomain.com/srv/www/mydomain?srv/www/mydomain
www.mydomain.com/srv/www?srv/www
www.mydomain.com/srv/www?srv/www/index

Webmaster Tools now tells me 'this is an important page blocked by robots.txt', because as soon as I found the issue I put 301 redirects into the .htaccess file to send these requests to the homepage and blocked the addresses in the robots.txt file.
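
For reference, what I added was roughly along these lines (just a sketch, not my exact rules; /srv/www is simply the common prefix of the odd URLs):

# .htaccess - placed before the other rules: 301 the odd /srv/www... requests to the homepage
RewriteRule ^srv/www http://www.mydomain.com/ [R=301,L]

# robots.txt - stop the odd paths being crawled
User-agent: *
Disallow: /srv/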

Also, I have submitted an XML sitemap with all the correct URLs to Webmaster Tools.
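
The sitemap itself is just the standard format, something like this (abbreviated):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.mydomain.com/</loc>
    </url>
    <!-- one <url> entry per legitimate page -->
</urlset>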

All the website files are stored in:

/srv/www/mydomain/public_html/

Now, I think this has something to do with the way I've set up my .htaccess mod_rewrite rules, but I can't seem to get my head around what is doing it. It could also be my Apache vhost configuration. I will include both below:

.htaccess mod_rewrite rules:

<IfModule mod_rewrite.c>
    RewriteEngine on

# Redirect requests for all non-canonical domains
# to same page in www.mydomain.com
    RewriteCond %{HTTP_HOST} .
    RewriteCond %{HTTP_HOST} !^www\.mydomain\.com$
    RewriteRule (.*) http://www.mydomain.com/$1 [R=301,L]


# Remove .php file extension
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteCond %{REQUEST_FILENAME}\.php -f
    RewriteRule ^(.*)$ $1.php

# Internally rewrite anything that isn't a real file or directory to index
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^ index [L]

# Remove 'index' from URL
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\s(.*)/index [NC]
    RewriteRule ^ / [R=301,L]

</IfModule>

Apache Vhost:

<VirtualHost *:80>
    ServerAdmin [email protected]
    ServerName mydomain.com
    ServerAlias www.mydomain.com
    DocumentRoot /srv/www/mydomain/public_html/
    ErrorLog /srv/www/mydomain/logs/error.log
    CustomLog /srv/www/mydomain/logs/access.log combined
</VirtualHost>

Also, in case it's relevant, my PHP page handling is:

# Declare the Page array
$Page = array();

# Get the requested path and trim leading slashes
$Page['Path'] = ltrim($_SERVER['REQUEST_URI'], '/');

# Check for query string
if (strpos($Page['Path'], '?') !== false) {

    # Separate path and query string
    $Page['Query'] = explode('?', $Page['Path'])[1];
    $Page['Path']  = explode('?', $Page['Path'])[0];
}

# Check a path was supplied
if ($Page['Path'] != '') {

    # Select page data from the directory
    $Page['Data'] = SelectData('Directory', 'Path', '=', $Page['Path']);

    # Check a page was returned
    if ($Page['Data'] != null) {

        # switch through allowed page types
        switch ($Page['Data']['Type']) {

            # There are a bunch of switch cases here that
            # Determine what page to serve based on the
            # page type stored in the directory

        }

    # When no page is returned
    } else {

        # 404
        $Page = Build404ErrorPage($Page);
    }

# When no path supplied
} else {

    # Build the Home page
    $Page = BuildHomePage($Page);
}

Can anyone see anything here that would be causing this?

1 Answer

Answer by damndaewoo (accepted):

After much research I have concluded that my problems came about due to a combination of Google attempting to index the website before it was completed and some incomplete page handling scripts. My mistake was not blocking all robots while in development.
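
For anyone else in the same position, a robots.txt in the document root that blocks everything is enough while the site is being built, e.g.:

# robots.txt - block all crawlers during development
User-agent: *
Disallow: /

Just remember to remove or relax it before launch, otherwise nothing will get indexed at all.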

The solution to the problem was this:

  1. Submit an XML sitemap to Google Webmaster Tools containing all the valid URLs

  2. 301-redirect all the odd URLs to the correct homepage

  3. Request removal of the incorrect URLs using Google Webmaster Tools

  4. Block Googlebot's access to the incorrect URLs using a robots.txt file

  5. Wait for Google to re-crawl the site and correctly index it.

Waiting for Googlebot to correct the issues was the hardest part.