I recently finished building a website, and while trying to get the site indexed by Google I've run into some strange behaviour that I'm hoping someone can shed some light on, as my Google-fu has revealed nothing.
The server stack I'm running is made up of:
Debian 7 / Apache 2.2.22 / MySQL 5.5.31 / PHP 5.4.4-14
The problem I'm having is Google seems to want to index some odd URLs and is currently ranking them higher than actual legitimate pages. I will list the odd ones here:
www.mydomain.com/srv/www/mydomain?srv/www/mydomain
www.mydomain.com/srv/www?srv/www
www.mydomain.com/srv/www?srv/www/index
Webmaster Tools now tells me 'this is an important page blocked by robots.txt', because as soon as I found the issue I put some 301 redirects into the .htaccess file to send these requests to the homepage, and blocked the addresses in the robots.txt file.
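The stop-gap redirect was along these lines, added near the top of the .htaccess file (a rough sketch based on the odd URLs above, not my exact rule; the trailing '?' discards the original query string, since the QSD flag only exists in Apache 2.4):
# Stop-gap: 301 the odd /srv/www... requests to the homepage
# (the trailing ? drops the query string on Apache 2.2)
RewriteRule ^srv/www(/.*)?$ /? [R=301,L]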
Also, I have submitted an XML sitemap with all the correct URLs to webmaster tools.
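The sitemap itself is nothing unusual; for anyone unfamiliar, a minimal sitemap has roughly this shape (placeholder URL):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mydomain.com/</loc>
  </url>
</urlset>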
All the website files are stored in:
/srv/www/mydomain/public_html/
Now, I think this has something to do with the way I've set up my .htaccess mod_rewrite rules, but I can't get my head around what is doing it. It could also be my Apache vhost configuration. I will include both below:
.htaccess mod_rewrite rules:
<IfModule mod_rewrite.c>
RewriteEngine on
# Redirect requests for all non-canonical domains
# to same page in www.mydomain.com
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com$
RewriteRule (.*) http://www.mydomain.com/$1 [R=301,L]
# Remove .php file extension
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}\.php -f
RewriteRule ^(.*)$ $1.php
# redirect all traffic to index
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ index [L]
# Remove 'index' from URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\s(.*)/index [NC]
RewriteRule ^ / [R=301,L]
</IfModule>
Apache Vhost:
<VirtualHost *:80>
ServerAdmin [email protected]
ServerName mydomain.com
ServerAlias www.mydomain.com
DocumentRoot /srv/www/mydomain/public_html/
ErrorLog /srv/www/mydomain/logs/error.log
CustomLog /srv/www/mydomain/logs/access.log combined
</VirtualHost>
Also, in case it's relevant, my PHP page-handling code is:
# Declare the Page array
$Page = array();

# Get the requested path and trim the leading slash
$Page['Path'] = ltrim($_SERVER['REQUEST_URI'], '/');

# Check for a query string
if (strpos($Page['Path'], '?') !== false) {
    # Separate the path from the query string
    list($Page['Path'], $Page['Query']) = explode('?', $Page['Path'], 2);
}

# Check a path was supplied
if ($Page['Path'] != '') {
    # Select the page data from the directory
    $Page['Data'] = SelectData('Directory', 'Path', '=', $Page['Path']);
    # Check a page was returned
    if ($Page['Data'] != null) {
        # Switch through the allowed page types
        switch ($Page['Data']['Type']) {
            # There are a bunch of switch cases here that
            # determine what page to serve based on the
            # page type stored in the directory
        }
    # When no page is returned
    } else {
        # 404
        $Page = Build404ErrorPage($Page);
    }
# When no path is supplied
} else {
    # Build the Home page
    $Page = BuildHomePage($Page);
}
Can anyone see anything here that would be causing this?
After much research, I have concluded that my problems came about through a combination of Google attempting to index the website before it was completed and some incomplete page-handling scripts. My mistake was not blocking all robots during development.
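For anyone in the same position: blocking everything during development only takes a two-line robots.txt, e.g.:
# Block all crawlers while the site is in development
User-agent: *
Disallow: /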
The solution to the problem was this:
Submit an XML sitemap to Google Webmaster Tools with all the valid URLs.
301-redirect all the odd URLs to the correct homepage.
Request removal of the incorrect URLs using Google Webmaster Tools.
Block Googlebot's access to the incorrect URLs using a robots.txt file (a sketch follows this list).
Wait for Google to re-crawl the site and index it correctly.
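The robots.txt block for that step was along these lines (a sketch based on the odd URLs above, not my exact file):
# Keep crawlers away from the incorrect /srv/... URLs
User-agent: *
Disallow: /srv/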
Waiting for Googlebot to correct the issues was the hardest part.