My company migrated from an old CMS to a new one and finally structured the website in a logical order. I'm now the guy who has to take about 500 pages that now return a 404, find their new equivalents, and put that mapping in a redirection file. Those 500 pages will come to me as a list of URLs from the old website, and for each one I'll have to find the new website's version of that content.
Of course, none of the URLs match (the old ones use IDs, the new ones are friendly URLs). But for the large majority, the content is the same. Note that I have access to a staging version of the old site, and the new one is live and well indexed by search engines.
Knowing that, I'm wondering if someone could suggest some tricks to automate, or at least ease the pain of, manually crawling both sites to find the matching pages.
Thanks a lot
The following pseudo code should work:
Explanation:
Because you have the URL list and the content of your 500 pages for both CMSs, the process is not that hard.
You need a structure that maps each URL to its content, one for each CMS.
Let's say we use two maps named OldUrlContent and newUrlContent, an output text file named out.txt, another one named levenshtein.txt, and another one named error.txt.
1. Iterate recursively (if there are a lot of subfolders) over the folder of your HTML files, open each file, and store its content in your map.
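
A minimal Python sketch of that loading step, assuming each site has been saved to a local folder of `.html` files. The function name `load_pages` and the choice of the relative file path as the map key are illustrative assumptions; in practice you would translate that path back to the real URL.

```python
from pathlib import Path


def load_pages(root):
    """Map each HTML file's path (relative to root) to its text content.

    Assumption: the relative path stands in for the page URL.
    """
    pages = {}
    # rglob walks every subfolder, so nested directories are covered.
    for path in Path(root).rglob("*.html"):
        key = path.relative_to(root).as_posix()
        # errors="replace" keeps going if a file is not valid UTF-8.
        pages[key] = path.read_text(encoding="utf-8", errors="replace")
    return pages
```

Calling it once per site, for example `load_pages("old_site")` and `load_pages("new_site")` (both folder names are placeholders), would give the two maps.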
2. Once you have every URL and its content in both maps, iterate over the OldUrlContent map, take a hash (something like MD5) of each page's content, and compare it with the hash of every page's content in newUrlContent.
2.1 If there is a match, save it in out.txt.
2.2 If there is no match, it is probably because the content of the page changed between your two CMSs. You could use the Levenshtein distance (it is not a foolproof solution) to decide whether two pages have the same content. If the distance is small, you can treat the pair as a match and save it in both out.txt and levenshtein.txt.
2.3 If there is no match even with the Levenshtein distance, keep the URL in error.txt.
3. At the end, review error.txt by hand and find those pages manually.
Because the content is the same for the large majority of pages, the levenshtein.txt file should not contain many URLs, but review those matches anyway, just to be sure.
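
The matching steps could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the `max_ratio` threshold (edit distance divided by the longer page's length), and the tab-separated output format are all made up for the example, and the two arguments are plain dicts mapping URL to content.

```python
import hashlib
from pathlib import Path


def md5_of(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def levenshtein(a, b):
    """Edit distance using a two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_pages(old_pages, new_pages, max_ratio=0.1):
    """Return (exact, fuzzy, unmatched) for the old URLs.

    max_ratio is a hypothetical cut-off that needs tuning on real data.
    """
    # If two new pages share identical content, the last one wins here.
    new_by_hash = {md5_of(c): url for url, c in new_pages.items()}
    exact, fuzzy, unmatched = {}, {}, []
    for old_url, old_content in old_pages.items():
        new_url = new_by_hash.get(md5_of(old_content))
        if new_url is not None:
            exact[old_url] = new_url          # step 2.1
            continue
        best_url, best_ratio = None, None
        for cand_url, cand_content in new_pages.items():
            longest = max(len(old_content), len(cand_content)) or 1
            ratio = levenshtein(old_content, cand_content) / longest
            if best_ratio is None or ratio < best_ratio:
                best_url, best_ratio = cand_url, ratio
        if best_url is not None and best_ratio <= max_ratio:
            fuzzy[old_url] = best_url         # step 2.2
        else:
            unmatched.append(old_url)         # step 2.3
    return exact, fuzzy, unmatched


def write_reports(exact, fuzzy, unmatched, out_dir="."):
    """Write out.txt, levenshtein.txt and error.txt into out_dir."""
    out = Path(out_dir)
    pairs = {**exact, **fuzzy}
    out.joinpath("out.txt").write_text(
        "".join(f"{o}\t{n}\n" for o, n in pairs.items()), encoding="utf-8")
    out.joinpath("levenshtein.txt").write_text(
        "".join(f"{o}\t{n}\n" for o, n in fuzzy.items()), encoding="utf-8")
    out.joinpath("error.txt").write_text(
        "".join(f"{o}\n" for o in unmatched), encoding="utf-8")
```

The pure-Python `levenshtein` above is quadratic in the page length, so comparing 500 full pages against every candidate will be slow; running it on the extracted main text rather than the whole HTML, or swapping in a compiled implementation, would likely be needed in practice.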