Tips to find 500+ old pages


My company migrated from an old CMS to a new one and finally structured the website in a logical order. I'm now the guy who has to deal with roughly 500 pages that now return 404: for each of them I need to find the corresponding new page and add that mapping to a redirection file. Those 500 pages will come to me as a list of URLs from the old website, and for each one I'll have to find the new website's version of that content.

Of course, none of the URLs match (the old ones use IDs, the new ones use friendly URLs). But for a large majority of pages, the content is the same. Note that I have access to a staging version of the old site, and the new one is live and well indexed.

Knowing that, I'm wondering if someone could share some tricks to automate this, or at least ease the pain of manually crawling both sites to match up the old and new versions.

Thanks a lot

1 Answer

Answer by Pier-Alexandre Bouchard:

The following pseudocode should work:

FOR i = 0 TO oldUrlContent.size
    matchFound = false

    // Exact same content: compare the hash of the old page with each new page
    FOR j = 0 TO newUrlContent.size AND matchFound = false
        IF md5(oldUrlContent[i]) == md5(newUrlContent[j])
           WRITE to out.txt, oldUrl[i] + newUrl[j]
           matchFound = true
        END IF
    END FOR

    // Levenshtein distance: tolerate small edits to the content
    FOR j = 0 TO newUrlContent.size AND matchFound = false
        IF levenshtein(oldUrlContent[i], newUrlContent[j]) < ACCEPTABLE_LEVENSHTEIN
           WRITE to out.txt, oldUrl[i] + newUrl[j]
           WRITE to levenshtein.txt, oldUrl[i] + newUrl[j]
           matchFound = true
        END IF
    END FOR

    // Nothing found
    IF matchFound = false
        WRITE to error.txt, oldUrl[i]
    END IF
END FOR

Explanation:

Because you have the URL list and the content of your 500 pages for both CMSs, the process is not that hard.

  1. You should have a structure that maps URL to content for each CMS.

    Let's say we use two maps named oldUrlContent and newUrlContent, an output text file named out.txt, another named levenshtein.txt and a last one named error.txt.

    Just iterate recursively (if there are a lot of subfolders) over the folder containing your HTML files, open each file and store its content in the map (a minimal sketch of this step follows this list).

  2. Once you have every URL and its content in both maps, iterate over the oldUrlContent map, compute a hash (something like MD5) of each page's content, and compare that hash with the hash of every entry in newUrlContent (a sketch of this matching loop closes the answer).

    2.1 If there is a match, save it in out.txt.

    2.2 If there is no match, it is probably because the content of the page changed between your two CMSs. You could use the Levenshtein distance (it is not a foolproof solution) to decide whether two pages are the same: if the distance is small, treat it as a match and save it in both out.txt and levenshtein.txt.

    2.3 If there is still no match even with the Levenshtein distance, write the old URL to error.txt.
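
A minimal sketch of step 1 in Python, assuming each site has already been exported as .html files under a local folder (the folder names and the strip_html helper are only illustrative, not part of the original answer):

import os
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collects the text nodes of an HTML document and ignores the tags.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html):
    # Keep only the visible text, so that markup differences between the
    # two CMSs do not break the comparison.
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())

def load_pages(root_folder):
    # Walk root_folder recursively and map each file path (standing in for
    # the page URL) to its extracted text content.
    pages = {}
    for dirpath, _, filenames in os.walk(root_folder):
        for name in filenames:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                pages[path] = strip_html(f.read())
    return pages

oldUrlContent = load_pages("old_site_export")   # hypothetical folder names
newUrlContent = load_pages("new_site_export")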

At the end, review error.txt by hand and find those pages manually.

Because the content is the same for a large majority of pages, levenshtein.txt should not contain many URLs, but review those matches anyway, just to be sure.
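
And a sketch of the matching loop from step 2, reusing the two maps built above. The MD5 hashing and the Levenshtein function are written out so the example stays self-contained, ACCEPTABLE_LEVENSHTEIN is an arbitrary threshold you would have to tune on a few known page pairs, and the "old-url new-url" output format is just an example to adapt to whatever your redirection file needs:

import hashlib

def md5(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def levenshtein(a, b):
    # Plain two-row Levenshtein distance. Fine for a one-off batch job,
    # but slow on very long pages; a dedicated library would be faster.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

ACCEPTABLE_LEVENSHTEIN = 100  # made-up value, tune it for your pages

new_hashes = {url: md5(content) for url, content in newUrlContent.items()}

with open("out.txt", "w") as out, open("levenshtein.txt", "w") as lev, \
     open("error.txt", "w") as err:
    for old_url, old_content in oldUrlContent.items():
        old_hash = md5(old_content)

        # 2.1 exact match on the hashed content
        match = next((u for u, h in new_hashes.items() if h == old_hash), None)
        if match:
            out.write(old_url + " " + match + "\n")
            continue

        # 2.2 fall back to the Levenshtein distance
        match = next((u for u, c in newUrlContent.items()
                      if levenshtein(old_content, c) < ACCEPTABLE_LEVENSHTEIN), None)
        if match:
            out.write(old_url + " " + match + "\n")
            lev.write(old_url + " " + match + "\n")
            continue

        # 2.3 nothing found, review this URL by hand
        err.write(old_url + "\n")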