I need to migrate a digital repository to a new platform, but lack access to the old platform so I have resorted to retrieving the objects over the web.
Some objects contain other objects. For most of these compound objects, identifying and retrieving the components and their metadata is straightforward. For some PDF files, however, the components referred to appear to be individual pages within a single file rather than separate files.
For example, http://content.wwu.edu/cdm4/document.php?CISOROOT=/wfront&CISOPTR=2711 gives me an object with 4 pages. http://content.wwu.edu/cgi-bin/showfile.exe?CISOROOT=/wfront&CISOPTR=2711&CISOMODE=print lets me retrieve the entire document. http://content.wwu.edu/cgi-bin/showfile.exe?CISOROOT=/wfront&CISOPTR=2711 retrieves an XML document listing the identifiers of the component pages, but when I try to curl those identifiers, I just get zero-length documents. Using the same method on non-PDF objects, I get actual files -- this is why I believe the components are individual pages rather than separate files.
How can I retrieve the individual pages? I must store them as individual objects in the new platform. Thanks
Bottom line: this appears to be possible only if there is something on the server that will extract the individual pages for you.
When I turned on Wireshark, I found that actions in the user interface were invoking a server-side PDF application using the syntax:
http://content.wwu.edu/cgi-bin/showpdf.exe?CISOROOT=/wfront&CISOPTR=2711&CISOPAGE=3
where 2711 is the object's identifier and 3 is the page number in question. Further experimentation revealed that I could pull up any page of any PDF I could identify.
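Once the syntax is known, retrieval can be scripted. Here is a minimal sketch using Python's standard library; the page count (4), the `fetch_pages` helper, and the output filenames are my assumptions for illustration, not part of the server's API:

```python
# Sketch: bulk-retrieve individual PDF pages via the server-side
# showpdf.exe endpoint observed in Wireshark.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://content.wwu.edu/cgi-bin/showpdf.exe"

def page_url(root: str, ptr: int, page: int) -> str:
    """Build a per-page URL using the CISOROOT/CISOPTR/CISOPAGE syntax."""
    return BASE + "?" + urlencode(
        {"CISOROOT": root, "CISOPTR": ptr, "CISOPAGE": page}
    )

def fetch_pages(root: str, ptr: int, n_pages: int) -> None:
    """Download each page of one object to its own PDF file (assumed naming)."""
    for page in range(1, n_pages + 1):
        with urlopen(page_url(root, ptr, page)) as resp:
            data = resp.read()
        with open(f"{ptr}_page{page}.pdf", "wb") as out:
            out.write(data)

if __name__ == "__main__":
    # Object 2711 has 4 pages per its document.php view.
    fetch_pages("/wfront", 2711, 4)
```

Note that `urlencode` percent-escapes the slash in `/wfront` to `%2F`, which is equivalent to the literal URLs above.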
For anyone else with a similar problem: Wireshark is your friend.