How do I get the HTML of a wiki page with Pywikibot?

2k views

I'm using pywikibot-core, and before that I used another Python MediaWiki API wrapper, Wikipedia.py (which has a .HTML method). I switched to pywikibot-core because I think it has many more features, but I can't find a similar method. (Beware: I'm not very skilled.)

6

There are 6 answers

2
Aubrey On BEST ANSWER

I'll post user283120's second answer here, which is more precise than the first one:

Pywikibot core doesn't support any direct (HTML) way to interact with a wiki, so you should use the API. If you need to, you can do it easily by using urllib2.

This is an example I used to get the HTML of a wiki page on Commons:

import urllib2
...
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ", "_")
html = urllib2.urlopen(url).read().decode('utf-8')
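Note that urllib2 is Python 2 only. A minimal Python 3 sketch of the same idea, assuming page is a pywikibot.Page as above:

import urllib.request

# build the page URL from the title, as in the urllib2 example
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ", "_")
# fetch and decode the raw HTML
html = urllib.request.urlopen(url).read().decode('utf-8')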

2
Nemo On

IIRC you want the HTML of the entire page, so you need something that uses api.php?action=parse. In Python I'd often just use wikitools for such a thing; I don't know about PWB or the other requirements you have.
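For illustration, a minimal sketch of calling api.php?action=parse directly with the requests library (the page title is just an example; this bypasses both PWB and wikitools):

import requests

# ask the MediaWiki API to render the page to HTML
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "parse", "page": "Elvis Presley", "format": "json"},
)
# the rendered HTML fragment lives under parse -> text -> *
html = resp.json()["parse"]["text"]["*"]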

1
valepert On

"[saveHTML.py] downloads the HTML-pages of articles and images and saves the interesting parts, i.e. the article-text and the footer to a file"

source: https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/saveHTML.py

0
Amir Sarabadani On

In general you should use pywikibot instead of wikipedia (e.g. instead of "import wikipedia" you should use "import pywikibot"), and if you are looking for methods and classes that used to be in wikipedia.py, they are now separated and can be found in the pywikibot folder (mainly in page.py and site.py).
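A minimal before/after sketch of that rename (the compat-era lines are shown as comments; the page title is just an example):

# compat (old):
# import wikipedia
# page = wikipedia.Page(wikipedia.getSite(), 'Example')

# core (new):
import pywikibot

site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Example')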

If you want to run scripts that you wrote for compat, you can use a script in pywikibot-core named compat2core.py (in the scripts folder), and there is detailed help about the conversion in README-conversion.txt; read it carefully.

0
xqt On

With Pywikibot you may use http.request() to get the HTML content:

import pywikibot
from pywikibot.comms import http

site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
print(r[94:135])

This should give the HTML content:

'<title>Elvis Presley – Wikipedia</title>\n'

With Pywikibot 6.0, http.request() gives a requests.Response object rather than plain text. In this case you must use the text attribute:

print(r.text[94:135])

to get the same result.
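A small version-tolerant sketch that combines both cases (same site and page as above; the hasattr check is just one way to handle it):

import pywikibot
from pywikibot.comms import http

site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
# Pywikibot >= 6.0 returns a requests.Response; older versions return str
html = r.text if hasattr(r, 'text') else r
print(html[94:135])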

2
Wolfgang Fahl On

The MediaWiki API has a parse action which allows you to get the HTML snippet for wiki markup, as returned by the MediaWiki markup parser.

For the pywikibot library there is already a function implemented which you can use like this:

def getHtml(self, pageTitle):
    '''
    get the HTML code for the given page title

    Args:
        pageTitle(str): the title of the page to retrieve

    Returns:
        str: the rendered HTML code for the page
    '''
    page = self.getPage(pageTitle)
    html = page._get_parsed_page()
    return html
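For plain pywikibot, outside the wrapper class above, a minimal sketch of the same call; note that _get_parsed_page() is a private helper and may change between releases:

import pywikibot

site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Elvis Presley')
# private helper that wraps the API's parse action
html = page._get_parsed_page()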

When using the mwclient Python library, there is a generic api method; see: https://github.com/mwclient/mwclient/blob/master/mwclient/client.py

Which can be used to retrieve the html code like this:

def getHtml(self, pageTitle):
    '''
    get the HTML code for the given page title

    Args:
        pageTitle(str): the title of the page to retrieve

    Returns:
        str: the rendered HTML code for the page
    '''
    api = self.getSite().api("parse", page=pageTitle)
    if "parse" not in api:
        raise Exception("could not retrieve html for page %s" % pageTitle)
    html = api["parse"]["text"]["*"]
    return html

As shown above, this gives a duck-typed interface, which is implemented in the py-3rdparty-mediawiki library, for which I am a committer. This was resolved by closing issue 38 - add html page retrieval.
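For completeness, a minimal standalone mwclient sketch of the same generic api call (example title, anonymous connection):

import mwclient

site = mwclient.Site('en.wikipedia.org')
# generic api() call: action=parse renders the page to HTML
api = site.api('parse', page='Elvis Presley')
html = api['parse']['text']['*']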