How do I get the HTML of a wiki page with Pywikibot?

Question

2.1k views Asked by Aubrey At 12 December 2014 at 11:33

I'm using pywikibot-core; before that I used another Python MediaWiki API wrapper, Wikipedia.py (which has a .HTML method). I switched to pywikibot-core because I think it has many more features, but I can't find a similar method. (Beware: I'm not very skilled.)

There are 6 answers

Nemo On 12 December 2014 at 16:27

IIRC you want the HTML of entire pages, so you need something that uses api.php?action=parse. In Python I'd often just use wikitools for this; I don't know about PWB or what other requirements you have.
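A minimal sketch of the `action=parse` request mentioned above, using only the Python 3 standard library rather than wikitools or Pywikibot. The endpoint URL, the helper names, and the User-Agent string are assumptions for illustration; the `parse` → `text` → `*` lookup matches the API's default JSON layout.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for illustration; substitute your own wiki's api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title, api_url=API_URL):
    """Return the api.php URL that asks MediaWiki to render `title` as HTML."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def fetch_parsed_html(title, api_url=API_URL):
    """Fetch the rendered HTML of `title` (performs a network request)."""
    request = urllib.request.Request(
        build_parse_url(title, api_url),
        # Hypothetical User-Agent value; pick one that identifies your script.
        headers={"User-Agent": "html-fetch-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    if "parse" not in data:
        raise LookupError("could not retrieve html for page %s" % title)
    return data["parse"]["text"]["*"]
```

Calling `fetch_parsed_html("Elvis Presley")` would then return the rendered article body rather than the complete page with skin and navigation.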

valepert On 12 December 2014 at 12:04

"[saveHTML.py] downloads the HTML-pages of articles and images and saves the interesting parts, i.e. the article-text and the footer to a file"

source: https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/saveHTML.py

Amir Sarabadani On 12 December 2014 at 23:34

In general you should use pywikibot instead of wikipedia (e.g. instead of "import wikipedia" use "import pywikibot"). If you are looking for methods and classes that used to be in wikipedia.py, they have been split up and can now be found in the pywikibot folder (mainly in page.py and site.py).

If you want to run scripts that you wrote for compat, you can use a script in pywikibot-core named compat2core.py (in the scripts folder). There is also detailed help on conversion in README-conversion.txt; read it carefully.

xqt On 03 February 2021 at 15:41

With Pywikibot you may use http.request() to get the html content:

import pywikibot
from pywikibot.comms import http
site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
print(r[94:135])

This should print part of the HTML content:

'<title>Elvis Presley – Wikipedia</title>\n'

As of Pywikibot 6.0, http.request() returns a requests.Response object rather than plain text. In that case you must use its text attribute:

print(r.text[94:135])

to get the same result.
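If a script has to run on both sides of that change, a small helper can hide the difference. This is only a sketch: `response_text` is a hypothetical name, not part of Pywikibot, and it assumes just the two return types described above.

```python
def response_text(result):
    """Return the body as str, whether `result` is a str or a Response-like object."""
    if isinstance(result, str):
        # Older behaviour: http.request() already returned the text.
        return result
    # Newer behaviour: the text lives on the response object's .text attribute.
    return result.text


# Intended use with the snippet above (needs pywikibot and network access):
# html = response_text(http.request(site, path))
# print(html[94:135])
```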

Wolfgang Fahl On 27 December 2020 at 11:14

The MediaWiki API has a parse action that returns the HTML snippet for a piece of wiki markup, as rendered by the MediaWiki markup parser.

For the pywikibot library there is already a function implemented which you can use like this:

def getHtml(self, pageTitle):
    '''
    get the HTML code for the given page Title

    Args:
        pageTitle(str): the title of the page to retrieve

    Returns:
        str: the rendered HTML code for the page
    '''
    page = self.getPage(pageTitle)
    html = page._get_parsed_page()
    return html

The mwclient Python library has a generic api method (see https://github.com/mwclient/mwclient/blob/master/mwclient/client.py), which can be used to retrieve the HTML code like this:

def getHtml(self, pageTitle):
    '''
    get the HTML code for the given page Title

    Args:
        pageTitle(str): the title of the page to retrieve
    '''
    # ask the MediaWiki API to parse (render) the page
    api = self.getSite().api("parse", page=pageTitle)
    if "parse" not in api:
        raise Exception("could not retrieve html for page %s" % pageTitle)
    html = api["parse"]["text"]["*"]
    return html

As shown above, this gives a duck-typed interface, which is implemented in the py-3rdparty-mediawiki library, for which I am a committer. This was resolved by closing issue 38 - add html page retrieval
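The `["parse"]["text"]["*"]` lookup above matches the API's default JSON layout. When a request is made with `formatversion=2`, the HTML comes back as a plain string under `text` instead, so a tolerant extractor could look like the sketch below. `extract_parse_html` is a hypothetical helper, not part of mwclient or py-3rdparty-mediawiki.

```python
def extract_parse_html(api_result, page_title):
    """Pull the rendered HTML out of an action=parse result dictionary."""
    if "parse" not in api_result:
        raise LookupError("could not retrieve html for page %s" % page_title)
    text = api_result["parse"]["text"]
    # Default layout wraps the HTML in {"*": ...}; formatversion=2 gives a plain str.
    if isinstance(text, dict):
        return text["*"]
    return text
```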

**Aubrey** · Accepted Answer · 2014-12-14T22:54:00+00:00

I'll post here user283120's second answer, which is more precise than the first one:

Pywikibot core doesn't support any direct (HTML) way to interact with the wiki, so you should use the API. If you need to, you can do it easily with urllib2.

This is an example I used to get the HTML of a wiki page on Commons:

import urllib2
...
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ","_")
html = urllib2.urlopen(url).read().decode('utf-8')
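`urllib2` exists only in Python 2. A rough Python 3 equivalent, using `urllib.request` and percent-encoding the title, might look like this. The function names and the User-Agent string are made up for the example, and this fetches the complete page, skin included, not just the article body.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://commons.wikimedia.org/wiki/"


def page_url(title, base_url=BASE_URL):
    """Build the article URL: spaces become underscores, the rest is percent-encoded."""
    # Keep "/" and ":" readable so namespaced titles such as "File:..." stay intact.
    return base_url + urllib.parse.quote(title.replace(" ", "_"), safe="/:")


def fetch_page_html(title, base_url=BASE_URL):
    """Download the full HTML page for `title` (performs a network request)."""
    request = urllib.request.Request(
        page_url(title, base_url),
        headers={"User-Agent": "html-fetch-example/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a Pywikibot page object, `fetch_page_html(page.title())` would stand in for the two `urllib2` lines above.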
