I'm using pywikibot-core. Before that I used another Python MediaWiki API wrapper, Wikipedia.py (which has a .HTML method). I switched to pywikibot-core because I think it has many more features, but I can't find a similar method. (Beware: I'm not very skilled.)
How do I get the HTML of a wiki page with Pywikibot?
Asked by Aubrey. There are 6 answers.
IIRC you want the HTML of the entire page, so you need something that uses api.php?action=parse. In Python I'd often just use wikitools for such a thing; I don't know about PWB or the other requirements you have.
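For what it's worth, here is a minimal standard-library sketch of what a call to api.php?action=parse can look like without any wrapper. The endpoint URL, the User-Agent string and the function names (`build_parse_url`, `extract_html`, `fetch_html`) are my own assumptions for illustration, not part of wikitools or Pywikibot, and the JSON layout is the one the API returns with its default (legacy) format version.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; replace with the api.php of the wiki you are working on.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title, api_url=API_URL):
    """Return the api.php URL that asks action=parse for the rendered page text."""
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    return api_url + "?" + urlencode(params)


def extract_html(response):
    """Pull the HTML string out of a decoded action=parse JSON response."""
    if "parse" not in response:
        raise ValueError("no 'parse' key in response: %r" % (response,))
    return response["parse"]["text"]["*"]


def fetch_html(title, api_url=API_URL):
    """Fetch the rendered HTML of a page (needs network access)."""
    request = Request(build_parse_url(title, api_url),
                      headers={"User-Agent": "html-fetch-example/0.1"})
    with urlopen(request) as reply:
        return extract_html(json.loads(reply.read().decode("utf-8")))
```

Calling `fetch_html("Elvis Presley")` should then return the parser output for that page as one string; the other two functions work offline, so they are easy to check on their own.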
"[saveHTML.py] downloads the HTML-pages of articles and images and saves the interesting parts, i.e. the article-text and the footer to a file"
source: https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/saveHTML.py
In general you should use pywikibot instead of wikipedia (e.g. instead of "import wikipedia" you should use "import pywikibot"). If you are looking for methods and classes that used to be in wikipedia.py, they are now split up and can be found in the pywikibot folder (mainly in page.py and site.py).
If you want to run scripts that you wrote for compat, you can use a script in pywikibot-core named compat2core.py (in the scripts folder). There is also detailed help on the conversion in README-conversion.txt; read it carefully.
With Pywikibot you may use http.request() to get the HTML content:
import pywikibot
from pywikibot.comms import http
site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
print(r[94:135])
This should give the html content
'<title>Elvis Presley – Wikipedia</title>\n'
With Pywikibot 6.0, http.request() gives a requests.Response object rather than plain text. In this case you must use the text attribute:
print(r.text[94:135])
to get the same result.
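If a script has to run on both sides of that change, a small helper can hide the difference. This is only a sketch: `response_text` is a hypothetical name of mine, not a Pywikibot function, and it assumes the newer return value exposes the body as a `text` attribute, as described above.

```python
def response_text(result):
    """Return the body as str, whether `result` is plain text or a Response-like object."""
    # Older Pywikibot versions returned the page text directly.
    if isinstance(result, str):
        return result
    # Assumption: newer versions return an object with a `.text` attribute
    # (requests.Response, per the answer above).
    return result.text
```

With that, `print(response_text(r)[94:135])` would work for either kind of return value.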
The MediaWiki API has a parse action that returns the HTML snippet produced by the MediaWiki markup parser for a page's wiki markup.
The pywikibot library already implements a function for this, which you can use like this:
def getHtml(self,pageTitle):
    '''
    get the HTML code for the given page Title
    Args:
        pageTitle(str): the title of the page to retrieve
    Returns:
        str: the rendered HTML code for the page
    '''
    page=self.getPage(pageTitle)
    html=page._get_parsed_page()
    return html
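Since `_get_parsed_page` is a private method, a standalone variant could look for a public name first. The sketch below assumes that newer Pywikibot releases expose the same call as `get_parsed_page()`; I have not checked that against every version, so treat it as an assumption. `get_html` is a hypothetical helper name.

```python
def get_html(page):
    """Return the rendered HTML for a Pywikibot page object."""
    # Assumption: newer Pywikibot versions offer a public get_parsed_page();
    # otherwise fall back to the private method used in the snippet above.
    getter = getattr(page, "get_parsed_page", None)
    if getter is None:
        getter = getattr(page, "_get_parsed_page")
    return getter()
```

Usage would be `get_html(pywikibot.Page(site, 'Elvis Presley'))`, with `site` created as in the earlier answer.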
The mwclient Python library has a generic api method, see: https://github.com/mwclient/mwclient/blob/master/mwclient/client.py
It can be used to retrieve the HTML code like this:
def getHtml(self,pageTitle):
    '''
    get the HTML code for the given page Title
    Args:
        pageTitle(str): the title of the page to retrieve
    '''
    api=self.getSite().api("parse",page=pageTitle)
    if "parse" not in api:
        raise Exception("could not retrieve html for page %s" % pageTitle)
    html=api["parse"]["text"]["*"]
    return html
As shown above, this gives a duck-typed interface, which is implemented in the py-3rdparty-mediawiki library, for which I am a committer. This was resolved by closing issue 38 (add html page retrieval).
I'll post here user283120's second answer, which is more precise than the first one:
Pywikibot core doesn't support any direct (HTML) way to interact with a wiki, so you should use the API. If you need to, you can do it easily by using urllib2.
This is an example I used to get the HTML of a wiki page on Commons:
import urllib2
...
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ","_")
html = urllib2.urlopen(url).read().decode('utf-8')
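Note that urllib2 only exists in Python 2. A rough Python 3 equivalent of the same idea, using urllib.request, might look like the sketch below. The helper names (`page_url`, `fetch_page_html`) and the User-Agent string are my own assumptions, and the title is passed in as a plain string rather than taken from a Pywikibot page object.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Same base URL as in the urllib2 example above.
BASE_URL = "https://commons.wikimedia.org/wiki/"


def page_url(title, base=BASE_URL):
    """Build the article URL: spaces become underscores, the rest is percent-encoded."""
    # Keep "/" and ":" readable so namespaces such as "File:" stay intact.
    return base + quote(title.replace(" ", "_"), safe="/:")


def fetch_page_html(title):
    """Download the full HTML of the page (needs network access)."""
    request = Request(page_url(title),
                      headers={"User-Agent": "html-fetch-example/0.1"})
    with urlopen(request) as reply:
        return reply.read().decode("utf-8")
```

Unlike the action=parse route, this returns the whole page including the skin (navigation, footer and so on), not just the parsed article body.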