In Heritrix crawler tool how to extract the contents from crawled urls

Question

In Heritrix crawler tool how to extract the contents from crawled urls

1k views Asked by Dharmaraja.k At 28 August 2013 at 11:04

Am new to heritrix tool and now i am able to crawl the web pages from www and now want to extract the contents of the crawled urls.

please help me any one.please.Thanks in Advance.

Original Q&A

There are 1 answers

**Dharmaraja.k** · Accepted Answer · 2013-09-23T07:13:11+00:00

 1.first download the file  wget http://python.org/ftp/python/3.3.0/Python-3.3.0.tgz or higher version as root user.
 2. change the directory to installed python
 3. example /opt/python3.3/;
 4. configure the files ./configure --prefix=/opt/python3.3
 5.make
 6. sudo make install
 7. /opt/python3.3/bin/python3
 8.opt/python3.3/bin/pyvenv ~/py33
 9.source ~/py33/bin/activate
 10. wget http://python-distribute.org/distribute_setup.py
 11.python distribute_setup.py  
 12. easy_install pip
 13. pip install bottle
 14. pip install warcat 
 15. if successfully installed warcat then check whether your warcat is installed or not.
 16. python3 -m warcat --help after enter then we can see some help commands like, list,concat,extract etc..
 17.python3 -m warcat list example/at.warc.gz
 this is worked for me ..enjoy

TechQA.

In Heritrix crawler tool how to extract the contents from crawled urls

There are 1 answers

Related Questions in JAVA

Related Questions in SPRING

Related Questions in HERITRIX

Popular Questions

Trending Questions