How to deal with large datasets when fetching URLs in Google Refine?


So, I have an Excel sheet with around 190,000 movie titles from Freebase, and I'd like to get information from Wikipedia by fetching a URL for each title. That takes a long time: I left my computer running for 8 hours and it only got to 2%. Sometimes my internet connection drops and I have to start all over again from the beginning. Is there any way I could do this 100 records at a time, continuing until the end of the file, so that I could resume the process if my internet drops?

Thanks a lot.


1 Answer

Tom Morris (Best Answer)

~200K fetches is probably the point where you ought to start looking at the Freebase or Wikipedia bulk dumps instead. The default Refine fetch rate interval is 5000 msec (i.e. 5 seconds), which is much more conservative than most web services require; you could probably lower it to 500 msec or less.
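For a sense of scale, here's a quick back-of-envelope calculation (the ~190K row count comes from the question; the two delays are Refine's default and the suggested lower value):

```python
# Back-of-envelope: why the default 5-second throttle crawls on ~190K rows.
rows = 190_000
for delay_s in (5.0, 0.5):
    hours = rows * delay_s / 3600
    print(f"{delay_s} s/request -> ~{hours:,.0f} hours ({hours / 24:.1f} days)")
# 5.0 s/request -> ~264 hours (11.0 days)
# 0.5 s/request -> ~26 hours (1.1 days)
```

That also matches what you saw: 8 hours at one fetch per 5 seconds is roughly 5,760 fetches, which is about 3% of 190,000.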

You also don't need to run this from your personal computer. You could use Amazon's EC2 or another service with permanent connectivity and engineered uptime.

Unfortunately, Refine's "Add column by fetching URLs" operation is currently not restartable, so you need to make sure you can complete it in a single run. If you can't guarantee uptime/connectivity, your only other options are to a) do the operation in smaller chunks or b) use a different tool, as in the sketch below.
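Here's a minimal sketch of option (b), done outside Refine so progress survives a dropped connection. The file names are hypothetical, and it assumes you've exported the titles to a one-column CSV; it uses Wikipedia's public `api.php` endpoint (the TextExtracts `prop=extracts` query) to grab each article's intro. Because results are appended one row at a time, restarting the script simply skips whatever is already done:

```python
# Resumable, throttled fetch of Wikipedia intros for a list of titles.
import csv
import time
import requests

INPUT = "titles.csv"     # hypothetical: one movie title per row, first column
OUTPUT = "results.tsv"   # append-only, so progress survives a crash

# Load titles already fetched so a restart skips completed work.
done = set()
try:
    with open(OUTPUT, encoding="utf-8") as f:
        for line in f:
            done.add(line.split("\t", 1)[0])
except FileNotFoundError:
    pass

with open(INPUT, newline="", encoding="utf-8") as src, \
     open(OUTPUT, "a", encoding="utf-8") as out:
    for row in csv.reader(src):
        title = row[0]
        if title in done:
            continue
        # Query Wikipedia's API for the plain-text intro of this title.
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "prop": "extracts", "exintro": 1,
                    "explaintext": 1, "format": "json", "titles": title},
            timeout=30,
        )
        resp.raise_for_status()
        pages = resp.json()["query"]["pages"]
        extract = next(iter(pages.values())).get("extract", "")
        # Flatten whitespace so each record stays on one TSV line.
        out.write(title + "\t"
                  + extract.replace("\t", " ").replace("\n", " ") + "\n")
        out.flush()        # persist each record immediately
        time.sleep(0.5)    # ~500 ms between requests, per the rate advice above
```

If the connection drops, just rerun the script: the `done` set rebuilt from the output file acts as the checkpoint, so nothing is fetched twice.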