How to deal with large datasets when fetching URLs in Google Refine?


So, I have an Excel sheet with around 190,000 movie titles from Freebase, and I'd like to get information from Wikipedia by fetching a URL for each title. That takes a long time: I left my computer running for 8 hours and it only got to 2%. Sometimes my internet connection drops and I have to start all over again from the beginning. Is there any way I could do this 100 records at a time, continuing until the end of the file, so that I could resume the process if my internet drops?

Thanks a lot.


1 Answer

Tom Morris (Best Answer)

~200K fetches is probably the point where you ought to start looking at the Freebase or Wikipedia bulk dumps instead. The default Refine fetch rate interval is 5000 msec (i.e. 5 seconds), which is much more conservative than most web services require; you could probably lower it to 500 msec or less.
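For a sense of scale, here's a quick back-of-envelope calculation (the ~190K row count comes from the question; the two delays are Refine's default and the suggested lower value):

```python
# Back-of-envelope: why the default 5-second throttle crawls on ~190K rows.
rows = 190_000
for delay_s in (5.0, 0.5):
    hours = rows * delay_s / 3600
    print(f"{delay_s} s/request -> ~{hours:,.0f} hours ({hours / 24:.1f} days)")
# 5.0 s/request -> ~264 hours (11.0 days)
# 0.5 s/request -> ~26 hours (1.1 days)
```

That also matches what you saw: 8 hours at one fetch per 5 seconds is roughly 5,760 fetches, which is about 3% of 190,000.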

You also don't need to run this from your personal computer. You could use Amazon's EC2 or another service with permanent connectivity and engineered uptime.

Unfortunately, Refine's "Add column by fetching URLs" operation is currently not restartable, so you need to make sure you can complete it in a single run. If you can't guarantee uptime/connectivity, your only other options are to a) do the operation in smaller chunks or b) use a different tool, as in the sketch below.
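Here's a minimal sketch of option (b), done outside Refine so progress survives a dropped connection. The file names are hypothetical, and it assumes you've exported the titles to a one-column CSV; it uses Wikipedia's public `api.php` endpoint (the TextExtracts `prop=extracts` query) to grab each article's intro. Because results are appended one row at a time, restarting the script simply skips whatever is already done:

```python
# Resumable, throttled fetch of Wikipedia intros for a list of titles.
import csv
import time
import requests

INPUT = "titles.csv"     # hypothetical: one movie title per row, first column
OUTPUT = "results.tsv"   # append-only, so progress survives a crash

# Load titles already fetched so a restart skips completed work.
done = set()
try:
    with open(OUTPUT, encoding="utf-8") as f:
        for line in f:
            done.add(line.split("\t", 1)[0])
except FileNotFoundError:
    pass

with open(INPUT, newline="", encoding="utf-8") as src, \
     open(OUTPUT, "a", encoding="utf-8") as out:
    for row in csv.reader(src):
        title = row[0]
        if title in done:
            continue
        # Query Wikipedia's API for the plain-text intro of this title.
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "prop": "extracts", "exintro": 1,
                    "explaintext": 1, "format": "json", "titles": title},
            timeout=30,
        )
        resp.raise_for_status()
        pages = resp.json()["query"]["pages"]
        extract = next(iter(pages.values())).get("extract", "")
        # Flatten whitespace so each record stays on one TSV line.
        out.write(title + "\t"
                  + extract.replace("\t", " ").replace("\n", " ") + "\n")
        out.flush()        # persist each record immediately
        time.sleep(0.5)    # ~500 ms between requests, per the rate advice above
```

If the connection drops, just rerun the script: the `done` set rebuilt from the output file acts as the checkpoint, so nothing is fetched twice.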