Google Safe Browsing API URL encoding (canonicalization)


In my application I check user-entered URLs for malware by sending them to Google.

To test the "malware found" response, I used the URL http://malware.testing.google.test/testing/malware

To my surprise, this URL was not marked as malware.

While experimenting, I found that when I add a trailing slash, the URL does get flagged as malware.

The documentation says that URLs need to be canonicalized.

Does anyone know of an implementation of this requirement (preferably in C#)?


There are 2 answers

Antoon Meijer On BEST ANSWER

Using the link ForguesR provided I have created this C# implementation.

It passes 26 of the 33 tests from the Google test suite found at: https://developers.google.com/safe-browsing/developers_guide_v3#Canonicalization

It has been deemed good enough for production, since the only cases it fails on involve the more obscure URLs.

Code: https://dotnetfiddle.net/xO9sWl
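
For readers who only want the gist, the main steps described in that documentation section can be sketched roughly as follows. This is a Python illustration, not the C# code linked above, and it is deliberately incomplete: it assumes no IP-address normalization, no IDN handling, and no userinfo in the URL. The names `canonicalize`, `_unescape_fully` and `_escape` are made up for this example.

```python
import re
from urllib.parse import unquote_to_bytes


def _unescape_fully(data: bytes) -> bytes:
    """Percent-decode repeatedly until the value stops changing."""
    while True:
        decoded = unquote_to_bytes(data)
        if decoded == data:
            return data
        data = decoded


def _escape(data: bytes) -> str:
    """Escape bytes <= 32, >= 127, '#' and '%' using uppercase hex."""
    return "".join(
        "%{:02X}".format(b) if (b <= 32 or b >= 127 or b in b"#%") else chr(b)
        for b in data
    )


def canonicalize(url: str) -> str:
    # Drop surrounding whitespace, then tab/CR/LF anywhere in the URL.
    raw = re.sub(rb"[\t\r\n]", b"", url.strip().encode("utf-8"))
    # Remove the fragment, then undo any (possibly nested) percent-escapes.
    raw = _unescape_fully(raw.split(b"#", 1)[0])

    # Assumption: a URL without a scheme is treated as http.
    m = re.match(rb"([A-Za-z][A-Za-z0-9+.\-]*)://(.*)", raw, re.S)
    scheme, rest = (m.group(1), m.group(2)) if m else (b"http", raw)

    # The query string is kept exactly as it is.
    hostpath, qmark, query = rest.partition(b"?")
    host, _, path = hostpath.partition(b"/")

    # Host: strip leading/trailing dots, collapse runs of dots, lowercase.
    host = re.sub(rb"\.{2,}", b".", host.strip(b".")).lower()

    # Path: resolve "." and "..", and collapse runs of slashes.
    parts = path.split(b"/")
    segments = []
    for seg in parts:
        if seg == b"..":
            if segments:
                segments.pop()
        elif seg not in (b"", b"."):
            segments.append(seg)
    new_path = b"/" + b"/".join(segments)  # an empty path becomes "/"
    if segments and parts[-1] in (b"", b".", b".."):
        new_path += b"/"  # keep a trailing slash that was already there

    result = scheme + b"://" + host + new_path
    if qmark:
        result += b"?" + query
    return _escape(result)
```

With this sketch, `canonicalize("http://malware.testing.google.test/testing/malware")` returns the URL unchanged: a trailing slash is only supplied when the path is empty, so the two test URLs from the question still differ after canonicalization.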

ForguesR On

I am working on the same problem right now, and the only thing I have found is a Java implementation in the jGoogleSafeBrowsing library. Unfortunately, it is stuck on v2 of the API.

Anyhow, you can have a look at the canonicalization code here. Be aware that :