This code searches through a website's HTML files and extracts a list of domain names...
httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)'
The result looks like this.
- domain1.com
- domain2.com
- domain3.com
I plan to use this code on very large websites, so it will generate a very large list of domain names. The code above also generates a lot of duplicate domain names, so I set up a MySQL database with a unique field so that duplicates will not be inserted.
Using my limited knowledge of programming, I hacked together the line below, but it is not working. When I execute the command, I get no error, just a new command prompt of > and a blinking cursor. I assume I'm not using the correct syntax or methodology, and/or maybe what I want to do is not possible via the command line. Any help is much appreciated.
httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | domain=“$(grep -iEo '[[:alnum:]-]+\.(com|net|org)’)” | mysql -pPASSWORD -e "INSERT INTO domains.domains (domains) VALUES ($domain)”
And yes, my database name is domains, and my table name is domains, and my field name is domains.
Judging from the MySQL syntax for INSERT, you need to convert the domain names into parenthesized, quoted, comma-separated items, such as ('domain1.com'),('domain2.com'), and then attach this list to the end of the INSERT statement you generated.
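A pipeline along these lines could be sketched as follows. It is a reconstruction from the description in this answer, not a tested command against a live database: the helper name domains_to_sql is hypothetical, the database, table, and column names (all domains) are taken from the question, and the trailing semicolon is an extra added for safety.

```shell
# Hypothetical helper: read domain names on stdin, write one
# INSERT IGNORE statement on stdout.
# Assumes database, table, and column are all named "domains" (per the question).
domains_to_sql() {
    sort -u |
    sed -e "s/.*/,('&')/" \
        -e "1s/^,/INSERT IGNORE INTO domains.domains (domains) VALUES /" \
        -e '$s/$/;/'
}

# Intended use (not run here; needs httrack and a MySQL server):
#   httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" |
#   grep -iEo '[[:alnum:]-]+\.(com|net|org)' |
#   domains_to_sql |
#   mysql -pPASSWORD
```

Feeding the statement to mysql on standard input avoids the shell-variable step from the question, where a variable assigned in one stage of a pipeline is not visible to the next stage.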
The sort -u ensures that the names are unique. The first -e to sed converts the contents of a line (e.g. domain1.com) into ,('domain1.com'); the second -e removes the comma of the first line (added by the first -e) and replaces it with the INSERT prefix. The IGNORE in the INSERT statement means that if a domain is already in the table, the new entry will be ignored.
Clearly, if the number of domains generated is too large for a valid SQL statement in MySQL, you'll have to do some splitting of the data, but you're likely to be able to process a few thousand domains at a time.
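One way to do the splitting is to start a new INSERT statement every N domains. The sketch below uses awk instead of sed so that it can count lines. The function name domains_to_sql_batched and the default batch size of 1000 are assumptions for illustration; the practical upper limit depends on the server configuration (the max_allowed_packet setting is the usual constraint on statement size).

```shell
# Hypothetical helper: read domain names on stdin, write one INSERT IGNORE
# statement per batch of N unique names (N = first argument, default 1000).
# Assumes database, table, and column are all named "domains" (per the question).
domains_to_sql_batched() {
    batch_size="${1:-1000}"
    sort -u |
    awk -v n="$batch_size" -v q="'" '
        # First line of each batch: close the previous statement, open a new one.
        (NR - 1) % n == 0 {
            if (NR > 1) print ";"
            printf "INSERT IGNORE INTO domains.domains (domains) VALUES "
            sep = ""
        }
        # Every line: emit one quoted, parenthesized value.
        {
            printf "%s(%s%s%s)", sep, q, $0, q
            sep = ","
        }
        # Close the final statement if there was any input at all.
        END { if (NR > 0) print ";" }
    '
}

# Intended use (not run here; needs httrack and a MySQL server):
#   httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" |
#   grep -iEo '[[:alnum:]-]+\.(com|net|org)' |
#   domains_to_sql_batched 1000 |
#   mysql -pPASSWORD
```

Because sort -u has to see all of the input before it prints anything, duplicates are removed across the whole list and not only within each batch.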