How do I push the result of this complex command line grep statement to mysql database?

430 views Asked by At

This code searches through website html files and extracts a list of domain names...

httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)'

The result looks like this.

  • domain1.com
  • domain2.com
  • domain3.com

I plan to use this code on very large websites, therefore this will generate a very large list of domain names. In addition, the above code generates a lot of duplicate domain names. Therefore, I setup a mysql database with a unique field so duplicates will not be inserted.

Using my limited knowledge of programming I hacked together this line below, but this is not working. When I execute the command, I get no error, just a new command prompt of > and a blinking cursor. I assume I'm not using the correct syntax or methodology, and/or maybe what I want to do is not possible via command line. Any help is much appreciated.

httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | domain=“$(grep -iEo '[[:alnum:]-]+\.(com|net|org)’)” | mysql -pPASSWORD -e "INSERT INTO domains.domains (domains) VALUES ($domain)”

And yes, my database name is domains, and my table name is domains, and my field name is domains.

1

There are 1 answers

0
Jonathan Leffler On

Judging from the MySQL syntax for INSERT:

INSERT [LOW_PRIORITY | DELAYED | HIGH_PRIORITY] [IGNORE]
    [INTO] tbl_name
    [PARTITION (partition_name,…)] 
    [(col_name,…)]
    {VALUES | VALUE} ({expr | DEFAULT},…),(…),…
    …

you need to convert the domain names into parenthesized, quoted, comma separated items:

('domain1.com'),('domain2.com'),…

and then attach this list to the end of the INSERT statement you generated.

httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" |
grep -iEo '[[:alnum:]-]+\.(com|net|org)’ |
sort -u |
sed -e "s/.*/,('&')/" -e '1s/,/INSERT IGNORE INTO domains.domains(domain) VALUES /' |
mysql -pPASSWORD

The sort -u ensures that the names are unique. The first -e to sed converts the contents of a line (e.g. domain1.com) into ,('domain1.com); the second -e removes the comma of the first line (added by the first -e) and replaces it with the INSERT prefix. The IGNORE in the INSERT statement means that if a domain is already in the table, the new entry will be ignored.

Clearly, if the number of domains generated is too large for a valid SQL statement in MySQL, you'll have to do some splitting of the data, but you're likely to be able to process a few thousand domains at a time.