PowerShell: Storing a website locally and preserving the object type

I want to save a website locally and keep it as an object for offline processing. I am using PowerShell 5.1.19041.546 on Windows 10.

#Online analysis (works)

$website = Invoke-WebRequest https://www.w3schools.com/html/html_tables.asp
$website | gm

#I get a Microsoft.PowerShell.Commands.HtmlWebResponseObject object.
#Next I pass $website to this function (I call it Get-WebRequestTable), which expects a [Microsoft.PowerShell.Commands.HtmlWebResponseObject] $WebRequest input object: https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/

#Offline analysis: saving the website locally and importing it with Get-Content (does not work)

#Saving the website locally
Invoke-WebRequest -Uri https://www.w3schools.com/html/html_tables.asp -OutFile C:\temp\website
#Reading the website back into a variable
$offlinedata = Get-Content C:\temp\website
#I get string objects (one per line of the file)
$offlinedata | gm
#The strings cannot be used in the function: Get-WebRequestTable : Cannot process argument transformation on parameter 'WebRequest'. Cannot convert the "System.Object[]" value of type "System.Object[]" to type "Microsoft.PowerShell.Commands.HtmlWebResponseObject".
Get-WebRequestTable -WebRequest $offlinedata

#Offline analysis: saving the website locally as XML (does not work)

Invoke-WebRequest -Uri https://www.w3schools.com/html/html_tables.asp | Export-Clixml C:\temp\website.xml

This runs for a very long time and produces the following XML (shortened):

<Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04">
  [...]                  <S>System.__ComObject</S>
                         <S>System.__ComObject</S>

It seems to create an endless loop at this point

 <S>System.__ComObject</S>

#Converting it to JSON to store it locally (does not work)

$website = Invoke-WebRequest -Uri https://www.w3schools.com/html/html_tables.asp
$website | ConvertTo-Json

I get:

ConvertTo-Json : An item with the same key has already been added.

Does anyone know how to store a website locally and later restore the [Microsoft.PowerShell.Commands.HtmlWebResponseObject] object for further processing?

There is 1 answer

Answer by Situ:

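This code loads local HTML into an HTML document object. The result is an [HTMLDocumentClass] COM object rather than a real "HtmlWebResponseObject", but it offers the same DOM methods as the ParsedHtml property of a web response: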

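function convert-localhtml($localhtmlpath){
    $HTML = New-Object -Com "HTMLFile"
    $website = Get-Content "$localhtmlpath" -Raw -ErrorAction Stop
    try {
        # Write HTML content according to DOM Level 2
        $HTML.IHTMLDocument2_write($website)
    }
    catch {
        # Assumed fallback for systems where IHTMLDocument2_write is not exposed
        $HTML.write([System.Text.Encoding]::Unicode.GetBytes($website))
    }
    $HTML
}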
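
The -Raw switch matters here. Without it, Get-Content returns one string per line, which is where the "System.Object[]" in the question's error message comes from. A small sketch of the difference, using a throwaway temp file (the file content is made up for illustration):

```powershell
# Create a temporary three-line file to read back.
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value '<html>', '<body></body>', '</html>'

$lines = Get-Content -Path $tmp        # array with one string per line
$text  = Get-Content -Path $tmp -Raw   # the whole file as a single string

$lines.GetType().Name   # Object[]
$text.GetType().Name    # String

Remove-Item -Path $tmp
```

Only the single string is suitable as input for the document's write method.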

Kudos to Prateek Singh https://ridicurious.com/2017/01/24/powershell-tip-parsing-html-from-a-local-file-or-a-string/

I changed Lee Holmes's code a bit so that it can handle both object types: [Microsoft.PowerShell.Commands.HtmlWebResponseObject] if you use Invoke-WebRequest, or [HTMLDocumentClass] if you use convert-localhtml.

https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/

Kudos to him for his great table extraction code

function Get-WebRequestTable {
    param(
        [Parameter(Mandatory = $true)]
        $WebRequest,
        [Parameter(Mandatory = $true)]
        [int]$TableNumber
    )

    ## Ensure that a supported type was passed
    $typeName = $WebRequest.GetType().Name
    if (($typeName -ne "HTMLDocumentClass") -and ($typeName -ne "HtmlWebResponseObject")) {
        throw "Unsupported argument type. Need [Microsoft.PowerShell.Commands.HtmlWebResponseObject] or [HTMLDocumentClass]."
    }

    ## Extract the tables out of the web request or the local HTML document
    if ($WebRequest -is [Microsoft.PowerShell.Commands.HtmlWebResponseObject]) {
        $tables = @($WebRequest.ParsedHtml.getElementsByTagName("TABLE"))
    }
    else {
        ## [HTMLDocumentClass] argument given
        $tables = @($WebRequest.getElementsByTagName("TABLE"))
    }

    $table = $tables[$TableNumber]
    $titles = @()
    $rows = @($table.Rows)

    ## Go through all of the rows in the table
    foreach ($row in $rows) {
        $cells = @($row.Cells)

        ## If we've found a table header, remember its titles
        if ($cells[0].tagName -eq "TH") {
            $titles = @($cells | ForEach-Object { ("" + $_.InnerText).Trim() })
            continue
        }

        ## If we haven't found any table headers, make up names "P1", "P2", etc.
        if (-not $titles) {
            $titles = @(1..($cells.Count + 2) | ForEach-Object { "P$_" })
        }

        ## Now go through the cells in the row. For each, try to find the
        ## title that represents that column and create a hashtable mapping those
        ## titles to content
        $resultObject = [Ordered]@{}
        for ($counter = 0; $counter -lt $cells.Count; $counter++) {
            $title = $titles[$counter]
            if (-not $title) { continue }
            $resultObject[$title] = ("" + $cells[$counter].InnerText).Trim()
        }

        ## And finally cast that hashtable to a PSCustomObject
        [pscustomobject]$resultObject
    }
}
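
Putting both pieces together, a usage sketch might look like the following. It assumes Windows PowerShell 5.1, that convert-localhtml and Get-WebRequestTable from this answer are already loaded in the session, and that C:\temp exists. The file name website.html and the table index 0 are only illustrations.

```powershell
$uri  = 'https://www.w3schools.com/html/html_tables.asp'
$path = 'C:\temp\website.html'   # hypothetical file name

# Online, once: save the raw HTML to disk.
Invoke-WebRequest -Uri $uri -OutFile $path

# Offline, later: rebuild a DOM document from the file and extract the first table.
$doc = convert-localhtml $path
Get-WebRequestTable -WebRequest $doc -TableNumber 0 | Format-Table -AutoSize

# The same function still accepts a live response object.
$website = Invoke-WebRequest -Uri $uri
Get-WebRequestTable -WebRequest $website -TableNumber 0 | Format-Table -AutoSize
```

On some machines the COM document may report a type name other than HTMLDocumentClass (for example __ComObject), in which case the type check at the top of Get-WebRequestTable would need to be relaxed.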