google apps script: convert an .xls (XHTML) to gsheet spreadsheet

43 views Asked by At

what I need to do

I have this .xls provided to me by third party every month that has 2 tables, I need to retrieve the data from the second table promatically using google apps scrips so I can log that to another sheet for further manipulation.

problem

the .xls file, when converted to a gsheet using Drive API, so I can go over the content to pick what I need, has its content not in the expected spreadsheet format but in this text TBODY > TR > TD format (I think it is XHTML) enter image description here

I know I can open the file and save it as xlsx file which should get the proper excel-to-gsheet converstion in the way intended, but I'd rather not do that if I can

I know I could possibly parse through file and get my data by looking at what's between ..., but I rather work with arrays of rows [[cell1],[cell2],[cell3]] if I can

I tried to use Drive API v2 (Drive.Files.insert) and v3 (Drive.Files.create) to convert the .xls file into to a gsheet but both return a gsheet that has this TBODY > TR > TD format as the content of the file.

Is there a way to get the gsheet in the intended normal spreadsheet format via google apps script?

Thank you in advance

Update:

the data that I'm dealing with looks on the gsheet like the sample below. uppon further inspection I noticed that there are some tables within tables so I decided to focus only on the data (table content) that I need so in the samble below I flagged the start and the end of the data I need and the data that I'm interested in that is contained within those limits, the start and end should be consistent meaning that they will very likely appear on every single file I would be scraping for data that look like this. Lastly, each line below is contained within a cell in column A of my gsheet, as far as I can tell nothing is bleeding over to column B

...
<NOBR>Some header</NOBR> -- start of data I need
</B>
</TD>
<TD width="65" align="left">
<B>
<NOBR>Product Description</NOBR>
</B>
</TD>
<TD align="left">
<B>
<NOBR>Domain Name</NOBR>
</B>
</TD>
...
<TR>
<TD align="left">01-Jan-2024 </TD>
<TD align="left">id1</TD>
<TD align="left">brand1</TD>
<TD align="left">product1</TD>
<TD align="left">abc.com</TD>
<TD align="left">tld1</TD>
<TD align="left">1</TD>
<TD align="right">value1 currencyX</TD>
</TR>
....
<TR>
<TD align="left">01-Jan-2024 </TD>
<TD align="left">id204</TD>
<TD align="left">brand67</TD>
<TD align="left">product99</TD>
<TD align="left">xyz.tld2</TD>
<TD align="left">tld2</TD>
<TD align="left">1</TD>
<TD align="right">value2 currency3</TD>
</TR>
...
TOTAL (Qtys: xxx) --end of the data I need
...
1

There are 1 answers

0
Francisco Cortes On

so here's what I figured out, maybe a more generic solution could be derived from it like capturing an array of tables if there are multiple tables that someone is interested in.

this solution is more like a workaround that I figured out, as it will return a 2d array that can be used to populate a table in gsheet (that part not included here), effectively converting xhtml in a table-like structure, not perfect but suits my needs. hope it helps someone

function parseXhtmlTable(ssid) {
  
  //get the data from the sheet as is
  var tableData = SpreadsheetApp.openById(ssid).getSheets()[0].getDataRange().getValues();

  // Initialize 2D arrays to store parsed data  
  var rows = [];
  var row = [];

  // Flag if we're capturing data and if we're within a row
  let captureData = false;
  let inRow = false;
  
  //set the cell patterns so capture cell content
  var cellPattern = /<TD.*>(.*)<\/TD>/

  // Iterate through each row (table data)
  for (var i = 0; i<tableData.length; i++) {

    // Get the string from the current row
    var rowString = tableData[i][0].toString();
    //console.log(rowString)

    //should we start capturing data
    if(rowString.includes('<NOBR>Some Header</NOBR>')){
      //console.log(i)
      captureData = true;
      continue;
    }

    //should we stop capturing data and break out
    if(rowString.includes('TOTAL (')){
      //console.log(i)
      captureData = false
      break;
    }

    //as long as we're capturing data
    if(captureData){
      
      // flag Start of row end if the corresponding value is found
      if (rowString.includes('<TR>')) {
        inRow = true;
      }
      
      //flag end of row if the corresponding value is found
      if (rowString.includes('</TR>')) {
        
        //not longer within a row
        inRow = false;
        
        //if row is not empty push that to rows array
        if (row != []){
          rows.push(row);
        }
        
        //reinitialize row variable
        row = []; 
      }

      // if in row, and row matches a cell with a distinct pattern
      if (inRow && cellPattern.test(rowString)) {
        
        //get the cell value
        let cellContent = rowString.match(cellPattern)[1]
        
        //and push value to a row
        row.push(cellContent)
      }
    }
  }