Parse error for XML from a URL response (text file) with an HTML-like block at the start


I'm trying to scrape a filing from the SEC's EDGAR database. I can download the text file using requests, but when I try to parse it with the code below I get a parse error. The same code works when I request a .xml URL, but not a .txt URL. The .txt URL returns the following content:

<SEC-HEADER>0001752724-20-203989.hdr.sgml : 20201001
<ACCEPTANCE-DATETIME>20201001132951
ACCESSION NUMBER:       0001752724-20-203989
CONFORMED SUBMISSION TYPE:  NPORT-P
PUBLIC DOCUMENT COUNT:      2
CONFORMED PERIOD OF REPORT: 20200831
FILED AS OF DATE:       20201001
PERIOD START:               20201130

-------------
**
-------------
    FORMER COMPANY: 
        FORMER CONFORMED NAME:  ASA LTD
        DATE OF NAME CHANGE:    20070301

    FORMER COMPANY: 
        FORMER CONFORMED NAME:  ASA BERMUDA LTD
        DATE OF NAME CHANGE:    20030505
</SEC-HEADER>
<DOCUMENT>
<TYPE>NPORT-P
<SEQUENCE>1
<FILENAME>primary_doc.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sec.gov/edgar/nport eis_NPORT_Filer.xsd">
  <headerData>
    <submissionType>NPORT-P</submissionType>
    <isConfidential>false</isConfidential>
    <filerInfo>

      <filer>
        <issuerCredentials>
          <cik>0001230869</cik>
          <ccc>XXXXXXXX</ccc>

My code:

import requests
import xml.etree.ElementTree as ET

url = 'https://www.sec.gov/Archives/edgar/data/1230869/0001752724-20-203989.txt'
response = requests.get(url)
root = ET.fromstring(response.content)  # raises ParseError on the .txt file

Error:

Traceback (most recent call last):

  File "/usr/local/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-83-cd4e6ed59b34>", line 3, in <module>
    root = ET.fromstring(response.content)

  File "/usr/local/anaconda/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)

  File "<string>", line unknown
ParseError: not well-formed (invalid token): line 14, column 38
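
The .txt response starts with the SEC-HEADER block, whose tags are never closed, so the file as a whole is not well-formed XML and ET.fromstring cannot parse it from the first byte. One possible workaround, shown here only as a minimal sketch, is to cut out the part between the `<XML>` wrapper lines and parse that alone. The sketch assumes each embedded document ends with a matching `</XML>` line (the excerpt above is truncated before that point), and the names `XML_BLOCK`, `parse_embedded_xml`, and `sample` are hypothetical. The `sample` bytes are a shortened stand-in for `response.content`.

```python
import re
import xml.etree.ElementTree as ET

# Captures everything between an <XML> line and the matching </XML> line.
# Assumption: every embedded XML document in the .txt file is wrapped this way.
XML_BLOCK = re.compile(rb"<XML>\s*(.*?)\s*</XML>", re.DOTALL)


def parse_embedded_xml(raw: bytes) -> list:
    """Return one parsed root element per <XML>...</XML> block found in raw."""
    return [ET.fromstring(match.group(1)) for match in XML_BLOCK.finditer(raw)]


# Shortened stand-in for response.content (illustrative, not the full filing).
sample = b"""<SEC-HEADER>0001752724-20-203989.hdr.sgml : 20201001
<ACCEPTANCE-DATETIME>20201001132951
</SEC-HEADER>
<DOCUMENT>
<TYPE>NPORT-P
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <headerData>
    <submissionType>NPORT-P</submissionType>
  </headerData>
</edgarSubmission>
</XML>
</TEXT>
</DOCUMENT>
"""

roots = parse_embedded_xml(sample)

# The document declares a default namespace, so lookups need a prefix mapping.
ns = {"n": "http://www.sec.gov/edgar/nport"}
submission_type = roots[0].findtext("n:headerData/n:submissionType", namespaces=ns)
print(submission_type)
```

With the real download, `parse_embedded_xml(response.content)` would take the place of the direct `ET.fromstring(response.content)` call. The regex works on bytes so that the XML declaration's own encoding is left for the parser to handle.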