Parsing .edi file with pyspark

637 views Asked by At

I am trying to parse .edi files with pyspark. I load my files to spark_df with:

spark_df = spark.read.csv(adls_path)

and i get

enter image description here

How to parse _c0 column with PySpark so that every element from both rows is on its own row (it is str and delimiter is ')?

1

There are 1 answers

2
user238607 On

You can use the following python library to parse the EDI files into string format.

https://pypi.org/project/pydifact/

pip install pydifact

I am providing a simple code example of parsing EDI files from below location to their string format.

Sample EDI Files used from this location : https://github.com/smooks/unedifact-examples/tree/master/splitting-camel/sample-data

from pydifact.segmentcollection import Interchange


input_dir = "../data/edi-files"

for edi_file in pathlib.Path(input_dir).glob('*.edi'):

    print("EDI file processing", edi_file)
    interchange = Interchange.from_file(str(edi_file))

    for message in interchange.get_messages():
        for segment in message.segments:
            print('Segment tag: {}, content: {}'.format(
                segment.tag, segment.elements))

    print("File processing", edi_file, "ended")

You can convert the above function to a python udf and then call on the above _c0 and get raw string representaiton.

An example usage of how to call a simple python function as a udf. https://stackoverflow.com/a/34804340/3238085

Following are java libraries which do the same thing as the above python library.

First :

https://github.com/smooks/smooks/tree/master
Maven coordinates : org.smooks:smooks-core:2.0.0-RC2
Allows you to convert : EDI TO XML and then from XML to CSV

Second :

https://github.com/BerryWorksSoftware/edireader
Maven Coordinates : com.berryworks:edireader:5.6.4
Allows you to convert EDI TO XML

Third :

https://github.com/BerryWorksSoftware/edi-json/tree/master/repo/com/berryworks/edireader-json-basic/5.6.2
You can download the jar from this location
Allows you to convert EDI TO JSON 

If you decide to use the above jars, here's an example usage of how to call the java functions from the above jars in pyspark

Running custom Java class in PySpark

Output of the top python script :

EDI file processing ../data/edi-files/DESADV.edi
Segment tag: BGM, content: ['351', '19960445', '4', 'NA']
Segment tag: DTM, content: [['137', '199610180800', '203']]
Segment tag: DTM, content: [['69', '19961020', '102']]
Segment tag: RFF, content: [['ON', '1996100001']]
Segment tag: NAD, content: ['BY', ['7080000043217', '', '9']]
Segment tag: NAD, content: ['SU', ['7080000083121', '', '9']]
Segment tag: RFF, content: [['VA', 'FORETAKSREGISTERET NO987654321MVA']]
Segment tag: CTA, content: ['AD', ['', 'Hans Hansen']]
Segment tag: NAD, content: ['DP', ['7080000083122', '', '9']]
Segment tag: TOD, content: ['4', '', 'DD2']
Segment tag: CPS, content: ['1']
Segment tag: PAC, content: ['1', ['', '50'], '201']
Segment tag: MEA, content: ['PD', ['AAD', '3'], ['KGM', '2']]
Segment tag: HAN, content: [['FTD', '', '9']]
Segment tag: PCI, content: ['30E']
Segment tag: GIN, content: ['SS', '170325200000000185']
Segment tag: LIN, content: ['1', '', ['7037660000197', 'EN']]
Segment tag: PIA, content: ['1', ['12345', 'SA', '', '91']]
Segment tag: IMD, content: ['C', '', 'TU']
Segment tag: IMD, content: ['F', '', ['', '', '', 'HVETEMEL']]
Segment tag: QTY, content: [['12', '14']]
Segment tag: QTY, content: [['59', '6']]
Segment tag: RFF, content: [['ON', '19961198']]
Segment tag: PCI, content: ['30E']
Segment tag: DTM, content: [['137', '199610180800', '203']]
Segment tag: CNT, content: [['2', '1']]
File processing ../data/edi-files/DESADV.edi ended
EDI file processing ../data/edi-files/invoic-d93a.edi
Segment tag: BGM, content: ['380', '891206500']
Segment tag: DTM, content: [['137', '20100926', '102']]
Segment tag: NAD, content: ['II', ['SSESDL', '', '87']]
Segment tag: RFF, content: [['VA', 'SE5562503630']]
Segment tag: RFF, content: [['GN', '00075562503630']]
Segment tag: NAD, content: ['IV', ['33426776', '', '87']]
Segment tag: RFF, content: [['VA', '  SE5565268538']]
Segment tag: NAD, content: ['PE', '', ['SCHENKER AB', '412 97 GÖTEBORG']]
Segment tag: RFF, content: [['BGI', '9423047']]
Segment tag: RFF, content: [['PGI', '9423047']]
Segment tag: CUX, content: [['2', 'SEK', '10'], ['3', 'SEK', '11']]
Segment tag: PAT, content: ['3', '', '66']
Segment tag: DTM, content: [['13', '20101006', '102']]
Segment tag: PAT, content: ['20', '', ['66', '', 'M']]
Segment tag: PCD, content: [['15', '1.8', '13']]
Segment tag: LIN, content: ['1']
Segment tag: MEA, content: ['PD', 'AAD', ['KGM', '177']]
Segment tag: MEA, content: ['PD', 'VOL', ['MTQ', '0.864']]
Segment tag: QTY, content: [['100', '288', 'KGM']]
Segment tag: DTM, content: [['143', '20100927', '102']]
Segment tag: MOA, content: [['203', '1736']]
Segment tag: RFF, content: [['FF', 'SDL3116575']]
Segment tag: RFF, content: [['AAS', 'DDT 38']]
Segment tag: PAC, content: ['1']
Segment tag: LOC, content: ['5', ['20060', '16', '', 'MILANO']]
Segment tag: LOC, content: ['8', ['88152', '16', '', 'STOCKHOLM']]
Segment tag: LOC, content: ['35', ['IT', '162']]
Segment tag: LOC, content: ['28', ['SE', '162']]
Segment tag: NAD, content: ['CN', '', '', 'TT THERMOTECH SCANDINAVIA AB', ['BOX 69', 'NIPAN 59'], '881 22  SOLLEFTEÅ', '', '88122']
Segment tag: NAD, content: ['DP', '', '', 'TT THERMOTECH SCANDINAVIA', 'NIPAN 59', 'SOLLEFTEÅ', '', '88152']
Segment tag: ALC, content: ['C', '', '6', '', ['553', '', '87', 'FRAKT']]
Segment tag: MOA, content: [['8', '1233']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '308.25']]
Segment tag: ALC, content: ['C', '', '6', '', ['586', '', '87', 'VÄGSKATT TYSKLAND']]
Segment tag: MOA, content: [['8', '27']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '6.75']]
Segment tag: ALC, content: ['C', '', '6', '', ['735', '', '87', 'EXPEDITIONSAVGIFT INFÖRSEL']]
Segment tag: MOA, content: [['8', '260']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '65']]
Segment tag: ALC, content: ['C', '', '6', '', ['572', '', '87', 'DRIVMEDELSJUSTERING']]
Segment tag: MOA, content: [['8', '162']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '40.5']]
Segment tag: ALC, content: ['C', '', '6', '', ['573', '', '87', 'VALUTAJUSTERING']]
Segment tag: MOA, content: [['8', '54']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '13.5']]
Segment tag: TDT, content: ['20', '', '3', '', '', '', '', ['', '', '', 'TCO4040']]
Segment tag: UNS, content: ['S']
Segment tag: MOA, content: [['9', '2170']]
Segment tag: MOA, content: [['125', '1736']]
Segment tag: MOA, content: [['176', '434']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['125', '1736']]
Segment tag: MOA, content: [['176', '434']]
File processing ../data/edi-files/invoic-d93a.edi ended