Extracting text from STEP files


I have a number of STEP files with text embedded in them, which I need to extract. Unfortunately, text in STEP files is not stored as characters but as geometry (curves, vertices, splines, etc.), which makes it very difficult to identify in the file. This project is in Python.

My current idea to do this is to build a library of letters, numbers, and punctuation, basically a series of ~40 tiny STEP files that contain a single character. In order to identify and extract text, the structure in the library STEPs will be compared to the main STEP, and anything matching the library template will be identified as its corresponding character; all the characters in the STEP files have the same "font", so to speak, so this should be viable for now.
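To make the matching idea concrete, here is a minimal sketch of a position-independent comparison. It assumes every character appears at the same scale and rotation as its library template, and that each character can be reduced to a set of 2D points (for a character made of a single edge, these would have to be sampled along the curve). The names `signature` and `identify` are hypothetical, not part of any library.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

Point = Tuple[float, float]
Signature = FrozenSet[Point]


def signature(points: Iterable[Point], ndigits: int = 3) -> Signature:
    """Shift points so the bounding-box corner is at (0, 0), then round.

    Two copies of the same glyph at different positions produce the same
    signature, as long as scale and rotation are identical (an assumption).
    """
    pts = list(points)
    min_x = min(p[0] for p in pts)
    min_y = min(p[1] for p in pts)
    return frozenset(
        (round(x - min_x, ndigits), round(y - min_y, ndigits)) for x, y in pts
    )


def identify(points: Iterable[Point], library: Dict[Signature, str]) -> Optional[str]:
    """Return the library character whose signature matches, or None."""
    return library.get(signature(points))


# Hypothetical template for "D", defined near the origin.
library = {signature([(0.0, 0.0), (0.0, 10.0), (6.0, 5.0)]): "D"}

# The same shape found elsewhere in the drawing.
candidate = [(5948.0, 330.0), (5948.0, 340.0), (5954.0, 335.0)]
found = identify(candidate, library)
```

Exact matching after rounding is brittle; a tolerance-based comparison would likely be needed for real data.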

I know this isn't a very good solution, but it's the best I've been able to come up with after a couple of weeks of research. The first step is, of course, compiling the library, for which I am trying to use PythonOCC to "clip out" the characters. Unfortunately, I'm having some difficulty understanding its documentation, and I'm not terribly familiar with OCC in general, so I've been having trouble figuring out how to clip out the parts of the STEP file that represent a given character.

I know the coordinates of a given character thanks to FreeCAD, and with it I've been able to identify the "location" of a given character in the TopExp_Explorer crawler. I am having trouble figuring out how to extract anything, though:

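# Import paths assume pythonocc-core 7.x; they may differ in other versions.
from OCC.Core.BRep import BRep_Tool
from OCC.Core.TopAbs import TopAbs_EDGE
from OCC.Core.TopExp import TopExp_Explorer, topexp_FirstVertex, topexp_LastVertex
from OCC.Extend.DataExchange import read_step_file

def getShape(file_path: str):
    shape = read_step_file(file_path)
    return shape

def print_vertex(va):
    # Return the (x, y, z) coordinates of a vertex.
    pnt = BRep_Tool().Pnt(va)
    return pnt.X(), pnt.Y(), pnt.Z()


file_path = r'<path>'
myshape = getShape(file_path)

# Walk over every edge in the shape.
topExp = TopExp_Explorer()
topExp.Init(myshape, TopAbs_EDGE)

x = (5948, 5960); y = (330, 344)  # bounding box of the "D"
while topExp.More():
    edge = topExp.Current()
    first = print_vertex(topexp_FirstVertex(edge))
    last = print_vertex(topexp_LastVertex(edge))
    # Keep only edges whose two end vertices both lie inside the box.
    if (x[0] < first[0] < x[1]) and (y[0] < first[1] < y[1]) and \
       (x[0] < last[0] < x[1]) and (y[0] < last[1] < y[1]):
        pass  # *Something should happen here*
    topExp.Next()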

The characters appear to be composed of edges and points. For example, the "D" I have given coordinates for is made of a single vertex and a single edge that forms the whole "D" shape, so in theory there should not be many objects to extract for any given character. The problem is that I need not just a PythonOCC edge object but the corresponding STEP file entities, such as a particular structure of VERTEX_POINTs and EDGE_LOOPs, to compile a character library and compare it to any given STEP file.
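Since a STEP file is plain text in which each entity is written as `#id = TYPE(...);` and refers to other entities by `#id`, one way to get at the raw entities is to read the file directly instead of going through OCC. The following is a minimal sketch under that assumption: a naive regex parser that will break on semicolons or `#` references inside string literals. `parse_entities`, `collect_closure`, and the sample data are hypothetical and for illustration only.

```python
import re
from typing import Dict, List

# Naive patterns: assume no ';' or '#<digits>' appears inside string literals.
ENTITY_RE = re.compile(r"#(\d+)\s*=\s*(.*?);", re.S)
REF_RE = re.compile(r"#(\d+)")


def parse_entities(step_text: str) -> Dict[int, str]:
    """Map each entity id to its definition text (the part after '=')."""
    return {
        int(m.group(1)): " ".join(m.group(2).split())
        for m in ENTITY_RE.finditer(step_text)
    }


def collect_closure(entities: Dict[int, str], root_id: int) -> List[int]:
    """Return the ids of root_id and everything it references, transitively."""
    seen = set()
    stack = [root_id]
    while stack:
        eid = stack.pop()
        if eid in seen or eid not in entities:
            continue
        seen.add(eid)
        stack.extend(int(ref) for ref in REF_RE.findall(entities[eid]))
    return sorted(seen)


# Made-up fragment of a DATA section, not taken from a real file.
SAMPLE = """
DATA;
#10 = CARTESIAN_POINT('', (5948.0, 330.0, 0.0));
#11 = VERTEX_POINT('', #10);
#12 = EDGE_CURVE('', #11, #11, #13, .T.);
#13 = CIRCLE('', #14, 5.0);
#14 = AXIS2_PLACEMENT_3D('', #10, $, $);
#20 = CARTESIAN_POINT('', (0.0, 0.0, 0.0));
ENDSEC;
"""

entities = parse_entities(SAMPLE)
glyph_ids = collect_closure(entities, 12)
```

Starting from an EDGE_CURVE id, `glyph_ids` lists the entities that would need to be copied out to describe that edge on its own. Finding the right starting id (for example, by matching CARTESIAN_POINT coordinates against the known character location) is left open here.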

Does anyone know what should go in the IF statement? Alternatively, does anyone know a better way to go about this? Preferably something that can pull text from a STEP file directly . . .
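For reference, one possible direction for the `if` body is to collect the matching edges and write them to their own STEP file with OCC's STEP writer. This is only a sketch, assuming pythonocc-core 7.x import paths; `in_box` and `write_edges_to_step` are hypothetical helper names. The writer re-serializes the geometry, so entity numbering in the output will not match the original file.

```python
def in_box(point, x_range, y_range):
    """True if the point's x and y lie strictly inside the given ranges."""
    return (x_range[0] < point[0] < x_range[1]
            and y_range[0] < point[1] < y_range[1])


def write_edges_to_step(edges, out_path):
    """Bundle edges into a compound and write it as a STEP file."""
    # Imported lazily so in_box stays usable without pythonocc installed.
    # Module paths are an assumption based on pythonocc-core 7.x.
    from OCC.Core.BRep import BRep_Builder
    from OCC.Core.IFSelect import IFSelect_RetDone
    from OCC.Core.STEPControl import STEPControl_AsIs, STEPControl_Writer
    from OCC.Core.TopoDS import TopoDS_Compound

    builder = BRep_Builder()
    compound = TopoDS_Compound()
    builder.MakeCompound(compound)
    for edge in edges:
        builder.Add(compound, edge)

    writer = STEPControl_Writer()
    writer.Transfer(compound, STEPControl_AsIs)
    if writer.Write(out_path) != IFSelect_RetDone:
        raise RuntimeError(f"could not write {out_path}")
```

Inside the loop, the edge would be appended to a list when `in_box(first, x, y) and in_box(last, x, y)` holds, and `write_edges_to_step(matched_edges, "D.step")` would be called after the loop; the output file name here is just an example.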
