How to create an AnalysisEngineDescriptor from an uima-ruta script to use in a SimplePipeline

I'm not able to run a UIMA Ruta script in my simple pipeline. I'm working with the following libraries:

  1. Uimafit 2.0.0
  2. Uima-ruta 2.0.1
  3. ClearTK 1.4.1
  4. Maven

I'm using an org.apache.uima.fit.pipeline.SimplePipeline with:

SimplePipeline.runPipeline(
    UriCollectionReader.getCollectionReaderFromDirectory(filesDirectory), //directory with text files
    UriToDocumentTextAnnotator.getDescription(),
    StanfordCoreNLPAnnotator.getDescription(),//stanford tokenize, ssplit, pos, lemma, ner, parse, dcoref

    AnalysisEngineFactory.createEngineDescription(RUTA_ANALYSIS_ENGINE),//RUTA script

    AnalysisEngineFactory.createEngineDescription(//
        XWriter.class, 
        XWriter.PARAM_OUTPUT_DIRECTORY_NAME, outputDirectory,
        XWriter.PARAM_FILE_NAMER_CLASS_NAME, ViewURIFileNamer.class.getName())
);

What I'm trying to do is use the Stanford CoreNLP annotator (from ClearTK) and then apply a Ruta script. Currently, everything runs without errors and the default Ruta annotations are added to the CAS, but the annotations that my rules create are not.

My script is:

PACKAGE edu.isistan.carcha.concern;
TYPESYSTEM org.cleartk.ClearTKTypeSystem; 
DECLARE persistence
Token{FEATURE("lemma","storage") -> MARK(persistence)};

Looking at the annotated file, the basic Ruta annotations such as "SPACE" or "SW" are there, so the RutaEngine is being created and added to the pipeline.

How do I properly create an AnalysisEngineDescriptor to run a Ruta script?

Note: RUTA_ANALYSIS_ENGINE is the engine descriptor that I copied from the Ruta Workbench.

Answer by apatry (best answer):

Try adding a semicolon after the declaration and using a fully qualified name for the Token annotation:

PACKAGE edu.isistan.carcha.concern;
TYPESYSTEM org.cleartk.ClearTKTypeSystem; 
DECLARE persistence;
org.cleartk.token.type.Token{FEATURE("lemma","storage") -> MARK(persistence)};

Type aliasing in Ruta is a little too aggressive. Every type known to your pipeline is available by its short name, even if you do not import it in your script. If more than one Token type is available to your pipeline, there is currently no way to know which one will be picked (see https://issues.apache.org/jira/browse/UIMA-3322?filter=-2).
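
To check whether your pipeline is exposed to this kind of short-name clash, you can group the fully qualified type names by their short name and look for collisions. The following is only a rough sketch in plain Java, not how Ruta itself resolves names: the class ShortNameCollisions and its methods are made up for illustration, and com.example.other.Token is a hypothetical second Token type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative helper (not part of UIMA or Ruta): finds short type names
// that are shared by more than one fully qualified type name.
public class ShortNameCollisions {

    // Returns the part of a fully qualified type name after the last dot.
    static String shortName(String qualifiedName) {
        int lastDot = qualifiedName.lastIndexOf('.');
        return lastDot < 0 ? qualifiedName : qualifiedName.substring(lastDot + 1);
    }

    // Groups type names by short name and keeps only the short names
    // that map to two or more distinct fully qualified types.
    static Map<String, Set<String>> findAmbiguous(List<String> qualifiedNames) {
        Map<String, Set<String>> byShortName = new TreeMap<>();
        for (String name : qualifiedNames) {
            byShortName.computeIfAbsent(shortName(name), k -> new TreeSet<>()).add(name);
        }
        byShortName.values().removeIf(names -> names.size() < 2);
        return byShortName;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList(
            "org.cleartk.token.type.Token",    // the type used in the rule above
            "com.example.other.Token",         // hypothetical second Token type
            "org.cleartk.token.type.Sentence"  // an unrelated type, for contrast
        );
        findAmbiguous(types).forEach((alias, names) ->
            System.out.println(alias + " is ambiguous: " + names));
    }
}
```

In a real pipeline the list of names would have to come from the type system your engines actually load; here it is hard-coded. Any short name this reports is one you should write out in full in your rules, as done for org.cleartk.token.type.Token above.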