I am trying to load a set of JSON files into MarkLogic 8 using MLCP, applying a basic transform script on ingest.
I can load the files as-is, and I get JSON objects in ML.
What I want is to transform them from JSON to XML on ingestion, so I wrote a basic transform:
xquery version "1.0-ml";

module namespace ingest = "http://dikw.com/ingest/linkedin";

import module namespace json="http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";
import module namespace sem="http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";

declare namespace basic="http://marklogic.com/xdmp/json/basic";
declare default function namespace "http://www.w3.org/2005/xpath-functions";
declare option xdmp:mapping "false";

declare function ingest:transform(
  $content as map:map,
  $context as map:map
) as map:map*
{
  let $org-doc := map:get($content, "value")
  let $jsonxml := json:transform-from-json($org-doc)
  let $name := $jsonxml//basic:full__name
  let $_ := xdmp:log(concat('Inserting linkedin profile ', $name, '.xml..'))
  let $new-doc :=
    document {
      <json>{
        $jsonxml
      }</json>
    }
  return (
    map:put($content, "value", $new-doc),
    $content
  )
};
If I use MLCP to load the docs without the transform, it works, but as stated above I get JSON inside ML8. (I use Roxy to select the environment that MLCP loads into.)
./ml $ENV mlcp import -input_file_path content/linkedin -input_file_type documents
The above works ok.
But using the transform like so:
./ml $ENV mlcp import -input_file_path content/linkedin -input_file_type documents -transform_module /ingest/linkedin.xqy -output_collections incoming,incoming/linkedin
I get an error: "ERROR contentpump.MultithreadedMapper: Unknown content type: json"
15/06/22 17:37:12 INFO contentpump.ContentPump: Hadoop library version: 2.0.0-mr1-cdh4.3.0
15/06/22 17:37:12 INFO contentpump.LocalJobRunner: Content type is set to MIXED. The format of the inserted documents will be determined by the MIME type specification configured on MarkLogic Server.
15/06/22 17:37:12 WARN util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
15/06/22 17:37:12 INFO input.FileInputFormat: Total input paths to process : 9
15/06/22 17:37:13 ERROR contentpump.MultithreadedMapper: Unknown content type: json
java.lang.IllegalArgumentException: Unknown content type: json
    at com.marklogic.mapreduce.ContentType.forName(ContentType.java:107)
    at com.marklogic.contentpump.utilities.TransformHelper.getTransformInsertQry(TransformHelper.java:124)
    at com.marklogic.contentpump.TransformWriter.write(TransformWriter.java:97)
    at com.marklogic.contentpump.TransformWriter.write(TransformWriter.java:46)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
    at com.marklogic.contentpump.DocumentMapper.map(DocumentMapper.java:46)
    at com.marklogic.contentpump.DocumentMapper.map(DocumentMapper.java:32)
    at com.marklogic.contentpump.BaseMapper.runThreadSafe(BaseMapper.java:51)
    at com.marklogic.contentpump.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:376)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
In Query Console the same code works as expected, transforming a JSON variable into an XML document.
What am I missing here?
tx
hugo
As per https://docs.marklogic.com/guide/ingestion/content-pump#id_82518, there seem to be a few things missing.
You are not specifying a document type to store (-document_type xml). You are only storing XML but using "documents" as the input type (assuming these files have a .json extension?), so the code doesn't know that the transform is converting from JSON to XML.
You are not changing the URI, so the default MIME mappings will not know that your input and output types are expected to differ:
https://docs.marklogic.com/guide/ingestion/content-pump#id_17589
No matter what suffix you use, it won't work for JSON input and XML storage without the additional information described in the links above.
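A minimal sketch of how the import command from the question might look with the document type stated explicitly. The -document_type xml option is the one suggested above. The -output_uri_suffix option is an extra assumption on my part, not something confirmed here to be required; it appends .xml to each URI, so foo.json would become foo.json.xml. The default value of ENV is hypothetical. The script only assembles and prints the command (a dry run), so nothing is sent to MarkLogic.

```shell
#!/bin/sh
# Dry run: build the mlcp command line and print it instead of executing it.

# Hypothetical Roxy environment name, used only if ENV is not already set.
ENV="${ENV:-local}"

# -document_type xml     : state that the stored documents are XML
# -output_uri_suffix .xml: assumption, appends .xml to each document URI
set -- import \
  -input_file_path content/linkedin \
  -input_file_type documents \
  -transform_module /ingest/linkedin.xqy \
  -document_type xml \
  -output_uri_suffix .xml \
  -output_collections incoming,incoming/linkedin

cmd="./ml $ENV mlcp $*"
echo "$cmd"
```

Once the printed command looks right, it can be pasted into the shell from the Roxy project directory.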