Scenario: I am a novice with a use case for Elasticsearch. It is a retrieval system whose search requirements are too demanding for a relational database. The data source is external XML files. The initial load is about 4 million records, and after that a few thousand records are updated incrementally every day, also from XML files. I have two ideas.

Method 1, in two stages: First parse the files into the database. This part is already implemented; the main table has about 4 million rows and the three related tables have about 10 million, 4 million, and 7 million rows. Then synchronize only the key search fields to Elasticsearch, so that a search returns primary keys and the detail view is assembled from the database by primary-key lookup. Synchronization would call the Elasticsearch API directly, driven either by a scheduled task that polls an update_time column or by triggers. Current problems:
- The data is not in a single table. We still need to join tables and process the data before synchronizing it to Elasticsearch so that it is easy to search. We plan to build just one index with nested fields. The system will also perform some updates of its own, and we would like those synchronized to Elasticsearch as well.
- Can data consistency between the database and Elasticsearch be guaranteed?
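
As an illustration of the scheduled-task variant of Method 1, the polling step could look roughly like the sketch below. It is only a sketch under assumptions: the table and column names (`main_record`, `related_item`, `title`, `name`), the index name `records`, and the integer `update_time` are all hypothetical, and SQLite stands in for MySQL so the example runs on its own. The actions it builds have the shape accepted by `elasticsearch.helpers.bulk`, but nothing is actually sent.

```python
import sqlite3


def fetch_changed_docs(conn, since):
    """Collect main rows with update_time > since, each with its related rows nested.

    Returns (docs, new_watermark). Table and column names are hypothetical.
    """
    rows = conn.execute(
        "SELECT id, title, update_time FROM main_record "
        "WHERE update_time > ? ORDER BY update_time",
        (since,),
    ).fetchall()
    docs = []
    watermark = since
    for rec_id, title, update_time in rows:
        # One extra query per row keeps the sketch simple; a real job
        # would more likely batch this lookup or use a join.
        children = conn.execute(
            "SELECT name FROM related_item WHERE record_id = ? ORDER BY id",
            (rec_id,),
        ).fetchall()
        docs.append({
            "id": rec_id,
            "title": title,
            # Related rows become a list of objects, i.e. a nested field.
            "items": [{"name": name} for (name,) in children],
        })
        watermark = max(watermark, update_time)
    return docs, watermark


def to_bulk_actions(docs, index_name="records"):
    """Shape documents as bulk actions, using the primary key as the document _id."""
    return [
        {"_op_type": "index", "_index": index_name, "_id": d["id"], "_source": d}
        for d in docs
    ]
```

Because the primary key is reused as `_id`, re-running the same batch overwrites documents instead of duplicating them. Two gaps remain in a sketch like this and both bear on the consistency question: a change to a related row is missed unless it also bumps `update_time` on the main row, and rows committed later with a timestamp equal to the saved watermark are skipped by the strict `>` comparison.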
Method 2: Double-write while parsing the file. While writing each record to MySQL, also assemble the document and call the Elasticsearch API to write it to Elasticsearch. Possible problems I can think of:
- If an exception occurs partway through parsing, do we have to clean up the dirty data manually?
- Does this method require introducing a message queue (MQ)? That is, write to the MQ first, then consume from it, process the data, and write to Elasticsearch. (Is this necessary at my data volume?)
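
To make the dirty-data concern in Method 2 concrete, here is a minimal sketch of a double write that records which documents still need to reach Elasticsearch. It uses a pending-sync table written in the same database transaction as the record (an outbox-style approach, which is a substitute for a message queue and not something from the plan above). Everything named here is hypothetical: the `record` and `pending_sync` tables, and `index_fn`, a callable assumed to send one document to Elasticsearch and raise on failure. SQLite stands in for MySQL so the example runs on its own.

```python
import json
import sqlite3


def save_record(conn, doc, index_fn):
    """Store doc plus a pending marker atomically, then try to index it.

    Returns True if indexing succeeded, False if the marker was left for a retry.
    """
    with conn:  # one transaction: both rows are committed or neither is
        conn.execute(
            "INSERT OR REPLACE INTO record (id, body) VALUES (?, ?)",
            (doc["id"], json.dumps(doc)),
        )
        conn.execute(
            "INSERT OR IGNORE INTO pending_sync (id) VALUES (?)", (doc["id"],)
        )
    try:
        index_fn(doc)
    except Exception:
        # The database row is the source of truth; the marker stays behind.
        return False
    with conn:
        conn.execute("DELETE FROM pending_sync WHERE id = ?", (doc["id"],))
    return True


def retry_pending(conn, index_fn):
    """Re-send every document still marked as pending; return how many succeeded."""
    ids = [row[0] for row in conn.execute("SELECT id FROM pending_sync").fetchall()]
    done = 0
    for rec_id in ids:
        (body,) = conn.execute(
            "SELECT body FROM record WHERE id = ?", (rec_id,)
        ).fetchone()
        try:
            index_fn(json.loads(body))
        except Exception:
            continue  # still failing; leave the marker in place
        with conn:
            conn.execute("DELETE FROM pending_sync WHERE id = ?", (rec_id,))
        done += 1
    return done
```

With this shape, a failed or interrupted Elasticsearch write leaves a marker row rather than silent dirty data, and a scheduled call to `retry_pending` can catch up later. Whether that is enough to avoid a message queue at this data volume is a judgment call the sketch does not settle.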
Which method is better? I also know a little about Flink and Canal, but they seem hard to maintain, and I am not sure they fit my needs: we have no dedicated operations staff, so I would have to maintain them myself. What should I do?
Is there a better data synchronization approach for my use case?