I have 5 text-fields which I want to import into a MySQL/MariaDB database. But there are two problems:
(1) The files are quite large: 0.5 GB to 10 GB
(2) All relevant keys have 40 characters
Point (1) I have to accept as it is and I can't change it. Point 2 is my concern. There are a lot of suggestions in the internet. For example, to use enum for varchar or to use numeric surrogates. There is no problem to add a surrogate key to a table. But the same surrogate key has to be added to other tables. And this is the point where I stuck.
Here the specific information about the files/tables:
table invoice has 3 columns and 20 Mio rows:
- invoice_id (primary key) with distinct values = number of rows
- praxis_id with 4,000 distinct values
- patient_id with 4 Mio distinct values all columns are CHAR(40) and have a fixed length of 40.
table diagnose has 3 columns and 25 Mio rows:
- invoice_id CHAR(40) 1.4 Mio distinct id
- diagnose_type
- diagnose_code
table patient has 5 columns with 5 Mio rows:
- patient_id CHAR(40) not unique (4 Mio distinct pat_id)
- praxis_id CHAR(40)
- year of birth, sex etc.
For example, I want to join invoice with diagnose and patient. It makes sense to index the keys. One way would be to define invoice.invoice_id as primary key and for all other keys in table invoice I would add an index. The same with table diagnose (invoice_id with INDEX) and patient (patient_id as primary key).
The problem is that it took a long time to define invoice.invoice_id as primary key using:
ALTER TABLE invoice_id ADD PRIMARY KEY(invoice_id);
After one hour I killed the process. I think that one problem of performance arises from the kind of datatype of invoice_id in table invoice. One idea could be to add an autoincrementing surrogate key invoice_id_surr when loading the text file. But nevertheless the problem remains if I want to join with table diagnose since I have to join with invoice_id of table diagnose which has not the surrogate key invoice_id_surr as foreign key. I could add an index on diagnose.invoice_id but then I loose the advantage of having a surrogate key on table invoice.
I would be interested in a strategy how to cope with this problem: Several already existing tables which can be joint together but the keys are CHAR(40) and have no index.
Thanks for help.
UPDATE 1: Table specification
- keys have 40 characters [0-9][A-Z]
- These are tables which won't change anymore (no inserts)
-- invoice_id is primary key (unique)
-- patient_id and praxis id for foreign key and not unique in this table
CREATE TABLE invoice (
invoice_id CHAR(40) DEFAULT NULL
, praxis_id CHAR(40) DEFAULT NULL
, patient_id CHAR(40) DEFAULT NULL
, PRIMARY KEY (invoice_id2)
) ENGINE = InnoDB
;
LOAD DATA LOCAL INFILE 'C:/data/invoice.txt'
INTO TABLE invoice
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
;
-- invoice_id is not unique in this table
CREATE TABLE diagnose (
invoice_id CHAR(40) DEFAULT NULL
, diagnose_katalog VARCHAR(20) DEFAULT NULL
, diagnose_code VARCHAR(20) DEFAULT NULL
) ENGINE = InnoDB
;
-- patient_id is not unique in this table since since patient may change praxis
CREATE TABLE patient (
patient_id CHAR(40) DEFAULT NULL
, praxis_id CHAR(40) DEFAULT NULL
, sex CHAR(1) DEFAULT NULL
, birth_year SMALLINT UNSIGNED DEFAULT NULL
, zip_code VARCHAR(20) DEFAULT NULL
) ENGINE = InnoDB
;
You should avoid using natural keys as primary keys in your database for a variety of reasons, which can be found quite easy by googling.
Also, what does it mean "tables are given"? You should not insist on having a one-to-one correspondence between your text files and database tables. Instead, you should design your tables in the way that most suits your needs. Strive for normalization.
For example, you say that
patient_id
inpatient
"table" (you mean file, right?) is not unique. Clearly you need a table where patients are unique; so create a table with distinctpatient_id
and their attributes.patient_id
should be a unique key on that table, but generate a numeric surrogate (an auto_increment field, for example) to serve as a primary key in that table. After that, for example, create apraxis
table, withpraxis_id
as a unique key and a numeric surrogate for primary key. Then you can connectpatient
andpraxis
with a third table, as appropriate for many-to-many relationships. That way you normalize your database: a patient with its attributes is always a single row in a singlepatient
table; whereas now you have the same patient entered multiple times into you currentpatient
file/table, which will cause you trouble sooner or later.