I want to fetch all PMs from vBulletin as conversations like IM does, which should be used in my .NET Core library using Dapper. This means: A send message to B, B replys would be one conversation with two messages. Since this causes performance issues, I tried to figure it out using DBeaver by running the Dapper queries directly.
To fetch the conversations of a page from the inbox, I wrote the following query:
SELECT pm.pmid
FROM pm, pmtext AS txt
WHERE pm.pmtextid = txt.pmtextid
AND (pm.userid = 123 OR txt.fromuserid = 123)
AND pm.folderid != -1
GROUP BY IF(pm.parentpmid != 0, pm.parentpmid, pm.pmid)
LIMIT 0, 50
This gave me the first 50 conversation ids for the user #123. It works, but took ~440ms to execute. I tried adding indices to all relevant fields
ALTER TABLE pmtext ADD INDEX fromuserid_only(fromuserid);
ALTER TABLE pm ADD INDEX userid_only(userid);
ALTER TABLE pm ADD INDEX parentpmid(parentpmid);
but its still slow. It seems caused by the GROUP BY. Even when I just do GROUP BY pm.parentpmid (which would produce wrong data, but just for performance testing), the query run time is not better. When I remove the GROUP BY, it's pretty fast (~12ms).
My query that counts the total pages of conversations is similar without the join and its fast (< 20ms):
// DbConnection db = ...
string sqlTotalPages = @"
SELECT CEIL(COUNT(*)/ 50) AS pages
FROM pm, pmtext AS txt
WHERE pm.pmtextid = txt.pmtextid
AND (pm.userid = 18 OR txt.fromuserid = 18)";
int totalPages = db.QueryFirstOrDefault<int>(sqlTotalPages);
Why does GROUP BY slow down the query so massively? How could I improve the performance?
Table structure from vB
CREATE TABLE `pm` (
`pmid` int(10) unsigned NOT NULL AUTO_INCREMENT,
`pmtextid` int(10) unsigned NOT NULL DEFAULT '0',
`userid` int(10) unsigned NOT NULL DEFAULT '0',
`folderid` smallint(6) NOT NULL DEFAULT '0',
`messageread` smallint(5) unsigned NOT NULL DEFAULT '0',
`parentpmid` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`pmid`),
KEY `pmtextid` (`pmtextid`),
KEY `userid` (`userid`,`folderid`),
KEY `userid_only` (`userid`),
KEY `parentpmid` (`parentpmid`)
) ENGINE=MyISAM AUTO_INCREMENT=221965 DEFAULT CHARSET=latin1
CREATE TABLE `pmtext` (
`pmtextid` int(10) unsigned NOT NULL AUTO_INCREMENT,
`fromuserid` int(10) unsigned NOT NULL DEFAULT '0',
`fromusername` varchar(100) NOT NULL DEFAULT '',
`title` varchar(250) NOT NULL DEFAULT '',
`message` mediumtext,
`touserarray` mediumtext,
`iconid` smallint(5) unsigned NOT NULL DEFAULT '0',
`dateline` int(10) unsigned NOT NULL DEFAULT '0',
`showsignature` smallint(5) unsigned NOT NULL DEFAULT '0',
`allowsmilie` smallint(5) unsigned NOT NULL DEFAULT '1',
`reportthreadid` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`pmtextid`),
KEY `fromuserid` (`fromuserid`,`dateline`),
KEY `fromuserid_only` (`fromuserid`),
KEY `fromuserid_only2` (`fromuserid`)
) ENGINE=MyISAM AUTO_INCREMENT=118470 DEFAULT CHARSET=latin1
I think the reason why GROUP BY is causing such as increase in processing time is because of the LIMIT. When there's no GROUP BY the DB engine can stop processing rows in the query once it has found 50 that matches your criteria. With the GROUP BY clause though the entire table needs to be processed, grouped together and then the 50 first results will be returned. As for solution, would you get the correct result if you removed the GROUP BY and added "AND pm.parentpmid = 0" to the WHERE-clause? It seems the GROUP BY clause is there to remove rows with a parent from the result which is more efficiently done using WHERE (assuming all rows with a parent also has the parent present among the results).