MySQL Resources



Dan Grossman
03-23-2006, 09:17 PM
I'm looking for any good websites about writing and optimizing queries -- beyond the basics. Any recommendations?

Todd W
03-24-2006, 01:51 AM
Hey Dan, what do you consider "the basics"?

I would assume you mean using EXPLAIN etc. as the basics? Not too sure what you know/have done... post up what you have done.

HUGE improvements come from properly indexed databases, separate tables, etc.

I recently did a LOT of db optimizing... if you have specific questions let me know.

Dan Grossman
03-24-2006, 05:31 AM
Yes, something like SitePoint's article on adding indexes to a table is what I'd consider pretty basic.

How can I determine the proper memory settings in the MySQL configuration for a given server and application?

Is there any way to speed up queries grouped or sorted by aggregates that can't be indexed?

Is there any way to speed up queries grouped or sorted by variable length TEXT fields?

Perhaps there's an article that compares the different types of joins and subqueries for retrieving the same information, since there are many ways to accomplish the same thing with drastic performance differences.

I'm not sure exactly what I'm looking for... that magic document that explains how some sites can deal with millions of pieces of data much faster than the quarter second it takes to "select [basic columns] from [table] order by [basic column]" from one of my 10,000 row tables.

Chris
03-24-2006, 08:17 AM
If you're unfamiliar with the concept of normalization you should read this:

http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html

If you have a MySQL-heavy application, I would look to do the same types of optimizations that high-traffic forums have to do.
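
As a minimal sketch of what normalization buys you (table and column names here are made up for illustration): instead of repeating, say, a browser name on every hit row, the browsers get their own table and each hit stores a small integer key, which is cheaper to store, index, and group by.

CREATE TABLE browsers (
    ID INT PRIMARY KEY NOT NULL AUTO_INCREMENT,
    name VARCHAR(64) NOT NULL UNIQUE
);

CREATE TABLE hits (
    ID INT PRIMARY KEY NOT NULL AUTO_INCREMENT,
    page VARCHAR(255),
    browserID INT NOT NULL,    -- references browsers.ID instead of repeating the name
    INDEX (browserID)
);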

Dan Grossman
03-24-2006, 02:10 PM
Do you have any specific examples of those optimizations?

I believe the database schema I'm using is good. It's not completely normalized because of the huge performance hits it'd take to write 6 or 7 tables every time new data had to go in instead of one or two. There are some summary tables that are batch updated, but if I make a summary table for every type of report I want to build from the data I'll simply run out of disk space or processing time.

Take a simple two column table like this:

CREATE TABLE example (ID INT PRIMARY KEY NOT NULL AUTO_INCREMENT, name TEXT);

Say you want to count up how many rows there are for each name in that table and sort those counts. You'd do


SELECT COUNT(ID) AS count FROM example GROUP BY name ORDER BY count

The names can be arbitrarily long, and my own tests show that a fixed-length text index won't change the time it takes to perform the query, nor can you index the count at all. EXPLAIN is going to show something like this:



id  select_type  table    type  possible_keys  key   key_len  ref   rows   Extra
1   SIMPLE       example  ALL   NULL           NULL  NULL     NULL  10000  Using temporary; Using filesort

With as few as 10,000 rows you start taking tenths of a second instead of hundredths of a second to get the result set of this simple query on this simple table. Given you don't have the storage space or CPU time to keep counts in another table, how do you optimize this? How do you deal with a million rows like this? What settings would improve performance of this query?
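
Dan rules out keeping counts in another table above, but for reference, the batch-updated summary pattern that comes up later in this thread would look roughly like this for the example table (table and index names are made up; the expensive GROUP BY still runs, but once per batch instead of on every page view):

CREATE TABLE example_counts (
    name TEXT,
    count INT NOT NULL,
    INDEX (count)
);

-- Rebuild on a schedule (e.g. from cron), not on every request:
TRUNCATE TABLE example_counts;
INSERT INTO example_counts (name, count)
    SELECT name, COUNT(ID) FROM example GROUP BY name;

-- Reads then scan a much smaller, pre-aggregated table:
SELECT name, count FROM example_counts ORDER BY count DESC;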

Chris
03-24-2006, 02:26 PM
I think within SQL there is only so much you can do; the rest is server optimization/caching.

vBulletin has a whole area of their forums for server optimizations; I've found it to be a good place to find info on MySQL config edits that can improve performance.

Dan Grossman
03-24-2006, 02:27 PM
I'll check it out, thanks :)

Dan Grossman
03-24-2006, 03:15 PM
I went through every "server optimization request" thread on the first four pages of that forum and read all the responses given there for various servers. I applied what seemed most appropriate for the hardware I have.

There's no change in the processing time of queries. Those similar to the example above still take exactly the same time (and some of the more complex ones I'd love help with -- joins on subqueries and such -- still take 1-2 full seconds).

The effect of the advice they're handing out seems to be a large bias towards caching. A lot of memory is allocated to it. I can see the results in my tests, where many more queries, even complex ones with grouping and sorting on many rows, are being entirely cached, and even after viewing several pages are still in the cache.

So there's a benefit, but not the first time each query is run, which will still account for a large share of the CPU time this server spends.

I'll keep an eye on that forum in case any interesting tips come up.

AndyH
03-24-2006, 06:08 PM
Are you making these summary tables HEAP tables?

Using "USE INDEX()", "FORCE INDEX()" ?

Using InnoDB for tables which are read from/written to often - row-level locking instead of table locking?

Why make "name" TEXT in your example? VARCHAR would be better, and you could index it.

Dan Grossman
03-24-2006, 06:21 PM
Are you making these summary tables HEAP tables?
Batch-updated MyISAM tables.


Using "USE INDEX()", "FORCE INDEX()" ? There's really no index type that applies to TEXT fields of arbitrary length, is there?


Using InnoDB tables for tables which are read/written from often - row level locking instead of table locking?This wouldn't help this particular case. Every user has their own set of tables, so while many users doing these long queries is a problem I'm trying to solve, the queries aren't on the same tables so it's not a locking issue.


Why in your example make "name" TEXT? VARCHAR would be better and index it.
An example of the real problem I have. The length of the fields can be arbitrarily long -- full URLs (very long query strings), the titles of webpages, etc. Since I want to be able to provide links, I can't cut off the URLs using a fixed length field and index.

AndyH
03-24-2006, 06:35 PM
Maybe a FULLTEXT index? I don't use them so I'm not sure.

Why not make them batch-updated HEAP tables?

As for your example, there is only so much you can do with MySQL's configuration and query optimizing. Faster/better hardware, caching, and replication/clustering are how bigger sites deal with it.
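
For what it's worth, a HEAP version of such a summary table would just swap the table type (a sketch, with assumed names; note that HEAP tables cannot contain TEXT or BLOB columns, so arbitrarily long values would have to be truncated to a VARCHAR first):

CREATE TABLE example_counts_heap (
    name VARCHAR(255) NOT NULL,
    count INT NOT NULL
) TYPE=HEAP;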

Dan Grossman
03-24-2006, 06:38 PM
I tried it for the heck of it a few days ago, and the FULLTEXT index didn't improve things (it wasn't used, and wasn't listed as a possible index). I don't think they're meant for speeding up a group by type query.

You're suggesting for 8,000 users I keep 16,000 summary tables in memory? :)

If I have time next week I'm going to install MySQL 5 on one of my spare computers and check out performance of views, but I can't imagine selecting from a view would be faster than using the exact same query generating the view as a subquery.
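
For reference, that experiment would look something like this (a sketch with made-up names; MySQL 5 processes a view containing GROUP BY with its TEMPTABLE algorithm, so the grouping work still happens on every SELECT, which is why a view shouldn't beat the equivalent subquery):

CREATE VIEW name_counts AS
    SELECT name, COUNT(ID) AS count FROM example GROUP BY name;

SELECT * FROM name_counts ORDER BY count DESC;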

AndyH
03-24-2006, 06:44 PM
Try this: (with `name` as a FULLTEXT index)

SELECT COUNT(*) AS count, name FROM example FORCE INDEX(name) GROUP BY name ORDER BY count DESC;


You're suggesting for 8,000 users I keep 16,000 summary tables in memory?
16,000 tables? Can you give me an example of a summary table?

Dan Grossman
03-24-2006, 07:02 PM
Try this: (with `name` as a FULLTEXT index)


Showing rows 0 - 29 (2186 total, Query took 0.8125 sec)
SQL-query:
SELECT COUNT( * ) AS count, webpage
FROM _1_hits
FORCE INDEX ( ftIdx )
GROUP BY webpage
ORDER BY count DESC
LIMIT 0 , 30

That's the same amount of time it takes without the index; it wasn't used.


16,000 tables? Can you give me an example of a summary table?

This is a web stats service, like Google Analytics.

This is one table every user has:
(ID, date, uniqueVisitorCount, pageviewCount, returningVisitorCount, updatedTime)

Those counts are derived from the full log of hits to their site. Since there are millions to billions of hits recorded per day (every hit to every user's website!), I can't simply update every table like this on every hit. Summary tables also let me provide this type of "summary stat" for a much longer period than I keep the entire log around.
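
As a sketch, that per-user table would look roughly like this (the column types and the table-name prefix are assumptions, following the "_1_hits" naming above):

CREATE TABLE _1_summary (
    ID INT PRIMARY KEY NOT NULL AUTO_INCREMENT,
    date DATE NOT NULL,
    uniqueVisitorCount INT NOT NULL DEFAULT 0,
    pageviewCount INT NOT NULL DEFAULT 0,
    returningVisitorCount INT NOT NULL DEFAULT 0,
    updatedTime DATETIME NOT NULL,
    INDEX (date)
);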

AndyH
03-24-2006, 07:29 PM
I think you could do the summary tables better by only having one. Is this possible to do from your application? If you have a lot of tables, there will also be overhead from finding the correct table, for every query to the database.

The summary table will be large, but if you add a "MemberID" MEDIUMINT(8) UNSIGNED column and index it, it should be faster than thousands of separate tables, and a lot more organized.
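
In other words, a sketch of the suggestion, reusing the columns from Dan's per-user table:

CREATE TABLE summary (
    MemberID MEDIUMINT(8) UNSIGNED NOT NULL,
    date DATE NOT NULL,
    uniqueVisitorCount INT NOT NULL DEFAULT 0,
    pageviewCount INT NOT NULL DEFAULT 0,
    returningVisitorCount INT NOT NULL DEFAULT 0,
    INDEX (MemberID)
);

A composite index on (MemberID, date) would also cover per-member date-range queries.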

Dan Grossman
03-24-2006, 07:44 PM
What do you mean overhead of finding the correct table? You mean the length of time it takes for PHP to parse "$table = $userId . '_summary';" versus "$table = '_summary';"? I think I can spare the handful of clock cycles...

Why not use a single summary table?

1) It's going to get huge: number of days times number of users in size. In other words, after running the site for a year with 8,000 users (the current size), there'd be about 3 million rows. Large tables are more prone to crashing, and if this table crashes, the site breaks for every user instead of just one or a few. In case of irrecoverable corruption, the data loss hits every user instead of one or a few.

2) It's going to get huge. The time to simply seek to the desired rows in the table to pull results goes up.

3) It's going to get huge. The time to perform aggregates and do grouping (get all stats by month? by year?) will grow more than linearly for every user as time goes on. And even a single day into the site collecting statistics, it's going to take longer to query this table than if they had their own.

4) You haven't given any advantage to having a single table.

You can change my mind, none of the users know this new version I'm developing exists, but you've gotta give me a good reason :)

Chris
03-24-2006, 08:06 PM
All the reading I've done says that you shouldn't do what you're doing. It's called horizontal something, as I recall. It is why indexes exist, though.

You are right though about different tables providing some protection in case of a crash.

Dan Grossman
03-24-2006, 08:17 PM
Yes, completely normalizing the schema means every page hit would require separate inserts into no fewer than 20-something tables.

Even if I had separate tables for webpages, referrers, search engines, search keywords, browsers, operating systems, screen resolutions, screen color depths, uniques, pageviews, sessions, visitors... and joined them all on indexes... I would have the same performance questions I'm asking about, which have to do with grouping and sorting. The only difference is the fields I'm grouping and sorting by (webpages, referrers, search engine names, etc.) would be in separate tables instead of one or two, and to group/sort them I'd have to join all those 20-something tables together first.

The reading you've done and pointed me to is about reducing data duplication. That's the ultimate goal of normalization. It is not concerned with performance. Even the MySQL manual makes many references to using summary tables instead of fully normalizing, for performance:


If you need to collect statistics from large log tables, use summary tables instead of scanning the entire log table. Maintaining the summaries should be much faster than trying to calculate statistics “live.” Regenerating new summary tables from the logs when things change (depending on business decisions) is faster than changing the running application.

Normally, you should try to keep all data non-redundant (observing what is referred to in database theory as third normal form). However, there may be situations in which it can be advantageous to duplicate information or create summary tables to gain more speed.

Dan Grossman
03-24-2006, 08:30 PM
P.S. I bought Practical Web Database Design (http://www.amazon.com/gp/product/1590591941/102-8408545-5110523?v=glance&n=283155) in 2004, it's the book Rudy/r397 at SitePoint co-authored. That's where I learned about what you're talking about.

AndyH
03-24-2006, 08:49 PM
What do you mean overhead of finding the correct table? You mean the length of time it takes for PHP to parse "$table = $userId . '_summary';" versus "$table = '_summary';"? I think I can spare the handful of clock cycles...


7.4.9. Drawbacks to Creating Many Tables in the Same Database

If you have many MyISAM or ISAM tables in a database directory, open, close, and create operations are slow. If you execute SELECT statements on many different tables, there is a little overhead when the table cache is full, because for every table that has to be opened, another must be closed. You can reduce this overhead by making the table cache larger.


1) Sites have used much bigger tables. Do you mean a crash where all you need to do is "REPAIR TABLE `table`"? Back up your database...

2) That shouldn't happen with correct indexes and proper use of them.

3) It should be exactly the same. You have "MemberID" indexed, and when grouping you use "USE INDEX (MemberID) WHERE `MemberID` = '$memberID'" so it gets all rows for that member and then does the grouping - just like with separate tables (see the sketch after this list).

4) More organized, and probably (I haven't done the swap from how you are to what I am suggesting, so I can't say it will be) a lot faster because of the "Drawbacks to Creating Many Tables". I take that to be talking about 100-200 tables; when you are doing it with 16,000+, I think it definitely applies.
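
The query in point 3 would look something like this against the combined summary table sketched earlier (the member ID value is made up, and the index name assumes the default name MySQL gives an index on the MemberID column):

SELECT MONTH(date) AS month, SUM(pageviewCount) AS pageviews
FROM summary USE INDEX (MemberID)
WHERE MemberID = 42
GROUP BY month;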

Dan Grossman
03-24-2006, 08:57 PM
1) Sites have used much bigger tables. Do you mean a crash where all you need to do is "REPAIR TABLE `table`"? Back up your database...

Crashing and corruption are, in fact, my primary reason for not wanting to do this. This is how the current w3counter.com does it in a few places. In the version running right now, there is one table for every user containing their "log", and a couple of summary tables that contain userIDs. These tables spontaneously become corrupt if load gets high for some reason. When that happens, 8,000 counters break and 8,000 people can't view summary stats. Their individual log tables keep going, so the data can be rebuilt, but until I notice it and fix it, their counter images are broken and the reports are broken.

The same goes for my image host. There is one bandwidth tracking table that has several million rows in it. This is the table that becomes corrupt most often on the server, happening at least once a month. I won't notice until images hosted there stop showing up.

What would you do?

AndyH
03-24-2006, 09:06 PM
So how do you fix it? By running a REPAIR TABLE query, or by dropping the table, creating it again, and rebuilding the data?

Dan Grossman
03-24-2006, 09:06 PM
I checked after typing that post and look what I found:


[root@gold v3]# myisamchk --force /var/lib/mysql/counter_counter/totals.MYI
Checking MyISAM file: /var/lib/mysql/counter_counter/totals.MYI
Data records: 241428 Deleted blocks: 0
- check file-size
- check record delete-chain
- check key delete-chain
- check index reference
- check data record references index: 1
- check data record references index: 2
myisamchk: error: Found key at page 1613824 that points to record outside datafile
- recovering (with sort) MyISAM-table '/var/lib/mysql/counter_counter/totals.MYI'
Data records: 241428
- Fixing index 1
- Fixing index 2

I'm not exaggerating about problems with corruption. I have restarted mysql several times today but never with kill -9, always allowing it to restart properly.

AndyH
03-24-2006, 09:17 PM
I used to get that on my accounts table for one of my sites. I just do "REPAIR TABLE `accounts`" and it is all fixed. After the fifth time I just put it in my code to run that query automatically and haven't seen it since.

It just has to do with deleting and inserting a row at the same time, I believe.
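
The automated fix described here is just issuing the repair from application code when corruption is detected; in SQL terms (table name from the post above, CHECK TABLE being an optional first step):

CHECK TABLE `accounts`;    -- reports corruption, if any
REPAIR TABLE `accounts`;   -- rebuilds the MyISAM data/index files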

Dan Grossman
03-24-2006, 09:18 PM
It currently takes 11 minutes to run a repair on all the data for my counter site... I'm still thinking about it. I might ask for advice on the MySQL forums about this.

Dan Grossman
03-25-2006, 05:56 PM
I installed the Windows version of MySQL 4.1 on my desktop and it's running these queries 200% to 300% faster on the same exported data. It's an Athlon 64 3800+...

I installed it on my laptop as well, which has a P4 2.6 (non-mobile) and 512MB RAM. That's less than my server. It's executing the queries faster than my server too! Not by as much, but consistently faster.

Other sites aren't putting much load on the server. Not enough to triple query times. What could be wrong?

AndyH
03-25-2006, 07:36 PM
By checking out your thread on SP and this: http://v3.w3counter.com/my.cnf.txt it looks like your key_buffer size isn't big enough. Make your sort buffer 4M also and see how it goes.
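
In my.cnf form, those two edits would look something like this (a sketch; the key_buffer_size value is a placeholder to be sized against the combined size of your .MYI index files, and 4M is the sort buffer value suggested above):

[mysqld]
key_buffer_size = 128M    # MyISAM index cache; too small and key reads hit disk
sort_buffer_size = 4M     # per-connection buffer used for ORDER BY/GROUP BY filesorts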

Dan Grossman
03-25-2006, 07:44 PM
I've experimented with larger sizes before (including those exact values); there's no gain there. I ended up allocating more memory to caching instead, as the vB optimization forums suggest in their sample config files.

AndyH
03-25-2006, 07:48 PM
May be helpful:
http://www.databasejournal.com/features/mysql/article.php/10897_1402311_3

Even try:
http://hackmysql.com/mysqlreport - I am going to but haven't yet

Dan Grossman
03-25-2006, 07:52 PM
Andy,

All three computers are using the same configuration file. Changing variables is not going to explain why slower hardware is performing the same query with the same configuration faster.

There's something at the root of the performance deficiency that is outside the realm of MySQL's configuration. Perhaps it's due to how the RPM was built, or certain kernel versions, or P4 HyperThreading... that's why I posted the more detailed inquiry at the MySQL forum and WHT.

AndyH
03-25-2006, 07:55 PM
Yeah that is weird and I have no idea what could be doing that.

Dan Grossman
03-25-2006, 07:56 PM
Right now it's my #1 concern. My desktop is doing queries in 0.5 seconds that sometimes take 2.5 seconds on the server. That's such a huge gap that if I can bridge it, a lot of my other performance worries go away.

It might actually be a good time to buy an entirely new server and just move the sites. Most of my sites, the important ones, are on a different server, so there's not much hassle.

Todd W
03-25-2006, 09:24 PM
Dan - since it sounds like you know a lot about what you are doing on the query side and have messed around with the MySQL conf file yourself already... might I suggest hiring a MySQL wiz to tweak your my.cnf for you?

You will need to provide all the db info such as your mysql optimization screen, etc.

An image hosting site I ran a while ago was doing ~5 million image hits per day, and I was tracking each one with MySQL and PHP. When I was doing well under 1 million, my server overloaded (Dual Xeon, 2GB, etc.). I had a guy tweak my MySQL configuration file as WELL AS my httpd config file; after that, no problems up to the 5 million when I sold the site.

There's only so much you can do with correct db structure and query syntax; for the rest, you need to make sure your server is configured properly to handle the number of requests.

Mal
03-25-2006, 11:30 PM
Is the MySQL on a separate server? Or is it sharing it with Apache etc? If it's sharing it, then that'll be the reason for the different performance. What CPU is the MySQL server using?

I've worked on some large databases, and in my experience, all but the most basic levels of normalisation should be chucked out the window in such circumstances. Having separate tables is good - especially if you have a lot of inserts/updates. A large table with a large index will need to update that index on every insert and possibly every update, which means it will die far sooner than the separate tables.

Dan Grossman
03-26-2006, 12:07 AM
My laptop with a slower processor and half the memory is doing these queries faster than my server, with httpd shut down. MySQL has the entire server to itself, and no queries running but my own test.

http://forums.mysql.com/read.php?24,78510,78510#msg-78510

Grant29
03-27-2006, 12:00 PM
Dan,

I've currently got a site with tables of 400,000+ rows. Some tables have millions of rows, and queries rarely take more than half a second. It sounds like something definitely needs to be tweaked. I struggled for a while with my app, but got lots of help on the MySQL general discussion list (http://lists.mysql.com/). Join it - I'm still subscribed, and I read the list when I have free time to pick up extra pointers.

I basically had to re-tweak my queries and MySQL configuration many times to get the best performance. I'm no expert, but try:

[NOTE: Some of these have already been tried by you, or suggested by others]

1. List your machine specs. (processor speed and RAM).
2. Post your mysql version and OS version (Linux, windows, ?).
3. Post your my.cnf file (or my.ini file on windows)
4. Post your create table syntax
5. How many rows are in the table? How big is the table in MB?
6. Try OPTIMIZING the table.
7. Adding too many indexes can be bad. I would try not to INDEX a VARCHAR or TEXT column. Integers take up less space and are quicker to sort. If you have to index a VARCHAR or TEXT, you can try indexing only the first X characters (INDEX(column_name(20)) will use a smaller index - see the sketch after this list).
8. Chris's mention of normalization can help a lot. Basically you are trying to remove all duplicate data when you normalize.
9. Use the EXPLAIN syntax to see how the SELECT query is using the INDEXes to see if it's optimizing like you think.
10. Unless you need row-level locking, I'd use the MyISAM table format. If you are using this, I'd check the key_buffer_size, query_cache_size, and tmp_table_size settings in the my.cnf file.
11. Is this a dedicated webserver, or are you running multiple sites off the same mysql database? It could be that other queries are running and slowing down your tests. During a slow query, I sometimes SSH into the box, log into mysql, and run SHOW PROCESSLIST; it will print out all the running queries and how long each has been running.
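
A sketch of the prefix-index idea from point 7, applied to the example table from earlier in the thread (the 20-character prefix length is arbitrary), together with the EXPLAIN check from point 9:

ALTER TABLE example ADD INDEX name_prefix (name(20));

-- Check whether the optimizer actually uses it (Dan reported earlier that
-- a fixed-length index didn't help his particular GROUP BY):
EXPLAIN SELECT COUNT(ID) AS count FROM example GROUP BY name ORDER BY count;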

Grant

Todd W
04-11-2006, 05:26 PM
I got a lot from the following articles:


This briefly describes normal forms 1-4
http://databases.about.com/od/specificproducts/a/normalization.htm

NF2 Sample/Help
http://databases.about.com/od/specificproducts/a/2nf.htm

NF3 Sample/Help
http://databases.about.com/od/administration/l/bldef_3nf.htm

This goes from 1NF --> 6NF (Well, kind of 6) with examples.
http://www.answers.com/topic/database-normalization