Falcon 存储引擎设计窥探

原文出自: http://www.mysqlperformanceblog.com/2007/01/12/falcon-storage-engine-des...

Falcon Storage Engine Design Review

Now as new MySQL Storage engine - Falcon is public I can write down my thought about its design, which I previously should have kept private as I partially got them while working for MySQL.

These thought base on my understanding, reading docs, speaking to Jim, Monty, Arjen and other people so I might miss something, something might be planned to be changed or not 100% correct but anyway you might find it interesting.

In many cases what I find good or bad would base of my MySQL use with existing applications - if you design new applications which are done specially for Falcon you might find those aspects positive.

[-] No Spinlocks Falcon does not use Spinlocks. It has its own lock implementation which does operation system wait if resource is unavailable. We’ll see where this leads us but I believe on multiple CPU systems you need some spinning done, at least for some types of locks where conflicts will be frequent. And now even laptops are getting multi cores so we can forget about spinlocks wasting CPU time on single CPU boxes without much of the gain. Spinlocks need to be wise though.

[+] Row Cache The fact Falcon has separate row cache is great as it means it can have much more data cached in the same cache size. You might need only single row from 8-16K page but page cache requires you to cache whole page, so needing 1000 of 100 bytes rows cached, which are all on different pages you need 100K of memory not 8-16M as you need with page cache. There are design techniques though to improve page cache efficiency in this respect, so in practice difference can be less.

[-] Not fixed cache size, but size range This is minor one. Unlike you would typically see Falcon uses minimum and maximum sizes for row cache instead of single size, this is complicated. If I have 8GB of RAM I can assume I want 4GB allocated to row cache and I do not really care about how these 4GB will be managed, to keep it the most efficient. Should I specify min size of 3GB or 3.8GB ? What is the difference ? Will purge run well too frequently if I keep them close. Manual does not say much on this matter at this point.

[-] Few configuration options This one is probably a bit contradicting to previous one but I put on my experimentator Hat here, not one of general user. I would like to see more configuration options to adjust database behavior to match my needs. I love this Zero administration thing and I would like database to manage things for me but it should first prove it can do it better than I do, and this can be only checked by comparing performance with your own optimized settings and one database engine would like to use itself. At this point I have not seen auto-configuration algorithms to be able to cover all kinds of workloads - they may deal well with something typical but not if you’re doing something unusual, which people do quite frequently with MySQL :)

[+] Compact Indexes Compact Indexes are great. Huge indexes is where large portion of Innodb performance problems come however. We should however see how “compact” they would be in real world cases and how much performance it will give. As “SHOW TABLE STATUS”is not showing index sized in the release we tested it is hard to see how small indexes really are.

[-] No “using index” Falcon always have to check the row data even if you retrieve only columns which are in the index. I think this is big gotcha as having index covered queries is great optimization for wide range of queries.

[+] Ordered data reads Unlike other storage engines which will read row data as they traverse the index Falcon can optimize it by reading index first (possibly many indexes) and only then reading row data in sorted order. This can help a lot if you traverse significant portion of the table (or data which is locally stored but in random order). In my opinion this however should have been implemented outside of MySQL Storage Engine level for MySQL - reading rows in their physical order can be applied to most storage engines, it certainly can be applied for MyISAM and Innodb. I remember there were plans to implement such handler interface but I’m not sure there they are now.

[-] No Clustering Index support Unlike Innodb Falcon does not cluster data by Primary Key or any other key. This is good for some applications bad for others. There is large amount of applications out where which are optimized for Innodb which does have clustering, this is why I have it with Minus sign. In my opinion Optional clustering is best, in fact this is what you can get in Innodb (by using its internal hidden primary key) but this functionality is not exported well as it requires you to get rid of primary keys etc. Some presentations say Falcon does not need clustering because of optimized read by the index. This is not the
case. If you access small primary key ranges with Innodb you will need to perform just couple of page reads, with Falcon if rows were inserted in different times a lot of scattered reads will be needed. Think about typical use of Innodb clustering - users mailbox, clustered by (user_id,message_id).

[+] Row compression Falcon uses some nice fast row compression methods, for example using only as many bytes for integer as it requires, not storing column value if it is default etc. This is nice even though may show strange effects if you do not know about this feature, for example changing default value needs table rebuild even though for other storage engines it often could be done simply by changing meta data (it is not implemented in MySQL anyway though). Very interesting to see how this one will compare to transparent gzip page compression which is being implemented for Innodb right now.

[-] Tablespace per database You may notice right now each database in Falcon gets its own tablespace and its own set of logs. With typical for MySQL use of Databases just as directories it can be the problem - if transaction spawns multiple databases you need multiple log flushes on commit, plus ether you use XA to synchronize these which is expensive or it will be possible for your transaction changes to commit in one database and not in the other. The other issue you will see is having consistent snapshots being consistent for database only, so if you have started transaction accessed table in database A and a while after accessed table in database B, you will see records in B which were modified/inserted after you started transaction.

[-] Isolation Modes So far limited amount of isolation modes is supported, including no support for SELECT FOR UPDATE. This is also why Falcon has Optimistic locking concurrency which can give you problems if you have lock intensive workload so many update queries have to wait for row level locks. With Falcon at this point situation is rather funny, it will still wait for other transaction which updated the row to commit and then fail with “ERROR 1020 (HY000): Record has changed since last read in table ‘t’” error message. If you find Falcon or Innodb behavior better still prepare for this to be area where things are different and your application may need to be changed. This is also area where detailed documentation is missing so far.

[-] No protection from partial page writes This means if single page write was not atomic, so only part of page was changed while other part remains old your database may end up in non-recoverable stage. This is why Innodb has “innodb double write buffer”. Jim does not believe this is the problem but I remember having problem not the once and twice before Innodb Double Write Buffer was implemented and seeing some pages recovered from it after it was implemented. Speaking with Oracle and PostgreSQL developers it looks like they all acknowledge this problem and have techniques to deal with it, so I’m a bit surprised why Jim does not believe in it.

[+/-] Only committed changes go to the disk Innodb modifies data in the database when you change it, Falcon only when it is committed. This has some good sides such as no problem with very long rollbacks on recovery but also means you should have enough memory to hold all transaction changes. I agree most transactions are small and should be fine but if you have long running transactions, for example batch jobs you may end up in trouble or be forced to rewrite them to commit often.

[+] Blob handling optimizations Jim loves BLOBs as his own child, so Falcon is optimized in Blob handling for example having direct blob writes to the database and probably some others. This would be very interesting to see how these optimizations will show themselves in practice.

The other interesting point which I have not found much information about is regarding handling fragmentation - what happens on row updates, are they stored in the same position or in another location ? Does row ever split into multiple parts like with MyISAM or always stored as single piece ? These are going to be pretty important for IO bound workloads.