SQL Database Backend
Introduction
MediaGoblin currently uses MongoDB as its database backend. There have been various reasons for this decision (see: Why MongoDB; yes really, look at it, it has a lot of insight into the problems with SQL). Still, the idea of an SQL backend comes up from time to time.
...
The existence of this wiki page is not evidence that the SQL backend will be developed soon!
People
Even an experimental implementation of an SQL backend that only supports basic features will need a good amount of time/energy from some (core) developers. That time/energy will not be available to other projects.
- Someone who knows SQLAlchemy is needed.
- Someone with a good understanding of the thin db layer is needed (Elrond will most likely be available to help).
- Someone familiar with the current queries used in the code is needed too.
Development models
- The biggest question is: Do we want to try an experimental SQL backend? This is a tricky question. If it fails, it will be a lot of wasted time/energy, and contributors wouldn't be happy to have wasted their time/energy.
- The wasted time/energy could be limited by limiting the first target. For example, one could decide that the SQL backend is not intended to support the full feature set.
- Then there is the question of how to develop this:
- Long term branch: Develop an SQL replacement in a long term branch. The good: Anything can be changed as needed. Not so good: Some people dislike long term branches for various reasons.
- Trying to develop this alongside the current backend, in tree.
Major Issues
The current db model uses dicts to allow flexible storage for extensions and different media types. Representing this data in SQL is not simple. The dicts are normally keyed by a string and have arbitrary values. There are basically two ways:
- Have a table with a key column (string) and a value column (string, JSON-encoded). This allows easy reconstruction of the dict for an object fetch and allows queries on the key of the dict.
- However, it will not allow queries on the values. For example, the imaginary geolocation extension would store lat/lon there and might want to search for other media "nearby".
- If the value column is split into three columns (i.e. value_int, value_string, value_json), some queries, such as those on numbers, would be possible.
- Note: The table would actually look something like this:
create table super_data_table(media_id references media(id), plugin_id int, key, value_int, value_string);
... note that this design was anticipated, was considered gross by Chris Webber, and was meant to be avoided, hence the choice of MongoDB. See Why MongoDB for details.
- Actually structure the data into distinct tables. The above-mentioned imaginary geolocation extension would have its own table with lat/lon fields and would be able to perform efficient queries on it (see the sketch right after this list).
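A minimal sketch of this second approach, assuming the imaginary geolocation extension; the table and column names (plugin_geolocation, lat, lon) are made up for illustration and are not part of any actual MediaGoblin schema:

-- Hypothetical dedicated table for the imaginary geolocation extension
create table plugin_geolocation(
    media_id integer references media(id) not null,
    lat float not null,
    lon float not null
);

-- A "nearby" search becomes an ordinary, indexable range query:
create index geolocation_lat_lon on plugin_geolocation(lat, lon);

select media_id
  from plugin_geolocation
 where lat between 52.3 and 52.7
   and lon between 13.2 and 13.6;

With the generic key/value table from the first approach, this kind of range query would only work if the value column were split into suitably typed columns.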
Pros and Cons
MongoDB
Pros
- Allows you to lazily and flexibly store stuff like information for multiple media types, plugin data, etc.
- Supposedly scales up really well!
- Programming interface feels clean and pythonic.
- Migrations + flexible stuff might be a bit easier
- Shows a lot of promise
- Seems to be pretty fast if you have the right resources and have carefully tuned your database
- We've already built our tooling around it!
Cons
- Much, much more resource-hungry on low- and middle-scale deployments than SQL would be
- Seems like you have to really tune your server around the database
- Have to think super carefully about indexes and (mostly) construct one index per common query. (That's a bit misleading, but pretty close to the truth.) Furthermore, one feels wary of creating more indexes because memory mapping means each index (and so each query) sucks up even more RAM
- "On the fly" queries not so easy, could be much more expensive than
- No joins
- How much do we really care about scaling up on the database layer anyway?
- DeviantArt, at Libre Graphics Meeting 2011, said that SQL + memcached works just fine for them and that they didn't think more complex things were necessary. That's a huge install base, and it seems hard to believe that MediaGoblin sites will reach that scale
- We're much more likely to run into media scaling issues before database scaling issues
- In an ideal federated world, scaling down might be more important than scaling up because everyone will be running their own servers
- In the end, the flexibility doesn't seem to be worth much: because of indexing considerations, you can't really do arbitrary queries on stuff you dump in if you want any sort of speed whatsoever
- Though it looks promising and solutions to issues keep coming up fast and landing in the core software, it's also very new and not as well established
- It's not like migrations entirely disappear, but they're probably easier with certain flexible things
- Every few weeks someone brings up an SQL backend ;)
SQL(alchemy)
Pros
- Scales down and up pretty well (possibly not as high up as MongoDB but again if it's high enough for DeviantArt...) and scaling down really does matter in our case
- Developers are generally pretty used to it
- SQLAlchemy has a really strong and established codebase
- Arbitrary, unplanned queries!
- Flexible schemas are still possible, but in somewhat different ways (not as nice and direct, and certainly nowhere near as loosely flexible)
- Extremely well established.
Cons
- Would mean a *lot* of rewriting!
- Would mean having to write a careful migration path from MongoDB to SQL!
- Not as directly flexible as MongoDB (though we can design cleverly for a certain type of flexibility)
- Migrations involving certain types of database flexibility could break in ways that are really irritating for users (and for the people helping those users)
- Working on this move could slow down other work or be hard to coordinate in parallel with other development
- Probably does not scale up quite as high (again, it'd have to be "much higher than DeviantArt", which sounds like an unlikely problem, if one we'd like to have)
Some sketchup in SQL
create table users(
    id integer primary key not null,
    username varchar(30) unique not null,
    email text unique,
    created timestamp default CURRENT_TIMESTAMP,
    pw_hash text,
    email_verified boolean default FALSE,
    status varchar(30) default 'needs_email_verification',
    verification_key uuid,
    is_admin boolean default FALSE,
    url text,
    bio text,
    fp_verification_key uuid,
    fp_token_expire timestamp
);

create table file_records(
    id integer primary key not null,
    filename text array  -- or json?
);

create table media_entries(
    id integer primary key not null,
    uploader integer references users(id) not null,
    title text,
    slug text,
    created timestamp default CURRENT_TIMESTAMP,
    description text,
    state text default 'unprocessed',
    queued_media_file integer references file_records(id),
    queued_task_id text,
    fail_error text,
    fail_metadata text  -- json
);

create table media_files(
    id integer primary key not null,
    media_id integer references media_entries(id) not null,
    name varchar(30) not null,  -- "original", "thumbnail", ...
                                -- maybe also "attachment"-something?
    file integer references file_records(id) not null
);

create table media_comments(
    id integer primary key not null,
    media_entry integer references media_entries(id) not null,
    author integer references users(id) not null,
    created timestamp default CURRENT_TIMESTAMP,
    content text
);
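As a quick illustration of the "arbitrary, unplanned queries" point from the pros above, here is a hypothetical ad-hoc query against this sketch (the slug value is made up):

-- All comments on one media entry, together with the commenters' usernames,
-- fetched in a single join query (something MongoDB's lack of joins makes awkward):
select m.title, u.username, c.created, c.content
  from media_comments c
  join media_entries m on c.media_entry = m.id
  join users u on c.author = u.id
 where m.slug = 'some-example-slug'
 order by c.created;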