SQL Database Backend: Difference between revisions

From GNU MediaGoblin Wiki
Jump to navigation Jump to search
(Elrond's comments)
Line 76: Line 76:
* Working on this move could slow down other work or be hard to coordinate in parallell with other development
* Working on this move could slow down other work or be hard to coordinate in parallell with other development
* Probably does not scale up quite as high (again, it'd have to be "much higher than deviantart", which sounds like an unlikely problem we'd like to have)
* Probably does not scale up quite as high (again, it'd have to be "much higher than deviantart", which sounds like an unlikely problem we'd like to have)
* Rewriting could possibly hurt our momentum in a serious way, especially if there's a feature freeze during the transition


== Statements by some developers ==
== Statements by some developers ==

Revision as of 22:36, 16 November 2011

Introduction

MediaGoblin currently uses MongoDB as its database backend. There have been various reasons for this decision (see: Why MongoDB, yes really, look at it. It has a lot of insight what are the problems with sql). Still, the idea of an SQL backend is coming up from time to time.

...

The existence of this wiki page does not give any evidence of the SQL backend getting developed soon!

People

Even an experimental implementation of an SQL backend, that will only support basic features will need a good amount of time/energy from some (core) developers. That time/energy will not be available to other projects.

  • Some knowing sqlalchemy is needed.
  • Someone having a good idea on the thin db layer is needed (Elrond will most likely be available for helping)
  • Someone having an idea on the current queries used in the code is needed too.

Development models

  • The biggest question is: Do we want to try an experimental sql backend? This is a tricky question. If it fails, it will be a lot of wasted time/energy. And contributors wouldn't be happy to have wasted their time/energy.
  • The wasted time/energy could be limited by limiting the first target. For example one could dedide, that the sql backend is not intended to support the full feature set.
  • Then there is the question on how to develop this:
    1. Long term branch: Develop an SQL replacement in a long term branch. The good: Anything can be changed as needed. Not so good: Some people dislike long term branches for various reasons.
    2. Trying to develop this alongside the current backend, in tree.

Major Issues

The current db model uses dicts to allow flexible storage for extensions and different media types. Representing these data in SQL is not simple. The dicts are normally keyed by a string and have arbitrary values. There are basically two ways:

  1. Have a table with a key column (string) and a value column (string, json encoded). This will allow easy reconstruction of the dict for an object fetch and will allow queries on the key of the dict.
    However, it will not allow queries on the values. For example the imaginary geolocation extension will store the lat/lon there and might want to search for other media "nearby".
    If the value column is split into three columns (i.e. value_int value_string value_json), some queries, such as those on numbers, would be possible. .
    Note: The table would actually look something like this: create table super_data_table(media_id references media(id), plugin_id int, key, value_int, value_string); ... note that this design was predicted, determined by Chris Webber to be gross, and was aimed to be avoided, hence the choice of MongoDB. See Why MongoDB for details.
  2. Actually structure the things into distinct tables. The above mentioned imaginary geolocation extension would have its own table with lat/lon fields and be able to perform efficient queries on it.

Pros and Cons

MongoDB

Pros

  • Allows you to lazily flexibly store stuff like multiple media type information, plugin data, etc
  • Supposedly scales up really well!
  • Programming interface feels clean and pythonic.
  • Migrations + flexible stuff might be a bit easier
  • Shows a lot of promise
  • Seems to be pretty fast if you have the right resources and carefully tuned your database
  • We've already built our tooling around it!

Cons

  • Much, much more resource dependent on low and middle scale deployments than SQL would be
  • Seems like you have to really tune your server around the database
  • Have to think super carefully about indexes and (mostly) construct one index per common query. (That's a bit misleading, but pretty close to the truth.) Furthermore, one feels wary of creating more indexes because memory mapping means each index (and so each query) sucks up even more RAM. (At scale you also have to think about indexes for relational databases.)
  • "On the fly" queries not so easy, could be much more expensive than
  • No joins (normalized data on insertion to obviate the need for joins)
  • How much do we really care about scaling up on the database layer anyway? (What's the expected write-load of media-goblin.)
    • DeviantArt at Libre Graphics Meeting 2011 expressed that SQL + memcached works just fine for them and didn't think more complex things were necessary. That's a huge install base, and it seems hard to believe that MediaGoblin sites will hit that scale level. (How big is/are their SQL box/en? With automatic-sharding, MognoDB might be able to give better write performance per-dollar than SQL solutions of the same size.)
    • We're much more likely to run into media scaling issues before database scaling issues
    • In an ideal federated world, scaling down might be more important than scaling up because everyone will be running their own servers
  • In the end, flexibility doesn't seem to be worth much because you can't really do arbitrary queries on stuff you dump in if you want it to have any sort of speed whatsoever because of indexing considerations. (Good SQL schema design, and performance limitations of joins presents similar limitations.)
  • Though promising looking and solutions to issues keep coming up fast and hitting the core software it's also very new and not as well established.
  • It's not like migrations entirely disappeared, but they're probably easier with certain flexible things
  • Every few weeks someone brings up an SQL backend ;)

SQL(alchemy)

Pros

  • Scales down and up pretty well (possibly not as high up as MongoDB but again if it's high enough for DeviantArt...) and scaling down really does matter in our case
  • Developers are generally pretty used to it
  • SQLAlchemy has a really strong and established codebase
  • Arbitrary, unplanned queries!
  • Flexible schemas still possible but in some different ways (but not as nicely directly and certainly in no way as loosely flexible)
  • Extremely well established.

Cons

  • Would mean a *lot* of rewriting!
  • Would mean having to write a careful migration path from mongodb->SQL!
  • Not as directly flexible as MongoDB (though we can design cleverly for a certain type of flexibility)
  • Migrations with certain types of database flexibility could break in really irritating for users (and people helping the users) ways
  • Working on this move could slow down other work or be hard to coordinate in parallell with other development
  • Probably does not scale up quite as high (again, it'd have to be "much higher than deviantart", which sounds like an unlikely problem we'd like to have)
  • Rewriting could possibly hurt our momentum in a serious way, especially if there's a feature freeze during the transition

Statements by some developers

Chris Webber's weigh-in

So I think several things about the whole possible move to SQL.

First of all, after having written out the Pros & Cons of each, it seems like maybe MongoDB is a lot of extra complexity and not gains in the areas I expected it to be. The supposed win wasn't scaling (and as predicted, scaling down has been something we've had to work around carefully), it was flexibility. Does MongoDB allow for extra flexibility? (And we *do* need flexibility for MediaGoblin's design.) Yes in the sense that you can dump in whatever, but if you intend to query on any of those attributes, it's mostly no. Indexes are expensive, and we have to spend a lot of time carefully pussyfooting around them.

While planning MediaGoblin, I knew that there were two patterns for making things flexible in SQL... one of them is the table with "key, value, type". I thought that was unacceptably gross, and still do think so.

The other option is that you have a "main" table (like MediaEntry), it references what "type" it is (such as "video", "image", whatever), and external tables for the extra information for that type point to the MediaEntry via a foreignkey and provide whatever media type specific data. Similarly, use external tables for plugins. The main reason I didn't want to deal with this is because I imagined migrations becoming a convoluted mess. It wasn't that we *wouldn't need* migrations in MongoDB, it's that maybe migrations would be less nightmarish with extensible stuff involved. In retrospect this was pretty reactive to a number of frustrating times I've had to try and walk people through broken migrations. But I'm getting the sense that I'll have to walk people through database complexity or breakage as much or more in MongoDB, and the complexity of managing indexes for any sort of extensibility is as bad or worse than dealing with migrations with an extensible SQL setup.

So, okay. I think I just made a pretty compelling case for moving back to SQL. So what then?

Two options have been proposed: try and support both SQL and MongoDB at the same time, or create a branch that switches from MongoDB to SQL. I'm afraid that the former just doesn't sound like a good idea to me at all... it seems like it'll result in a system with a massively bloated codebase, hard for new contributors to work with, hard to maintain, and "worst of both worlds" types compromises. Think about this for a second: how do you map things like migrations, indexing, etc over? Do we really want to completely rewrite the MongoDB query tools over to SQL? Some people have said that "it looks like we have a pretty simple use of MongoDB so this layer won't be so complex." To me that sounds like classic hacker "well that can't be so hard" underestimation of the complexity of the problem. Anyway, I already see a ton of complexities, and I'm sure there are more I haven't even been able to see.

So the remaining option is to do a branch to switch from MongoDB -> SQL. There's some risk of this also... it's hard to maintain a big overhaul branch while the mainline is constantly changing. There's also a risk of fracturing, and if we change our mind and stay with MongoDB, there's even a risk of forking! Not to mention that working on a branch that's so huge that doesn't get pulled in is incredibly demoralizing.

But we can reduce all those risks if we can come to a *consensus* that this is what we want to do. So I propose at the next meeting we discuss this and try to make sure we're at community consensus before agreeing to move to SQL (if that's indeed what we intend to do) and if we do so, move to SQL *entirely*.

Here's what I envision the path to that future will look like:

  • Create a branch that prototypes all the models being switched over to SQLAlchemy, included "multiple media types" implemented with a friendly API.
  • Assuming that works nice, continue work in that branch to switch all code over to using SQL.
  • Figure out how to do migrations nicely in SQL, including with multiple media types (I have some thoughts on how to do this "nicely")
  • Create a MongoDB->SQL migration tool

Anyway, let's discuss this at next meeting!

Elrond's comments

I'm quite with Chris Webber here.

One thing, I have to add/where I'm thinking a bit different: We should try to identify tasks in the sql-migration that can be done just "now". What does that mean? Moving to sql(alchemy) means a lot of changes. And some of these changes can be done with the mongodb backend as well. For example we're currently changing the document attribute access over from doc["field"] to doc.field. This does not hurt the current code (mostly) and makes the remaining "move to sql(alchemy)" easier. This reduces the amount of work happening in a long term branch. And that's the critical work (can be dropped, is demotivating, ...). So moving as much work as possible before actually starting the sql branch will increase motivation and decrease chances of failure. This sounds like more work, because the same code possibly needs to be touched twice. This might be true. It might also mean more work, as we might need to create some extra support code to assist in moving to a new idea. I still think, this is better, because it's outside of this long lived branch.

Some sketchup in SQL

create table users(
   id integer primary key not null,
   username varchar(30) unique not null,
   email text unique,
   created timestamp default CURRENT_TIMESTAMP,
   pw_hash text,
   email_verified boolean default FALSE,
   status varchar(30) default 'needs_email_verification',
   verification_key uuid,
   is_admin boolean default FALSE,
   url text,
   bio text,
   fp_verification_key uuid,
   fp_token_expire timestamp
   );

create table file_records(
   id integer primary key not null,
   filename text array -- or json?
   );

create table media_entries(
   id integer primary key not null,
   uploader integer references users(id) not null,
   title text,
   slug text,
   created timestamp default CURRENT_TIMESTAMP,
   description text,
   state text default 'unprocessed',
   queued_media_file integer references file_records(id),
   queued_task_id text,
   fail_error text,
   fail_metadata text -- json
   );

create table media_files(
   id integer primary key not null,
   media_id integer references media_entries(id) not null,
   name varchar(30) not null,
      -- "original", "thumbnail", ...
      -- maybe also "attachment"-something?
   file integer references file_records(id) not null
   );

create table media_comments(
   id integer primary key not null,
   media_entry integer references media_entries(id) not null,
   author integer references users(id) not null,
   created timestamp default CURRENT_TIMESTAMP,
   content text
   );