User:Cwebber/braindumps

From GNU MediaGoblin Wiki
Revision as of 16:44, 23 July 2011 by Cwebber (talk | contribs) (Added indexing docs)

Braindumps! Go, go, go!

Database stuff

Migrations

What are migrations?

Sometimes the way we store data changes. We might add a new field or deprecate an old field. Even though MongoDB allows us to store things very "flexibly", it's important that, as we change our schema (add new fields, remove fields, rename fields, restructure some data), the database is updated so things are stored in the form our codebase expects to access them in.

How to run

Pretty simple! Just run:

./bin/gmg migrate

How to add a new migration

Migrations are handled in:

mediagoblin/db/migrations.py

Migrations aren't too complex! Basically they're just Python functions that take a pymongo database as their sole argument.

Note that this is a pymongo database and NOT a mongokit database! The reason is that we don't want people to use the ORM, to avoid a chicken-and-egg paradox: our ORM might have tools that expect things to already be at our current schema.

So we force people to use the simple pymongo database API, which is itself not too hard!

Let's look at one from the unit tests:

@RegisterMigration(1, TEST_MIGRATION_REGISTRY)
def creature_add_magical_powers(database):
    """
    Add lists of magical powers.

    This defaults to [], an empty list.  Since we haven't declared any
    magical powers, all existing creatures should get this default.
    """
    database['creatures'].update(
        {'magical_powers': {'$exists': False}},
        {'$set': {'magical_powers': []}},
        multi=True)

This is a fairly simple example: we use the update command to find all documents that don't have a magical_powers field, and set that field to an empty list. For more on update commands in MongoDB, see:

http://www.mongodb.org/display/DOCS/Updating

You'll notice that the migration is preceded by a decorator called RegisterMigration. It takes two arguments: the number of this migration (increment the last migration's number by one) and the "migration registry" it will be stored in (probably MIGRATIONS). The decorator records your migration in the registry under that version number. That number is compared against the 'current_migration' value stored in your database's 'app_metadata' collection (in the document with '_id': 'mediagoblin'), so the migration runner knows which new migrations must be run.
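To make the registry mechanism concrete, here is a minimal sketch of how a decorator like RegisterMigration could record migrations by version number. This is illustrative only, not MediaGoblin's actual implementation; the example_migration name is made up.

```python
# Minimal sketch of a migration registry decorator.  The real
# RegisterMigration lives in the MediaGoblin codebase; this version
# only shows the registration pattern.

MIGRATIONS = {}  # maps migration number -> migration function


class RegisterMigration:
    """Decorator that records a migration in a registry by version."""
    def __init__(self, migration_number, migration_registry=MIGRATIONS):
        self.migration_number = migration_number
        self.migration_registry = migration_registry

    def __call__(self, migration):
        # Store the function under its version number, return it unchanged
        self.migration_registry[self.migration_number] = migration
        return migration


@RegisterMigration(1)
def example_migration(database):
    """A do-nothing migration, just to show registration."""
    pass
```

With this in place, a runner can compare the stored 'current_migration' value against the registry's keys and call each newer migration in order.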

Another example, from the real migrations in migrations.py:

@RegisterMigration(1)
def user_add_bio_html(database):
    """
    Users now have richtext bios via Markdown, reflect appropriately.
    """
    collection = database['users']

    target = collection.find(
        {'bio_html': {'$exists': False}})

    for document in target:
        document['bio_html'] = cleaned_markdown_conversion(
            document['bio'])
        collection.save(document)

This one is slightly more complicated, but still not too hard. It just takes the value of the 'bio' key, runs it through cleaned_markdown_conversion, and stores the result in 'bio_html'.

(Just one more note: if you're using the update method, please don't use the '$rename' modifier for now, as it isn't supported by the MongoDB versions shipped in most current stable distributions.)
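Since '$rename' is off the table, a rename can be done with the same iterate-and-save pattern as the bio_html migration above. The sketch below is illustrative only: FakeCollection is a stand-in for a pymongo collection so the example runs without a MongoDB server, and the field names are invented.

```python
# Hypothetical sketch: renaming a field without '$rename'.
# FakeCollection mimics just enough of a pymongo collection
# (find + save) for this example; it is NOT part of MediaGoblin.

class FakeCollection:
    def __init__(self, documents):
        self.documents = documents

    def find(self, query):
        # Only supports the {'field': {'$exists': True}} query below
        field = list(query)[0]
        return [doc for doc in self.documents if field in doc]

    def save(self, document):
        pass  # documents are mutated in place in this fake


def rename_old_field(collection):
    """Copy 'old_name' into 'new_name' and delete the old key."""
    for document in collection.find({'old_name': {'$exists': True}}):
        document['new_name'] = document.pop('old_name')
        collection.save(document)


creatures = FakeCollection([{'old_name': 'wolf'}, {'new_name': 'cat'}])
rename_old_field(creatures)
```

Against a real pymongo collection the loop body is the same; only the collection object changes.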

Indexing

Some of the following is extracted straight from mediagoblin/db/indexes.py

Running latest updates / deprecation of indexes

./bin/gmg migrate

Yes, this is the same as the migration command.

For developers

Overview

Quick summary on indexes generally:

  • Basically, indexes make querying fast. MongoDB doesn't auto-create indexes, though; we have to specify them.
  • Core things we're working on require indexes. Querying on multiple keys at once requires a compound index; MongoDB currently lacks an algorithm to combine multiple single-key indexes.
  • The ordering of keys in compound indexes matters.
  • When adding new queries, new fields, etc., discuss whether or not an index is appropriate! New indexes do carry a performance and memory penalty, but not using an index means a query slowness penalty.

For those touching indexes, you should read:


To add new indexes

Indexes are recorded in the following format:

ACTIVE_INDEXES = {
    'collection_name': {
        'identifier': {  # key identifier used for possibly deprecating later
            'index': [index_foo_goes_here]}}}

... with any other keys being parameters to the create_index function (including unique=True, etc.)

Current indexes must be registered in ACTIVE_INDEXES... deprecated indexes should be marked in DEPRECATED_INDEXES.

Remember, ordering of compound indexes MATTERS.
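As a concrete (but invented) illustration of the format above, an entry might look like the following. The collection name, identifier, and fields here are hypothetical; 1 and -1 correspond to pymongo.ASCENDING and pymongo.DESCENDING, and each index is an ordered list of (field, direction) pairs, since ordering matters for compound indexes.

```python
# Hypothetical ACTIVE_INDEXES entry -- names and fields are invented.
# The 'index' list is what gets passed to create_index(); any other
# keys (like 'unique') become keyword arguments to create_index().
ACTIVE_INDEXES = {
    'media_entries': {
        'uploader_created': {
            # compound index: uploader ascending, then created descending
            'index': [('uploader', 1), ('created', -1)],
            'unique': False}}}
```

Because the 'index' value is a list of pairs rather than a dict, the key ordering is preserved when the index is created.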


To remove deprecated indexes

Removing a deprecated index works the same way: just move the index's entry into the deprecated indexes mapping.

DEPRECATED_INDEXES = {
    'collection_name': {
        'deprecated_index_identifier1': {
            'index': [index_foo_goes_here]}}}

... etc.

If an identifier has been deprecated, it should NEVER BE USED AGAIN. E.g., if you previously had 'awesomepants_unique', don't reuse 'awesomepants_unique'; create a totally new name, or at worst use 'awesomepants_unique2'.

The reason is that the index name is how we track whether or not the index is installed; reusing a name makes that tracking unreliable. So just use a new name!