GNU MediaGoblin Wiki - User contributions [en]

Storage

2014-07-15T02:20:05Z

Shackra: /* The guts of StorageInterface and friends */

Being a media publishing platform, storage is a big deal in MediaGoblin. As such there are a few systems that are storage-related that you may encounter while doing some MediaGoblin hacking.

MediaGoblin also comes with an extensible storage interface and several implementations mapping to it: basic local file storage, OpenStack "swift" style storage... and a few more plus the ability to write your own.

= The storage systems attached to your app =

== Dynamic content: queue_store and public_store ==

Two instances of the StorageInterface come attached to your app. These are:

* '''queue_store:''' When a user submits a fresh piece of media for their gallery, before the [[Processing]] stage, that piece of media sits here in the queue_store. (It's possible that we'll rename this to "private_store" and start storing more non-publicly-stored stuff in the future...). This is a StorageInterface implementation instance. Visitors to your site probably cannot see it... it isn't designed to be seen, anyway.
* '''public_store:''' After your media goes through processing it gets moved to the public store. This is also a StorageInterface implementation, and is for stuff that's intended to be seen by site visitors.

== The workbench ==

In addition, there's a "workbench" used during processing. It holds temporary files, and also local copies of files that might live on remote storage interfaces while they are being moved and converted from the queue_store to the public_store. See the workbench module documentation for more.

== Static assets / staticdirect ==

On top of all that, there is some static media that comes bundled with your application. This stuff is kept in:

mediagoblin/static/

These files are MediaGoblin's base assets: CSS files, logos, and so on. You can mount them at whatever location suits you (see the direct_remote_path option in the config file). Because the static assets might be served from http://static.mgoblin.example.org/ while the site itself lives at http://mgoblin.example.org/, you need a way to reference static files that doesn't depend on where they are mounted. For that there's a "staticdirector" attached to the request object. It's pretty easy to use; just look at this bit taken from the mediagoblin/templates/mediagoblin/base.html main template:

<link rel="stylesheet" type="text/css"
href="{{ request.staticdirect('/css/extlib/text.css') }}"/>

See? Not too hard. If you configured direct_remote_path to be http://static.mgoblin.example.org/, you'll get back http://static.mgoblin.example.org/css/extlib/text.css.
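As a rough sketch of the idea (not MediaGoblin's actual implementation), a static director can be thought of as a small callable that joins a configured base URL with a site-relative path. The class name <tt>RemoteStaticDirect</tt> and its constructor argument are assumptions made for this illustration:

```python
# Hypothetical stand-in for the static director; not MediaGoblin's actual class.
class RemoteStaticDirect:
    """Map site-relative static paths onto a configured base URL."""

    def __init__(self, remote_path):
        # Strip trailing slashes so joining never produces a double slash.
        self.remote_path = remote_path.rstrip('/')

    def __call__(self, filepath):
        # Callable, so a template can write staticdirect('/css/...').
        return '%s/%s' % (self.remote_path, filepath.lstrip('/'))


staticdirect = RemoteStaticDirect('http://static.mgoblin.example.org/')
print(staticdirect('/css/extlib/text.css'))
```

Templates only ever name the site-relative path, so moving the static files to another host means changing one config value rather than every template.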

= StorageInterface and implementations =

== The guts of StorageInterface and friends ==

So, the StorageInterface!

So, the public and queue stores both use StorageInterface implementations... but what does that mean? It's not too hard.

Open up: mediagoblin/storage/__init__.py

In here you'll see a couple of things. First of all, there's the StorageInterface class. What you'll see is that this is just a very simple Python class. A few of the methods actually implement things, but for the most part, they don't. What really matters about this class is the ''docstrings''. Each expected method is documented as to how it should be constructed. Want to make a new StorageInterface? Simply subclass it. Want to know how to use the methods of your storage system? Read these docs; they apply to all implementations.

There are a couple of implementations of these classes bundled in storage.py as well. The simplest of these is BasicFileStorage, which is also the default storage system used. As expected, this stores files locally on your machine.

There's also a CloudFileStorage system. This provides a mapping to [http://swift.openstack.org/ OpenStack's swift] storage system (used by RackSpace Cloud Files, among others).

Between these two examples you should be able to get a pretty good idea of how to write your own storage systems, for storing data across your beowulf cluster of radioactive monkey brains, whatever.

== Writing code to store stuff ==

So what does coding for StorageInterface implementations actually look like? It's pretty simple, really. For one thing, the design is fairly inspired by [https://docs.djangoproject.com/en/dev/ref/files/storage/ Django's file storage API]... with some differences.

Basically, you access files on "file paths", which aren't exactly like unix file paths, but are close. If you wanted to store a file on a path like dir1/dir2/filename.jpg you'd actually write that file path like:

['dir1', 'dir2', 'filename.jpg']

This way we can be ''sure'' that each component is actually a component of the path that's expected... we do some filename cleaning on each component.
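To illustrate why a list of components is easier to clean than a single string, here is a minimal sketch. The cleaning rule and the function names <tt>clean_component</tt> and <tt>clean_filepath</tt> are assumptions for this example, not the rules MediaGoblin actually applies:

```python
import re


def clean_component(component):
    # Hypothetical cleaning rule: keep letters, digits, dot, dash, underscore.
    cleaned = re.sub(r'[^A-Za-z0-9._-]', '_', component)
    # Refuse components that would be empty or step outside the tree.
    if cleaned in ('', '.', '..'):
        raise ValueError('invalid path component: %r' % component)
    return cleaned


def clean_filepath(filepath):
    # Each list element is cleaned on its own, so a stray "/" inside one
    # element can never introduce an extra directory level.
    return [clean_component(part) for part in filepath]


print(clean_filepath(['dir1', 'dir2', 'file name.jpg']))
print(clean_filepath(['dir1', '../etc', 'passwd']))
```

Because the separator is never part of the data, an element such as <tt>'../etc'</tt> is reduced to a harmless single name instead of being interpreted as a directory change.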

Your StorageInterface should pass in and out "file like objects". In other words, they should provide .read() and .write() at minimum, and probably also .seek() and .close().
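As a sketch of what "passing file-like objects in and out" can look like, here is a toy in-memory storage keyed by list-style paths. It deliberately does not subclass the real StorageInterface; the class names and the method names <tt>file_exists</tt> and <tt>get_file</tt> are chosen for illustration and should be checked against the docstrings mentioned above:

```python
import io


class _WritableFile(io.BytesIO):
    """File-like object that saves its contents back on close()."""

    def __init__(self, files, key):
        super().__init__()
        self._files = files
        self._key = key

    def close(self):
        # Store the bytes before the buffer is released.
        if not self.closed:
            self._files[self._key] = self.getvalue()
        super().close()


class InMemoryStorage:
    """Toy storage keyed by list-style file paths (illustration only)."""

    def __init__(self):
        self._files = {}

    def file_exists(self, filepath):
        return tuple(filepath) in self._files

    def get_file(self, filepath, mode='r'):
        key = tuple(filepath)
        if 'w' in mode:
            return _WritableFile(self._files, key)
        # Reading hands back a fresh file-like object over the stored bytes.
        return io.BytesIO(self._files[key])


storage = InMemoryStorage()
path = ['dir1', 'dir2', 'filename.jpg']

with storage.get_file(path, 'wb') as f:
    f.write(b'not really a jpeg')

with storage.get_file(path, 'rb') as f:
    data = f.read()

print(storage.file_exists(path), data)
```

The calling code only relies on <tt>.read()</tt>, <tt>.write()</tt> and <tt>.close()</tt>, so the same code would work whether the bytes end up in memory, on local disk, or on a remote service.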

Git workflow

2014-07-14T06:15:57Z

Shackra: /* Contributing changes */

GNU MediaGoblin uses git for all our version control and we have the
repositories hosted on [http://gitorious.org/ Gitorious]. We have
two repositories:

* MediaGoblin software: http://gitorious.org/mediagoblin/mediagoblin
* MediaGoblin website: http://gitorious.org/mediagoblin/mediagoblin-website

It's most likely you want to look at the software repository -- not the
website one.

The rest of this chapter talks about using the software repository.

The short of it is: we do not use merge requests. Instead, create a "feature branch" in git that you push somewhere, and link to it on a ticket. Details below.

= How to clone the project =

Do:

git clone git://gitorious.org/mediagoblin/mediagoblin.git

= How to contribute changes =

== Tie your changes to issues in the issue tracker ==

All patches should be tied to issues in the
[http://issues.mediagoblin.org/ issue tracker].
That makes it a lot easier for everyone to track proposed changes and
make sure your hard work doesn't get dropped on the floor! If there
isn't an issue for what you're working on, please create one. The
better the description of what it is you're trying to fix/implement,
the better everyone else is able to understand why you're doing what
you're doing.

== Use bugfix branches to make changes ==

The best way to isolate your changes is to create a branch based off
of the MediaGoblin repository master branch, do the changes related to
that one issue there, and then let us know how to get it.

It's much easier on us if you isolate your changes to a branch focused
on the issue. Then we don't have to sift through things.

It's much easier on you if you isolate your changes to a branch
focused on the issue. Then when we merge your changes in, you just
have to do a {{Cmd|git fetch}} and that's it. This is especially true if
we reject some of your changes, but accept others or otherwise tweak
your changes.

Further, if you isolate your changes to a branch, then you can work on
multiple issues at the same time and they don't conflict with one
another.

Name your branches using the issue number and something that makes it clear
what it's about. For example, if you were working on tagging, you
might name your branch "360_tagging".
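If you like to script this, the naming convention can be sketched as a tiny helper. The function <tt>branch_name</tt> is hypothetical and not part of MediaGoblin; it only shows the "issue number plus short description" pattern:

```python
import re


def branch_name(issue_number, description):
    # Lowercase, and collapse anything that is not a letter or digit into "_".
    slug = re.sub(r'[^a-z0-9]+', '_', description.lower()).strip('_')
    return '%d_%s' % (issue_number, slug)


print(branch_name(360, 'tagging'))
print(branch_name(42, 'Meaning of life'))
```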

== Properly document your changes ==

Include comments in the code.

Write comprehensive commit messages. The better your commit message
is at describing what you did and why, the easier it is for us to
quickly accept your patch.

Write comprehensive comments in the issue tracker about what you're
doing and why.

== How to send us your changes ==

There are two ways to let us know how to get it:

=== push changes to publicly available git clone and let us know where to find it ===

''This is the preferred method of sending changes.''

Push your feature/bugfix/issue branch to your publicly available
git clone and add a comment to the issue with the url for your
clone and the branch to look at.

=== attaching the patch files to the issue ===

Run

{{Cmd|git format-patch --stdout <remote>/master > issue_<number>.patch}}

<tt>format-patch</tt> creates a patch of all the commits that are in
your branch that aren't in <tt><remote>/master</tt>. The <tt>--stdout</tt>
flag causes all this output to go to stdout where it's redirected
to a file named <tt>issue_<number>.patch</tt>. That file should be
based on the issue you're working with. For example,
<tt>issue_42.patch</tt> is a good filename and <tt>issue_42_rev2.patch</tt>
is good if you did a revision of it.

Having said all that, the filename isn't wildly important.

= Example workflow =

Here's an example workflow.

== Contributing changes ==

Slartibartfast from the planet Magrathea far off in the universe has
decided that he is bored with fjords and wants to fix issue 42 (the
meaning of life bug) and send us the changes.

Slartibartfast has cloned the MediaGoblin repository and his clone
lives on gitorious.

Slartibartfast works locally. The remote named <tt>origin</tt> points to
his clone on gitorious. The remote named <tt>gmg</tt> points to the
MediaGoblin repository.

Slartibartfast does the following:

1. Fetches the latest from the MediaGoblin repository:

git fetch --all -p

This tells <tt>git fetch</tt> to fetch all the recent data from all of
the remotes (<tt>--all</tt>) and prune any branches that have been
deleted in the remotes (<tt>-p</tt>).

2. Creates a branch from the tip of the MediaGoblin repository (the remote is named <tt>gmg</tt>) master branch called <tt>bug42_meaning_of_life</tt>:

git checkout -b bug42_meaning_of_life gmg/master

This creates a new branch (<tt>-b</tt>) named <tt>bug42_meaning_of_life</tt> based
on the tip of the <tt>master</tt> branch of the remote named <tt>gmg</tt> and checks
it out.

3. Slartibartfast works hard on his changes in the <tt>bug42_meaning_of_life</tt> branch. When done, he wants to notify us that he has made changes he wants us to see.

4. Slartibartfast pushes his changes to his clone:

git push origin bug42_meaning_of_life --set-upstream

This pushes the changes in the <tt>bug42_meaning_of_life</tt> branch to the remote named <tt>origin</tt>.

5. Slartibartfast adds a comment to issue 42 with the url for his repository and the name of the branch he put the code in. He also explains what he did and why it addresses the issue.

== Updating a contribution ==

Slartibartfast brushes his hands off with the sense of accomplishment
that comes with the knowledge of a job well done. He stands, wanders
over to get a cup of water, then realizes that he forgot to run the
unit tests!

He runs the unit tests and discovers there's a bug in the code!

Then he does this:

1. He checks out the <tt>bug42_meaning_of_life</tt> branch:

git checkout bug42_meaning_of_life

2. He fixes the bug and checks it into the <tt>bug42_meaning_of_life</tt> branch.

3. He pushes his changes to his clone (the remote is named <tt>origin</tt>):

git push origin bug42_meaning_of_life

4. He adds another comment to issue 42 explaining about the mistake and how he fixed it and that he's pushed the new change to the <tt>bug42_meaning_of_life</tt> branch of his publicly available clone.

== What happens next ==

Slartibartfast is once again happy with his work. He finds issue 42
in the issue tracker and adds a comment saying he has pushed his
changes to his branch and explains what they are. (He does not submit
a merge request, since we don't use them.)

Later, someone checks out his code and finds a problem with it. He
adds a comment to the issue tracker specifying the problem and asks
Slartibartfast to fix it. Slartibartfast goes through the above steps
again, fixes the issue, pushes it to his
<tt>bug42_meaning_of_life</tt> branch and adds another comment to the
issue tracker about how he fixed it.

Later, someone checks out his code and is happy with it. Someone
pulls it into the master branch of the MediaGoblin repository and adds
another comment to the issue and probably closes the issue out.

Slartibartfast is notified of this. Slartibartfast does a:

git fetch --all

The changes show up in the <tt>master</tt> branch of the <tt>gmg</tt> remote.
Slartibartfast now deletes his <tt>bug42_meaning_of_life</tt> branch
because he doesn't need it anymore.

= How to learn git =

[[BeginnersCorner#Learning_git]]

HackingHowto

2014-07-14T05:14:42Z

Shackra: /* ArchLinux / Parabola */

= Hacking HOWTO =

== So you want to hack on GNU MediaGoblin? ==

First thing to do is check out the [http://mediagoblin.org/join/ web site] where we list all the project
infrastructure including:

* the IRC channel
* the mailing list
* the issue tracker

Additionally, we have information on how to get involved, who to talk
to, what needs to be worked on, and other things besides!

Second thing to do is take a look at the [http://docs.mediagoblin.org/devel/codebase.html codebase chapter] where
we've started documenting how GNU MediaGoblin is built and how to add
new things. If you're planning on contributing in python, you should be aware
of [http://www.python.org/dev/peps/pep-0008/ PEP-8], the official Python style guide,
which we follow.

Third, you'll need to get the requirements.

Fourth, you'll need to build a development environment. We use an
in-package checkout of virtualenv. This isn't the conventional way to
install virtualenv (normally you don't install virtualenv inside the
package itself) but we've found that it's significantly easier for
newcomers who aren't already familiar with virtualenv. If you ''are''
already familiar with virtualenv, feel free to just install
mediagoblin in your own virtualenv setup... the necessary adjustments
should be obvious.

== Getting requirements ==

First, you need to have the following installed before you can build
an environment for hacking on GNU MediaGoblin:

* Python 2.6 or 2.7 - http://www.python.org/ (You'll need Python as well as the dev files for building modules.)
* python-lxml - http://lxml.de/
* git - http://git-scm.com/
* SQLAlchemy 0.7.0 or higher - http://www.sqlalchemy.org/
* Python Imaging Library (PIL) - http://www.pythonware.com/products/pil/
* virtualenv - http://www.virtualenv.org/
* Python GStreamer Bindings - http://gstreamer.freedesktop.org/modules/gst-python.html

=== GNU/Linux ===

==== Debian and derivatives ====

If you're running Debian GNU/Linux or a Debian-derived distribution
such as Mint or [http://bugs.foocorp.net/issues/478 Ubuntu 10.10+], running the following should install these
requirements:

{{Cmd|sudo apt-get install git-core python python-dev python-lxml python-imaging python-virtualenv python-gst0.10 libjpeg8-dev}}

==== Fedora / RedHat(?) ====

On Fedora:

{{Cmd|yum install python-paste-deploy python-paste-script git-core python python-devel python-lxml python-imaging python-virtualenv gstreamer-python}}

==== ArchLinux / Parabola ====

The following command should work (tested on a new ArchLinux / Parabola install):

{{Cmd|pacman -S git python2 python2-lxml python2-pillow python2-virtualenv gstreamer0.10-python}}

=== Mac OS X ===

==== Mac OS X Lion ====

Download the Newest Python.

Git is already installed.

* Note for PIL and lxml, you can: pip install pil lxml

Python-lxml: http://muffinresearch.co.uk/archives/2009/03/05/install-lxml-on-osx/ with sudo

Python Imaging Library (PIL): http://code.google.com/appengine/docs/python/images/installingPIL.html#mac

Libjpeg & Libpng: http://ethan.tira-thompson.com/Mac_OS_X_Ports.html Combo Installer

==== Mac OS X Snow Leopard ====

# You will probably want to install MacPorts, which gives you access to many free software packages in the same manner as apt-get and yum: https://www.macports.org/install.php
# Ensure you install Git and the command line tools: https://help.github.com/articles/set-up-git#platform-mac
# Once both of those are installed, type this in your terminal and enter your password when prompted: {{Cmd|sudo port install python27 py27-lxml py27-sqlalchemy py27-pil py27-virtualenv py27-gst-python py27-pastescript}}

=== Microsoft Windows ===

''Thanks wctype!''

==== Getting requirements ====

* Python 2.7 - [http://www.python.org/download/ Download] 
* git - [https://github.com/msysgit/git/downloads Download] 
* python-lxml - [http://pypi.python.org/pypi/lxml/2.3.5#downloads Tarball] [http://www.lfd.uci.edu/~gohlke/pythonlibs/#pil Binaries] 
* Python Imaging Library (PIL) - [http://www.pythonware.com/products/pil/ Download] 
* virtualenv - [http://pypi.python.org/pypi/virtualenvwrapper-win/1.0.8#downloads Download] 
* OSSBuild project provides reasonably up-to-date binaries of GStreamer - [https://code.google.com/p/ossbuild/downloads/list Download] 
* py-bcrypt - [https://bitbucket.org/alexandrul/py-bcrypt/downloads/ Download] 

----

'''You can help:'''

If you have instructions for other GNU/Linux distributions, Windows, or Mac OS X to set
up requirements, [http://mediagoblin.org/join/ let us know]!

== How to set up and maintain an environment for hacking with virtualenv ==

'''Requirements'''

No additional requirements.

'''Create a development environment'''

After installing the requirements, follow these steps:

* Clone the repository: {{Cmd|git clone <nowiki>git://gitorious.org/mediagoblin/mediagoblin.git</nowiki>}}
* Change directories to your new checkout: {{Cmd|cd mediagoblin}}
* Checkout git submodules: {{Cmd|git submodule init}} {{Cmd|git submodule update}}
* Set up the in-package virtualenv:
(virtualenv --system-site-packages . || virtualenv .)
* Run setup.py:
{{Cmd|./bin/python setup.py develop}}
* Init the database:
{{Cmd|./bin/gmg dbupdate}}

That's it!

If you want to make sure things are working, consider running the test suite:
{{Cmd|./runtests.sh}}

(If you have trouble with these steps, consider creating the
virtualenv with one of the flags --setuptools, --distribute or possibly --no-site-packages. Additionally, if your system has python3.X as the default, you might need to do virtualenv --python=python2.7 or --python=python2.6.)

If you have problems, please [http://mediagoblin.org/join/ let us know]!

== Updating an existing environment ==

'''Updating for dependency changes'''

While hacking on GNU MediaGoblin over time, you'll eventually have to
update your development environment because the dependencies have
changed.

To do that, run:

{{Cmd|./bin/python setup.py develop --upgrade && ./bin/gmg dbupdate}}

'''Updating for code changes'''

{{Cmd|git pull -u}}
{{Cmd|git submodule update}}

== Running the server ==

If you want to get things running quickly and without hassle, just
run:

{{Cmd|./lazyserver.sh}}

This will start up a python server where you can begin playing with
mediagoblin, listening on 127.0.0.1:6543. It will also run celery in "always eager" mode so you
don't have to start a separate process for it.

By default, the instance is not sending out confirmation mails. Instead they are redirected to the standard output (the console) of lazyserver.sh.

You can change this behavior by setting <code>email_debug_mode</code> to <code>false</code> in mediagoblin.ini.

This is fine in development, but if you want to actually run celery
separately for testing (or deployment purposes), you'll want to run
the server independently:

{{Cmd|./bin/paster serve paste.ini --reload}}

== Running celeryd ==

If you aren't using <tt>./lazyserver.sh</tt> or otherwise aren't running celery
in always eager mode, you'll need to do this if you want your media to
process and actually show up. It's probably a good idea in
development to have the web server (above) running in one terminal and
celeryd in another window.

Run:

{{Cmd|<nowiki>CELERY_CONFIG_MODULE=mediagoblin.init.celery.from_celery ./bin/celeryd</nowiki>}}

== Running the test suite ==

Run:

{{Cmd|./runtests.sh}}

== Running a shell ==

If you want a shell with your database pre-setup and an instantiated
application ready and at your fingertips....

Run:

{{Cmd|./bin/gmg shell}}

== Troubleshooting ==

=== Wiping your user data ===

You can completely wipe all data from the instance by doing:

{{Cmd|rm -rf mediagoblin.db kombu.db celery.db user_dev; ./bin/gmg dbupdate}}

'''Note:'''

Unless you're doing development and working on and testing creating
a new instance, you will probably never have to do this.

== Quickstart for Django programmers ==

We're not using Django, but the codebase is very Django-like in its
structure.

* <tt>routing.py</tt> is like <tt>urls.py</tt> in Django
* <tt>models.py</tt> has SQLAlchemy ORM definitions
* <tt>views.py</tt> is where the views go

We're using SQLAlchemy, which is semi-similar to the Django ORM, but
not really because you can get a lot more fine-grained. The
[http://docs.sqlalchemy.org/en/latest/orm/tutorial.html SQLAlchemy ORM tutorial] is a great place to start.

'''You can help:'''

If there are other things that you think would help orient someone
new to GNU MediaGoblin but coming from Django, let us know!

== Showing off your work with PageKite ==

If you're doing development with MediaGoblin, it's sometimes helpful to show off your work to gather feedback from other contributors. A number of the MediaGoblin developers use something called [http://pagekite.net PageKite], which is a fellow free software web service which makes temporarily showing off work on your machine easy. There's a [http://pagekite.net/wiki/Howto/UsePageKiteWithMediaGoblin/ tutorial on how to use PageKite and MediaGoblin together] available on the PageKite wiki.

If you are doing a lot of MediaGoblin development, the PageKite people have graciously offered us a good amount of bandwidth at no cost in an effort to help out fellow free software projects. If you've been making significant contributions, PM Chris Webber on freenode (who is paroneayea there) and ask if you can be added to our group plan.

== Bite-sized bugs to start with ==

Now you should visit our latest list of [http://issues.mediagoblin.org/query?status=!closed&keywords=~bitesized bite-sized issues] because squishing bugs is messy fun. If you're interested in other things to work on, or need help getting started on a bug, let us know on [http://mediagoblin.org/join/ the mailing list] or on the [http://mediagoblin.org/join/ IRC channel].

Deployment

2013-05-22T22:10:08Z

Shackra: libapache2-mod-fcgi is on main http://packages.debian.org/wheezy/libapache2-mod-fcgid

This page could use a lot of work. For now, a few smaller deployment tips!

See also: http://docs.mediagoblin.org/deploying.html (some of which may belong here)

= FCGI script =

This works great with the apache FCGID config example in the next section :) in which case you should name it "mg.fcgi".

Before use, make sure you replace '/path/to/mediagoblin/bin/python' with a real path on your server, e.g. '/srv/www/myhomepage.com/mediagoblin/bin/python'. Also replace '/path/to/mediagoblin/paste.ini'.

If you encounter problems, try executing the script manually, e.g. <pre>./mg.fcgi</pre>

Script:

<pre>#!/path/to/mediagoblin/bin/python

# Written in 2011 by Christopher Allan Webber
#
# To the extent possible under law, the author(s) have dedicated all
# copyright and related and neighboring rights to this software to the
# public domain worldwide. This software is distributed without any
# warranty.
#
# You should have received a copy of the CC0 Public Domain Dedication along
# with this software. If not, see
# <http://creativecommons.org/publicdomain/zero/1.0/>.

from paste.deploy import loadapp
from flup.server.fcgi import WSGIServer

CONFIG_PATH = '/path/to/mediagoblin/paste.ini'

## Uncomment this to run celery in "always eager" mode... ie, you don't have
## to run a separate process, but submissions wait till processing finishes
# import os
# os.environ['CELERY_ALWAYS_EAGER'] = 'true'

def launch_fcgi():
    ccengine_wsgi_app = loadapp('config:' + CONFIG_PATH)
    WSGIServer(ccengine_wsgi_app).run()

if __name__ == '__main__':
    launch_fcgi()</pre>
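The script above works because <tt>loadapp()</tt> returns a WSGI application, which is just a callable that a server such as flup's <tt>WSGIServer</tt> invokes once per request. As a sketch of that contract using only the standard library, here is a stand-in application and a helper that calls it the way a server would. Both <tt>hello_app</tt> and <tt>call_app</tt> are hypothetical names for this illustration, not part of MediaGoblin:

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    # Hypothetical stand-in for the application object loadapp() returns.
    body = b'Hello from a WSGI app\n'
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]


def call_app(app, path='/'):
    """Invoke a WSGI app once, the way a server would for one request."""
    environ = {'PATH_INFO': path}
    setup_testing_defaults(environ)  # fill in the remaining required keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    body = b''.join(app(environ, start_response))
    return captured['status'], body


status, body = call_app(hello_app)
print(status, body)
```

If running <tt>./mg.fcgi</tt> by hand fails, checking that the loaded application responds to a call like this can help separate application problems from web server configuration problems.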

= Apache 2 Config with fcgid =

Note that the libapache2-mod-fcgi in Debian is in the main section. libapache2-mod-fcgid can be used, but requires a slightly different configuration.

<pre>
<VirtualHost *:80>
Options +ExecCGI

# Accept up to 16MB requests
FcgidMaxRequestLen 16777216

ServerName mediagoblin.example.org

Alias /mgoblin_static/ /path/to/mediagoblin/mediagoblin/static/
Alias /mgoblin_media/ /path/to/mediagoblin/user_dev/media/public/

ScriptAlias / /path/to/mediagoblin/mg.fcgi/
</VirtualHost>
</pre>
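The <tt>FcgidMaxRequestLen</tt> value in the config above is a byte count. If you want a different upload limit, the number can be worked out as follows; the helper name <tt>megabytes_to_bytes</tt> is just for this illustration:

```python
def megabytes_to_bytes(megabytes):
    # 1 MB is taken as 1024 * 1024 bytes, which matches the 16MB comment.
    return megabytes * 1024 * 1024


print(megabytes_to_bytes(16))  # 16777216, the value used above
```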

= Apache Config Example =
This configuration example uses mod_fastcgi.

To install and enable mod_fastcgi on a Debian/Ubuntu based system:
<pre># apt-get install libapache2-mod-suexec libapache2-mod-fastcgi
# a2enmod suexec
# a2enmod fastcgi</pre>

Sample configuration:
<pre>
<VirtualHost *:80>
ServerName mediagoblin.yourdomain.tld
ServerAdmin webmaster@yourdomain.tld
DocumentRoot /var/www/
# Custom log files
CustomLog /var/log/apache2/mediagoblin_access.log combined
ErrorLog /var/log/apache2/mediagoblin_error.log

# Serve static and media files via alias
Alias /mgoblin_static/ /path/to/mediagoblin/mediagoblin/static/
Alias /mgoblin_media/ /path/to/mediagoblin/user_dev/media/public/

# Rewrite all URLs to fcgi, except for static and media urls
RewriteEngine On
RewriteRule ^/(mgoblin_static|mgoblin_media)($|/) - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^/(.*)$ /mg.fcgi/$1 [QSA,L]

# Allow access to static and media directories
<Directory /path/to/mediagoblin/mediagoblin/static>
Order allow,deny
Allow from all
</Directory>
<Directory /path/to/mediagoblin/user_dev/media/public>
Order allow,deny
Allow from all
</Directory>

# Connect to fcgi server
FastCGIExternalServer /var/www/mg.fcgi -host 127.0.0.1:26543
</VirtualHost>
</pre>
Then, you need to make sure mediagoblin is running in fcgi mode:
<pre>cd /path/to/mediagoblin
./lazyserver.sh --server-name=fcgi fcgi_host=127.0.0.1 fcgi_port=26543</pre>

Note: there may be several ways to improve this configuration

If it is too slow and you use the deflate module, you could try the following option: SetOutputFilter INFLATE

= OpenShift =

There's a blog post explaining how to [http://www.sud0.com/gnu-media-goblin-on-openshift.html install mediagoblin on openshift].

= Juju =

There is a juju [https://juju.ubuntu.com/Charms charm] available for deploying mediagoblin into EC2 or on your local box. [https://juju.ubuntu.com/ juju] is available in Ubuntu 11.10 and later, though it is recommended that you pull it from [https://launchpad.net/~juju/+archive/pkgs the juju PPA], which includes backported packages going back to Ubuntu 11.10. To use the juju charm, install juju and configure either the [https://juju.ubuntu.com/docs/provider-configuration-local.html local provider] or one of the cloud API providers. Currently the [https://juju.ubuntu.com/docs/provider-configuration-ec2.html EC2 provider] is the best supported, and it works not only with Amazon Web Services but also with OpenStack clouds. There is also a newer native OpenStack API provider which is known to support HP Cloud.
<pre>
# if you have not bootstrapped
juju bootstrap
mkdir ~/charms
bzr init-repo ~/charms/precise
bzr branch lp:~clint-fewbar/charms/precise/mediagoblin/trunk ~/charms/precise/mediagoblin
juju deploy --repository ~/charms local:mediagoblin
juju expose mediagoblin
</pre>

Currently the charm is volatile, deploying from trunk, and deploys a single-server version of MediaGoblin only. It will eventually relate to the existing juju charms for other supported data stores to allow one to scale out their MediaGoblin instance.

= Init scripts =

== Debian init scripts ==

Joar has some scripts for running celery and mediagoblin separately that are designed for Debian.

https://github.com/joar/mediagoblin-init-scripts

== Arch Linux init scripts ==

[http://whird.jpope.org/2012/04/14/mediagoblin-archlinux-rcd-scripts Jeremy Pope has written a nice blogpost] on how to add init scripts to deploy MediaGoblin with both the python paste http server and the celery deployments separated.

If you want a simpler setup and don't want to deploy celery separately, consider either setting CELERY_ALWAYS_EAGER to true in the paste init script described above, or check out [http://chimo.chromic.org/2012/03/01/mediagoblin-init-script-on-archlinux/ Chimo's guide].

== Generic, simple init script ==

This is a super stupidly simple init script that was used for mediagoblin.com... note that this has Celery running in always eager mode.

You will need to adjust the paths appropriately. This could probably be better written!

<pre>
#! /bin/sh

## Stupidly simple mediagoblin init script.
#
# Written in 2012 by Christopher Allan Webber
#
# To the extent possible under law, the author(s) have dedicated all
# copyright and related and neighboring rights to this software to the
# public domain worldwide. This software is distributed without any
# warranty.
#
# You should have received a copy of the CC0 Public Domain Dedication along
# with this software. If not, see
# <http://creativecommons.org/publicdomain/zero/1.0/>.

PASTER=/srv/mediagoblin.com/bin/paster
PASTE_CONFIG=/srv/mediagoblin.com/paste.ini
OPTIONS="--pid-file=/tmp/mediagoblin.pid \
--log-file=/srv/mediagoblin.com/paster.log \
--server-name=fcgi fcgi_host=127.0.0.1 fcgi_port=26543"

CELERY_ALWAYS_EAGER=true su -pc "$PASTER serve $PASTE_CONFIG $1 $OPTIONS" webadmin
</pre>

= Running on Dreamhost.com =

===Set up your python virtualenv===

Dreamhost.com servers come with Python 2.4 and 2.5 installed, which are too old for MediaGoblin. This means you need to compile and install a newer Python (don't worry, this is really not difficult on Dreamhost servers). To install Python packages for a local user without touching system files, we need to set up a Python ''virtualenv''. Fortunately, this is not too tricky either. If your server already has Python 2.6 or 2.7 installed, you can skip the [[#Install a local python|python installation stuff]] and you'll only need to [[#Setup virtualenv to install local python packages|install virtualenv]].

====Install a local python====
# Download the latest python: http://python.org/ftp/python/XXX/Python-XXX.tar.bz2
# Unpack with <pre>tar xvjf Python-XXX.tar.bz2</pre>
# ''cd'' into the directory and compile and install python locally:
<blockquote><pre>./configure --prefix=$HOME/local
make
make install</pre></blockquote>
:This will install python (I used 2.7.3) into /home/<USERNAME>/local/bin/python. You might get warnings about some modules not being able to compile.
:It was tcl/tk and bzip2 mainly, but things still worked out overall.
:
:You should now be able to invoke <tt>~/local/bin/python</tt> and drop into a shell of your new python (exit with ctrl-d). Congrats, you now have a new python 2.7 that you can use. However, you will need to be able to install additional packages as a user, and this is what virtualenv allows.

====Setup virtualenv to install local python packages====
# Download the latest virtualenv: http://pypi.python.org/packages/source/v/virtualenv/XXX
# Install the virtualenv: <blockquote><pre>~/local/bin/python ~/virtualenv-1.8/virtualenv.py $HOME/local/virtenv</pre></blockquote>
:: You will now have: ~/local/virtenv/bin/python
# In ~/.bash_profile add: <blockquote><pre>PATH=~/local/virtenv/bin:~/local/bin:${PATH}</pre></blockquote>
:: so that your local python will be preferred.
:: Log out, log in again, and run "which python" to see which version will be invoked. It should be the new python. Check "which easy_install" to see if the one in local/virtenv would be executed.
::
:: You now have 1) a local ''python'' installation that you can use, and 2) an ''easy_install'' that also works with your local user installation. From now on you can, e.g., locally install the nose testing framework (''easy_install nose'') and use it (''python -c "import nose"'').

===Install mediagoblin as a site package===

# Check out mediagoblin from git to e.g. ~/mediagoblin
# Install MediaGoblin and all dependencies for MediaGoblin with easy_install.
:* In the mediagoblin directory issue:
:<blockquote><pre>python setup.py install</pre></blockquote>
:* You will also need to: easy_install lxml
:* Python-image was trickier to install: <pre>easy_install --find-links http://www.pythonware.com/products/pil/ Imaging</pre>
::Test by leaving the mediagoblin directory and checking whether you can import it: <pre>python -c "import mediagoblin"</pre> Look for error messages.

===Set up a WSGI environment on Dreamhost===

# Enable the mod_passenger setting for ruby/python in the domains web panel
# ''cd'' into the domain directory (e.g. ~/media.sspaeth.de for me)
# Copy mediagoblin.ini and paste.ini from the ~/mediagoblin directory here
# Create passenger_wsgi.py in your domain directory (e.g. ~/media.sspaeth.de for me):
<blockquote><pre>import sys, os
INTERP = "/home/<username>/local/virtenv/bin/python"
#INTERP is present twice so that the new python interpreter knows the actual executable path
if sys.executable != INTERP: os.execl(INTERP, INTERP, *sys.argv)

from paste.deploy import loadapp
application = loadapp('config:/home/mediagoblin/media.sspaeth.de/paste.ini')

# If all you get in case of errors is a mysterious error 500, you can set debug=true in paste.ini to see stack traces.
# Otherwise, add this:
#from paste.exceptions.errormiddleware import ErrorMiddleware
#application = ErrorMiddleware(application, debug=True)</pre></blockquote>
# Set up the database by issuing: <pre>gmg dbupdate</pre>
# (optional but recommended) Serve the static files directly from the webserver.
:In the [composite:routing] section of paste.ini, comment out the static file routes:
<pre>#/mgoblin_static/ = mediagoblin_static
#/theme_static/ = theme_static</pre>
:Then symlink the mediagoblin/mediagoblin/static directory to public/mgoblin_static
:and mediagoblin/mediagoblin/themes to public/theme_static
:
:so that they are served directly. This step might differ depending on your web server configuration. There is no reason for static
:files to go through all the python indirection when they can be served directly by nginx/apache/...
:
:In case you want to delete your git checkout after the installation (you don't need it, since you installed the mediagoblin package to ~/local/virtenv/lib/python2.7/mediagoblin...), you should do the symlinking from there.
===Open Issues===
Please fill in the answers if you find them:
* How to enable logging to a file or to the standard HTTP logs? Console output just disappears.
* How to set things up without using sqlite. What's wrong with MySQL?

===Troubleshooting===
* First of all: invoke <pre>python passenger_wsgi.py</pre> directly from the shell. This will tell you if there are import errors etc. If this command exits without any output on the console, you have already achieved a lot.
* In ''paste.ini'' set debug=true. This will show you python backtraces directly in the web page.
* You will need to configure an smtp user/passwd/server in mediagoblin.ini, so you actually get the account creation token mailed out. Hint, the relevant settings are: <blockquote><pre>email_debug_mode = false
email_sender_address = postmaster@sspaeth.de
email_smtp_host = SMTP.DOMAIN.TLD
email_smtp_user = USERNAME
email_smtp_pass = WEIRDPASSWORD</pre></blockquote>
* If you can upload media but it does not appear, you don't have the celery server running. Add <pre>CELERY_ALWAYS_EAGER = true</pre> to the ''[celery]'' section in mediagoblin.ini

=Miscellaneous Hacks=
==Force translation==
There might be some conditions under which you would prefer that MediaGoblin always return the same translation. If you are deploying with nginx and fastcgi, you can force MediaGoblin to return a specific translation regardless of browser preferences by passing a specific HTTP_ACCEPT_LANGUAGE fastcgi parameter in your location block. Example last line in location block:
<nowiki>
fastcgi_param HTTP_ACCEPT_LANGUAGE es; #force spanish translation
</nowiki>

User:Shackra/Multimedia

2012-05-27T07:42:23Z

Shackra: /* Dependencies */

== Video Support ==

(from [[User:joar/Multimedia|Joar Multimedia]] page) Video and multimedia support has been merged into master. This does not mean that the information below is invalid, but that video support is considered more stable.

== Dependencies ==
To enable video transcoding in MediaGoblin with [http://parabolagnulinux.org Parabola GNU/Linux-libre], you need to have the following packages:

* python2-gtkhtml2
* gstreamer0.10-base-plugins
* gstreamer0.10-good-plugins
* gstreamer0.10-ugly-plugins
* gstreamer0.10-bad-libre-plugins
* gstreamer0.10-base
* gstreamer0.10-good
* gstreamer0.10-ugly
* gstreamer0.10-bad-libre
* gstreamer0.10-ffmpeg
* gstreamer0.10-python
* python2-gobject

=== Installing dependencies ===
{{Cmd|pacman -S python2-gtkhtml2 gstreamer0.10-base-plugins gstreamer0.10-good-plugins gstreamer0.10-ugly-plugins gstreamer0.10-bad-libre-plugins gstreamer0.10-base gstreamer0.10-good gstreamer0.10-ugly gstreamer0.10-bad-libre gstreamer0.10-ffmpeg gstreamer0.10-python python2-gobject}}

[http://cdn.memegenerator.net/instances/400x/21033858.jpg With Parabola GNU/Linux-libre everything is so damn easy!]

User:Shackra/Multimedia

2012-05-27T07:32:55Z

Shackra: /* Installing dependencies */

== Video Support ==

(from [[User:joar/Multimedia|Joar Multimedia]] page) Video and multimedia support has been merged into master. This does not mean that the information below is invalid, but that video support is considered more stable.

== Dependencies ==
To enable video transcoding in MediaGoblin on [http://parabolagnulinux.org Parabola GNU/Linux-libre], you need to have the following packages:

* python2-gtkhtml2
* gstreamer0.10-base-plugins
* gstreamer0.10-good-plugins
* gstreamer0.10-ugly-plugins
* gstreamer0.10-bad-libre-plugins
* gstreamer0.10-base
* gstreamer0.10-good
* gstreamer0.10-ugly
* gstreamer0.10-bad-libre
* gstreamer0.10-ffmpeg
* gstreamer0.10-python
* python2-gobject

=== Installing dependencies ===
{{Cmd|pacman -S python2-gtkhtml2 gstreamer0.10-base-plugins gstreamer0.10-good-plugins gstreamer0.10-ugly-plugins gstreamer0.10-bad-libre-plugins gstreamer0.10-base gstreamer0.10-good gstreamer0.10-ugly gstreamer0.10-bad-libre gstreamer0.10-ffmpeg gstreamer0.10-python python2-gobject}}

[http://cdn.memegenerator.net/instances/400x/21033858.jpg With Parabola GNU/Linux-libre everything is so damn easy!]

User:Shackra/Multimedia

2012-05-27T07:32:15Z

Shackra: Created page with "== Video Support == (from Joar Multimedia page) Video and multimedia support has been merged into master. This does not mean that the information below ..."

== Video Support ==

(from [[User:joar/Multimedia|Joar Multimedia]] page) Video and multimedia support has been merged into master. This does not mean that the information below is invalid, but that video support is considered more stable.

== Dependencies ==
To enable video transcoding in MediaGoblin on [http://parabolagnulinux.org Parabola GNU/Linux-libre], you need to have the following packages:

* python2-gtkhtml2
* gstreamer0.10-base-plugins
* gstreamer0.10-good-plugins
* gstreamer0.10-ugly-plugins
* gstreamer0.10-bad-libre-plugins
* gstreamer0.10-base
* gstreamer0.10-good
* gstreamer0.10-ugly
* gstreamer0.10-bad-libre
* gstreamer0.10-ffmpeg
* gstreamer0.10-python
* python2-gobject

=== Installing dependencies ===
{{Cmd|pacman -S python2-gtkhtml2 gstreamer0.10-base-plugins gstreamer0.10-good-plugins \
gstreamer0.10-ugly-plugins gstreamer0.10-bad-libre-plugins gstreamer0.10-base gstreamer0.10-good \
gstreamer0.10-ugly gstreamer0.10-bad-libre gstreamer0.10-ffmpeg gstreamer0.10-python python2-gobject}}

[http://cdn.memegenerator.net/instances/400x/21033858.jpg With Parabola GNU/Linux-libre everything is so damn easy!]

Feature Ideas/Reprocessing

2012-04-20T06:03:44Z

Shackra: what I said about this issue

== Rationale ==

In MediaGoblin, processing refers to the act of transforming an original media file in various ways to make it suitable to serve. For example, with images, we prepare resized versions for thumbnail and gallery views. With video, we capture a thumbnail frame, and transcode a medium-sized version for embedded viewing.

Normally, we process media as soon as we can after it's been uploaded to the site. Sometimes, we want to reprocess some media. There are a couple of reasons why this might happen:

* The original processing attempt failed. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!). Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.

* Something has changed on the site such that we ought to reprocess media that has already been processed. Maybe the administrator changed the size of thumbnail views, or in the future the MediaGoblin code will use a different audio codec. For an event like this, we need to reprocess all the affected existing media to make sure we can effectively serve them in the new way. These events should only take place when a site administrator requests it, and maybe when the site configuration changes to demand it.

Brett plans to work on this. If you want to help, get in touch! This is [http://issues.mediagoblin.org/ticket/420 bug #420].

== Reprocessing design ==

=== When should we try to reprocess? ===

If we're reprocessing media because previous attempts failed, we're likely to be more or less successful depending on ''why'' we failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.

TODO: We should collect known cases of failed processing and what they looked like. That will help us write code to determine why processing failed, and whether or not it's worthwhile to retry. (Maybe [http://issues.mediagoblin.org/ticket/438 bug #438] is a good start?)

=== When should we start reprocessing? ===

There are two forces pushing us in different directions on this. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that not much can change between different attempts, we're just wasting computing resources to little end. This could hurt our general site performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.

Like scheduling code in GNU/Linux, there are a million different ways we could approach this, and no one system is going to be perfect for every site. We should instead strive to give hosts the tools they need to easily configure MediaGoblin according to their needs and their desires.

The first piece of this puzzle is to make full use of Celery's task routing capabilities. Each task should use an exchange that indicates at least:

* the media type
* whether this processing is for
** a new upload
** retry after failed processing
** reprocessing at administrator request

With this framework in place, a host has the capability to configure Celery with different worker pools for each of these exchanges depending on their needs and preferences.
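As a rough illustration of that framework, here is a minimal Python sketch of a naming scheme that encodes both the media type and the reason for processing. The names <tt>MEDIA_TYPES</tt>, <tt>REASONS</tt>, <tt>routing_key</tt> and the <tt>process.</tt> prefix are assumptions made for this example; nothing on this page fixes the actual exchange or queue names, and the sketch does not touch Celery itself.

```python
from itertools import product

# Hypothetical names: this page does not define concrete exchange names.
MEDIA_TYPES = ("image", "video", "audio")
REASONS = ("new_upload", "retry", "admin_reprocess")


def routing_key(media_type, reason):
    """Build a name such as 'process.video.retry'."""
    if media_type not in MEDIA_TYPES:
        raise ValueError("unknown media type: %r" % (media_type,))
    if reason not in REASONS:
        raise ValueError("unknown reason: %r" % (reason,))
    return "process.%s.%s" % (media_type, reason)


def all_routing_keys():
    """List every name, so a host can map each one to its own worker pool."""
    return [routing_key(m, r) for m, r in product(MEDIA_TYPES, REASONS)]
```

A host could then, for instance, give names ending in <tt>.new_upload</tt> a larger worker pool than names ending in <tt>.retry</tt>.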

However, Celery doesn't handle scheduling of tasks outside the constraints of worker pools. It's up to us to decide, and write in the code, issues like how long to wait between reprocessing attempts and when to give up completely. (For version 1, I'm planning an exponential backoff algorithm with a maximum wait of 1 day. TODO: Should there be different configuration knobs for each media type? That's a lot more complexity, but it's pretty hard to argue against the idea that expectations for processing ASCII art should be different from processing video.) Key values should be stored in and read from the global MediaGoblin configuration.
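To make the version 1 idea concrete, here is a minimal sketch of an exponential backoff with a one-day cap. The function name <tt>backoff_delay</tt> and the 60-second base delay are assumptions for illustration only; in a real implementation these values would presumably come from the global MediaGoblin configuration.

```python
MAX_WAIT_SECONDS = 24 * 60 * 60  # the one-day maximum wait mentioned above


def backoff_delay(attempt, base_seconds=60, max_seconds=MAX_WAIT_SECONDS):
    """Return the seconds to wait before retry number `attempt` (1 = first retry).

    The delay doubles with every retry and is capped at `max_seconds`.
    The 60-second base is an assumed default, not a decided value.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    delay = base_seconds * 2 ** (attempt - 1)
    return min(delay, max_seconds)
```

With these assumed defaults the waits are 60, 120, 240, ... seconds, and the cap is reached at the twelfth retry.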

TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance out of the box. Get feedback about the version 1 scheme, and maybe get alternative proposals.

=== cwebber's vague thoughts ===

<nowiki>09:09 < paroneayea> I've been thinking vaguely about a few things related to
that like
09:10 < paroneayea> "what if you don't have the original anymore? Does it
reprocess it into something more lossy?"
09:11 < paroneayea> "Should we set it up so that things can determine
conditionally if they should be reprocessed? Ie, if
resolutions have changed, but this one was smaller than the
new lowest resolution anyway?"
09:11 < paroneayea> I'm not sure what the answer to those are but I've only
thought vaguely about them.</nowiki>

=== Shackra said... ===

Take into account the people using OpenStack for file storage. Re-encoding audio files, for example because MediaGoblin switched to Ogg, will be impossible unless you delete the original file and re-upload the new one.

== Reprocessing implementation ==

=== Base Task class ===

I think we can write a common subclass of Celery's Task that will serve as the base for all of our processing tasks. It would provide an <tt>on_failure</tt> method to decide whether or not a retry is appropriate, and if so, handle rewriting the exchange (to mark this as a retry), calculate the exponential backoff time, and reschedule the task.
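The retry decision could look roughly like the sketch below. It is a plain Python stand-in and deliberately does not import Celery: a real Celery <tt>on_failure</tt> handler takes different arguments (<tt>exc, task_id, args, kwargs, einfo</tt>), and the class name, the <tt>PermanentFailure</tt> exception, the attempt limit and the delays here are all assumptions for illustration.

```python
class PermanentFailure(Exception):
    """Hypothetical: raised when retrying cannot help (e.g. corrupt media)."""


class ProcessingTaskSketch(object):
    """Stand-in for a common base class; it is not a real Celery Task."""

    max_attempts = 5           # assumed limit before giving up completely
    base_delay = 60            # assumed first wait, in seconds
    max_delay = 24 * 60 * 60   # one-day cap

    def on_failure(self, exc, attempts):
        """Return the seconds to wait before retrying, or None to give up.

        `attempts` is the number of attempts made so far (at least 1).
        """
        if isinstance(exc, PermanentFailure):
            return None  # retrying a corrupt file would only waste resources
        if attempts >= self.max_attempts:
            return None  # we have tried often enough
        return min(self.base_delay * 2 ** (attempts - 1), self.max_delay)
```

A transient error such as a full disk would get a delay back and be rescheduled, while a <tt>PermanentFailure</tt> would mark the entry as failed right away.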

This class could also serve as a unifying place to collect utility functions that many processing tasks need. There's a lot of file handling code that's repeated with minor variation throughout the processing tasks right now; that could be abstracted into methods of this class to reduce redundancy in the code.

=== Splitting up tasks ===

Right now, our processing tasks are monolithic beasts: one single task performs all of the transformations necessary for the media to be considered "processed." We could improve code readability and maintainability, site reliability, and possibly even performance by splitting tasks up appropriately.

The basic idea here is that each processing task would, after successful completion, queue up a "check if finished" task, which would in turn do quick checks to see if all the necessary results of processing are in place. When it finds that they are, it marks the media entry as processed, and performs clean-up jobs like removing the original queued file, so that the media shows up in the gallery and so on.

(Alternatively, the "check if finished" task could be more stateful, keeping track of which tasks fire it off, and then performing its own work when the last task reports in. This approach seems more fragile and error-prone, so I prefer an approach that checks whether the subtasks actually did their jobs, but that might not always be possible, so I'm making a note of this.)

As an example: image processing includes four jobs: making a thumbnail, a medium image, stashing the original file (with slight renaming as appropriate), and saving EXIF and GPS data to the database. These four tasks could each be run individually. They all fire a "check if finished" task that examines if the media entry has files stashed from all these tasks (in other words, it peeks at <tt>media_files_dict</tt>). When all the files are in place, it marks the entry as processed, and performs necessary cleanup.
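A minimal sketch of such a "check if finished" task for images follows. It uses a plain dictionary as a stand-in for the media entry; the key names (<tt>thumb</tt>, <tt>medium</tt>, <tt>original</tt>, <tt>media_files</tt>, <tt>state</tt>) are assumptions for this example and may not match the real model, and the EXIF/GPS job is left out because it stores data rather than a file.

```python
# Assumed keys: the text above names a thumbnail, a medium image and the original.
REQUIRED_IMAGE_FILES = ("thumb", "medium", "original")


def missing_files(media_files, required=REQUIRED_IMAGE_FILES):
    """Return the required keys that have no stored file yet."""
    return [key for key in required if not media_files.get(key)]


def check_if_finished(entry):
    """Mark the entry as processed once every required file is present.

    `entry` is a plain dict standing in for a media entry. Returns True
    if the entry was marked as processed, False if work is still pending.
    """
    if missing_files(entry["media_files"]):
        return False
    entry["state"] = "processed"
    return True
```

Because the check only looks at what is actually stored, it can safely be fired by every subtask, in any order, and more than once.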

We can potentially save a lot of work with this approach. Consider a video where transcoding succeeds but generating a thumbnail fails. By splitting tasks up, the resource-intensive transcoding will only run once, while we retry thumbnail generation appropriately.

== User visibility ==

After we have reprocessing code, logged in users should be able to see information about where their entries stand in the queue: it's going to be processed, it's going to be reprocessed by such-and-such time, it failed completely. There are already some bugs about this (TODO: collect them here). The current panel would be a good starting point for publishing this information generally. There are also specific places where we could conditionally show useful information: for instance, mention around the media submission page that the media might be slow to appear if processing queues are unusually large.

This is a big enough job that it could probably justify its own feature page...

Feature Ideas/Reprocessing

2012-04-20T05:53:02Z

Shackra: Linux? WTF!

== Rationale ==

In MediaGoblin, processing refers to the act of transforming an original media file in various ways to make it suitable to serve. For example, with images, we prepare resized versions for thumbnail and gallery views. With video, we capture a thumbnail frame, and transcode a medium-sized version for embedded viewing.

Normally, we process media as soon as we can after it's been uploaded to the site. Sometimes, we want to reprocess some media. There are a couple of reasons why this might happen:

* The original processing attempt failed. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!). Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.

* Something has changed on the site such that we ought to reprocess media that has already been processed. Maybe the administrator changed the size of thumbnail views, or in the future the MediaGoblin code will use a different audio codec. For an event like this, we need to reprocess all the affected existing media to make sure we can effectively serve them in the new way. These events should only take place when a site administrator requests it, and maybe when the site configuration changes to demand it.

Brett plans to work on this. If you want to help, get in touch! This is [http://issues.mediagoblin.org/ticket/420 bug #420].

== Reprocessing design ==

=== When should we try to reprocess? ===

If we're reprocessing media because previous attempts failed, we're likely to be more or less successful depending on ''why'' we failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.

TODO: We should collect known cases of failed processing and what they looked like. That will help us write code to determine why processing failed, and whether or not it's worthwhile to retry. (Maybe [http://issues.mediagoblin.org/ticket/438 bug #438] is a good start?)

=== When should we start reprocessing? ===

There are two forces pushing us in different directions on this. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that not much can change between different attempts, we're just wasting computing resources to little end. This could hurt our general site performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.

Like scheduling code in GNU/Linux, there are a million different ways we could approach this, and no one system is going to be perfect for every site. We should instead strive to give hosts the tools they need to easily configure MediaGoblin according to their needs and their desires.

The first piece of this puzzle is to make full use of Celery's task routing capabilities. Each task should use an exchange that indicates at least:

* the media type
* whether this processing is for
** a new upload
** retry after failed processing
** reprocessing at administrator request

With this framework in place, a host has the capability to configure Celery with different worker pools for each of these exchanges depending on their needs and preferences.

However, Celery doesn't handle scheduling of tasks outside the constraints of worker pools. It's up to us to decide, and write in the code, issues like how long to wait between reprocessing attempts and when to give up completely. (For version 1, I'm planning an exponential backoff algorithm with a maximum wait of 1 day. TODO: Should there be different configuration knobs for each media type? That's a lot more complexity, but it's pretty hard to argue against the idea that expectations for processing ASCII art should be different from processing video.) Key values should be stored in and read from the global MediaGoblin configuration.

TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance out of the box. Get feedback about the version 1 scheme, and maybe get alternative proposals.

=== cwebber's vague thoughts ===

<nowiki>09:09 < paroneayea> I've been thinking vaguely about a few things related to
that like
09:10 < paroneayea> "what if you don't have the original anymore? Does it
reprocess it into something more lossy?"
09:11 < paroneayea> "Should we set it up so that things can determine
conditionally if they should be reprocessed? Ie, if
resolutions have changed, but this one was smaller than the
new lowest resolution anyway?"
09:11 < paroneayea> I'm not sure what the answer to those are but I've only
thought vaguely about them.</nowiki>

== Reprocessing implementation ==

=== Base Task class ===

I think we can write a common subclass of Celery's Task that will serve as the base for all of our processing tasks. It would provide an <tt>on_failure</tt> method to decide whether or not a retry is appropriate, and if so, handle rewriting the exchange (to mark this as a retry), calculate the exponential backoff time, and reschedule the task.

This class could also serve as a unifying place to collect utility functions that many processing tasks need. There's a lot of file handling code that's repeated with minor variation throughout the processing tasks right now; that could be abstracted into methods of this class to reduce redundancy in the code.

=== Splitting up tasks ===

Right now, our processing tasks are monolithic beasts: one single task performs all of the transformations necessary for the media to be considered "processed." We could improve code readability and maintainability, site reliability, and possibly even performance by splitting tasks up appropriately.

The basic idea here is that each processing task would, after successful completion, queue up a "check if finished" task, which would in turn do quick checks to see if all the necessary results of processing are in place. When it finds that they are, it marks the media entry as processed, and performs clean-up jobs like removing the original queued file, so that the media shows up in the gallery and so on.

(Alternatively, the "check if finished" task could be more stateful, keeping track of which tasks fire it off, and then performing its own work when the last task reports in. This approach seems more fragile and error-prone, so I prefer an approach that checks whether the subtasks actually did their jobs, but that might not always be possible, so I'm making a note of this.)

As an example: image processing includes four jobs: making a thumbnail, a medium image, stashing the original file (with slight renaming as appropriate), and saving EXIF and GPS data to the database. These four tasks could each be run individually. They all fire a "check if finished" task that examines if the media entry has files stashed from all these tasks (in other words, it peeks at <tt>media_files_dict</tt>). When all the files are in place, it marks the entry as processed, and performs necessary cleanup.

We can potentially save a lot of work with this approach. Consider a video where transcoding succeeds but generating a thumbnail fails. By splitting tasks up, the resource-intensive transcoding will only run once, while we retry thumbnail generation appropriately.

== User visibility ==

After we have reprocessing code, logged in users should be able to see information about where their entries stand in the queue: it's going to be processed, it's going to be reprocessed by such-and-such time, it failed completely. There are already some bugs about this (TODO: collect them here). The current panel would be a good starting point for publishing this information generally. There are also specific places where we could conditionally show useful information: for instance, mention around the media submission page that the media might be slow to appear if processing queues are unusually large.

This is a big enough job that it could probably justify its own feature page...

Feature Ideas/Reprocessing

2012-04-20T05:51:39Z

Shackra: adding a possible bug to start

== Rationale ==

In MediaGoblin, processing refers to the act of transforming an original media file in various ways to make it suitable to serve. For example, with images, we prepare resized versions for thumbnail and gallery views. With video, we capture a thumbnail frame, and transcode a medium-sized version for embedded viewing.

Normally, we process media as soon as we can after it's been uploaded to the site. Sometimes, we want to reprocess some media. There are a couple of reasons why this might happen:

* The original processing attempt failed. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!). Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.

* Something has changed on the site such that we ought to reprocess media that has already been processed. Maybe the administrator changed the size of thumbnail views, or in the future the MediaGoblin code will use a different audio codec. For an event like this, we need to reprocess all the affected existing media to make sure we can effectively serve them in the new way. These events should only take place when a site administrator requests it, and maybe when the site configuration changes to demand it.

Brett plans to work on this. If you want to help, get in touch! This is [http://issues.mediagoblin.org/ticket/420 bug #420].

== Reprocessing design ==

=== When should we try to reprocess? ===

If we're reprocessing media because previous attempts failed, we're likely to be more or less successful depending on ''why'' we failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.

TODO: We should collect known cases of failed processing and what each one looked like. That will help us write code to determine why processing failed, and whether it's worthwhile to retry. ([http://issues.mediagoblin.org/ticket/438 bug #438] may be a good start.)

=== When should we start reprocessing? ===

There are two forces pushing us in different directions on this. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that not much can change between different attempts, we're just wasting computing resources to little end. This could hurt our general site performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.

Like scheduling code in Linux, there are a million different ways we could approach this, and no one system is going to be perfect for every site. We should instead strive to give hosts the tools they need to easily configure MediaGoblin according to their needs and their desires.

The first piece of this puzzle is to make full use of Celery's task routing capabilities. Each task should use an exchange that indicates at least:

* the media type
* whether this processing is for
** a new upload
** retry after failed processing
** reprocessing at administrator request

With this framework in place, a host has the capability to configure Celery with different worker pools for each of these exchanges depending on their needs and preferences.
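
To make the idea concrete, here is a minimal sketch of how the media type and the reason for processing could be folded into a routing key, and how a host's pattern list could map keys to worker pools. It deliberately uses plain Python rather than Celery's own configuration; the key layout <tt>process.<media type>.<reason></tt>, the reason names, and the pool names are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical reason names matching the three cases listed above.
REASONS = ("new", "retry", "reprocess")


def routing_key(media_type, reason):
    """Build a key such as 'process.video.retry' (layout is an assumption)."""
    if reason not in REASONS:
        raise ValueError("unknown processing reason: {!r}".format(reason))
    return "process.{}.{}".format(media_type, reason)


def pick_pool(key, routes, default="default"):
    """Return the pool of the first pattern in `routes` that matches `key`."""
    for pattern, pool in routes:
        if fnmatchcase(key, pattern):
            return pool
    return default


# Example host preference: all video work goes to a beefy pool,
# retries of anything else go to a low-priority pool.
EXAMPLE_ROUTES = [
    ("process.video.*", "heavy"),
    ("process.*.retry", "background"),
]
```

With <tt>EXAMPLE_ROUTES</tt>, a new image upload falls through to the default pool, while a retried image lands in the background pool.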

However, Celery doesn't handle scheduling of tasks outside the constraints of worker pools. It's up to us to decide, and write into the code, things like how long to wait between reprocessing attempts and when to give up completely. (For version 1, I'm planning an exponential backoff algorithm with a maximum wait of 1 day. TODO: Should there be different configuration knobs for each media type? That's a lot more complexity, but it's pretty hard to argue against the idea that expectations for processing ASCII art should be different from processing video.) Key values should be stored in and read from the global MediaGoblin configuration.
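
The version 1 scheme could look roughly like the following. Only the one-day cap comes from the plan above; the 60-second base delay and the doubling factor are placeholders that would come from the configuration.

```python
MAX_WAIT = 24 * 60 * 60  # one day, in seconds


def backoff_delay(attempt, base=60, max_wait=MAX_WAIT):
    """Seconds to wait before retry number `attempt` (1-based).

    The delay doubles with each attempt and is capped at `max_wait`.
    """
    if attempt < 1:
        raise ValueError("attempt must be at least 1")
    return min(base * 2 ** (attempt - 1), max_wait)
```

With these placeholder values the waits run 1 minute, 2 minutes, 4 minutes and so on, reaching the one-day cap at the twelfth attempt.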

TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance out of the box. Get feedback about the version 1 scheme, and maybe get alternative proposals.

=== cwebber's vague thoughts ===

<nowiki>09:09 < paroneayea> I've been thinking vaguely about a few things related to
that like
09:10 < paroneayea> "what if you don't have the original anymore? Does it
reprocess it into something more lossy?"
09:11 < paroneayea> "Should we set it up so that things can determine
conditionally if they should be reprocessed? Ie, if
resolutions have changed, but this one was smaller than the
new lowest resolution anyway?"
09:11 < paroneayea> I'm not sure what the answer to those are but I've only
thought vaguely about them.</nowiki>

== Reprocessing implementation ==

=== Base Task class ===

I think we can write a common subclass of Celery's Task that will serve as the base for all of our processing tasks. It would provide an <tt>on_failure</tt> method that decides whether or not a retry is appropriate and, if so, rewrites the exchange (to mark this as a retry), calculates the exponential backoff time, and reschedules the task.
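
A sketch of the shape this could take follows. To stay runnable on its own it does not import Celery: <tt>ProcessingTaskBase</tt> is a plain class whose <tt>on_failure</tt> mirrors the argument list of Celery's handler, and <tt>reschedule</tt> is a hypothetical hook that a real subclass of Celery's Task would implement with Celery's own retry machinery. The <tt>attempt</tt> keyword argument, the routing key layout, and the numeric defaults are likewise assumptions.

```python
class ProcessingTaskBase(object):
    """Stand-in for a shared base class; not tied to Celery in this sketch."""

    media_type = "image"      # overridden by each media type's tasks
    base_delay = 60           # seconds; placeholder value
    max_wait = 24 * 60 * 60   # one day
    max_attempts = 10         # placeholder for "when to give up completely"

    def should_retry(self, exc):
        # Assumption: only resource-style errors are worth another attempt.
        return isinstance(exc, (MemoryError, OSError))

    def reschedule(self, args, kwargs, countdown, routing_key):
        """Hypothetical hook: queue the task again after `countdown` seconds."""
        raise NotImplementedError

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        attempt = kwargs.get("attempt", 1)
        if attempt >= self.max_attempts or not self.should_retry(exc):
            return  # give up; the entry stays marked as failed
        countdown = min(self.base_delay * 2 ** (attempt - 1), self.max_wait)
        retry_kwargs = dict(kwargs, attempt=attempt + 1)
        # Rewrite the routing information so hosts can treat retries differently.
        retry_key = "process.{}.retry".format(self.media_type)
        self.reschedule(args, retry_kwargs, countdown, retry_key)
```

Keeping the decision (<tt>should_retry</tt>) separate from the mechanics (<tt>reschedule</tt>) means individual media types can override one without touching the other.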

This class could also serve as a unifying place to collect utility functions that many processing tasks need. There's a lot of file handling code that's repeated with minor variation throughout the processing tasks right now; that could be abstracted into methods of this class to reduce redundancy in the code.

=== Splitting up tasks ===

Right now, our processing tasks are monolithic beasts: one single task performs all of the transformations necessary for the media to be considered "processed." We could improve code readability and maintainability, site reliability, and possibly even performance by splitting tasks up appropriately.

The basic idea here is that each processing task would, after successful completion, queue up a "check if finished" task, which would in turn do quick checks to see if all the necessary results of processing are in place. When it finds that they are, it marks the media entry as processed, so that the media shows up in the gallery and so on, and performs clean-up jobs like removing the original queued file.

(Alternatively, the "check if finished" task could be more stateful, keeping track of which tasks fire it off, and then performing its own work when the last task reports in. This approach seems more fragile and error-prone, so I prefer an approach that checks whether the subtasks actually did their jobs, but that might not always be possible, so I'm making a note of this.)

As an example: image processing includes four jobs: making a thumbnail, making a medium image, stashing the original file (with slight renaming as appropriate), and saving EXIF and GPS data to the database. These four tasks could each be run individually. They all fire a "check if finished" task that checks whether the media entry has files stashed from all these tasks (in other words, it peeks at <tt>media_files_dict</tt>). When all the files are in place, it marks the entry as processed, and performs necessary cleanup.
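
The stateless check for the image case might be sketched like this. The media entry is modelled as a plain dict so the example stands alone, and the key names <tt>thumb</tt>, <tt>medium</tt> and <tt>original</tt>, the <tt>state</tt> values, and the <tt>cleanup</tt> callback are assumptions. The EXIF and GPS job stores its results in the database rather than as a file, so a real check would need an extra test that this sketch leaves out.

```python
# Hypothetical keys the three file-producing image jobs would fill in.
REQUIRED_FILES = frozenset(["thumb", "medium", "original"])


def missing_outputs(media_files, required=REQUIRED_FILES):
    """Return the sorted names of outputs that have not been stashed yet."""
    return sorted(key for key in required if not media_files.get(key))


def check_if_finished(entry, cleanup):
    """Mark `entry` processed once every required file is present.

    Safe to call after every subtask: it does nothing until the last
    output appears, and cleans up only once.
    """
    if entry["state"] == "processed":
        return True
    if missing_outputs(entry["media_files"]):
        return False
    entry["state"] = "processed"
    cleanup(entry)  # e.g. remove the original queued file
    return True
```

Because the check looks at what is actually stored instead of counting which subtasks reported in, it does not matter in which order the subtasks finish or whether one of them ran twice.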

We can potentially save a lot of work with this approach. Consider a video where transcoding succeeds but generating a thumbnail fails. By splitting tasks up, the resource-intensive transcoding will only run once, while we retry thumbnail generation appropriately.

== User visibility ==

After we have reprocessing code, logged-in users should be able to see where each of their entries stands in the queue: whether it is waiting to be processed, scheduled to be reprocessed by such-and-such time, or has failed completely. There are already some bugs about this (TODO: collect them here). The current panel would be a good starting point for publishing this information generally. There are also specific places where we could conditionally show useful information: for instance, mention around the media submission page that the media might be slow to appear if processing queues are unusually large.

This is a big enough job that it could probably justify its own feature page...