Feature Ideas/Reprocessing

From GNU MediaGoblin Wiki
Revision as of 00:15, 10 April 2012

Rationale

In MediaGoblin, processing refers to the act of transforming an original media file in various ways to make it suitable to serve. For example, with images, we prepare resized versions for thumbnail and gallery views. With video, we capture a thumbnail frame, and transcode a medium-sized version for embedded viewing.

Normally, we process media as soon as we can after it's been uploaded to the site. Sometimes, we want to reprocess some media. There are a couple of reasons why this might happen:

  • The original processing attempt failed. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!).

    Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.

  • Something has changed on the site such that we ought to reprocess media that has already been processed. Maybe the administrator changed the size of thumbnail views, or in the future the MediaGoblin code will use a different audio codec. For an event like this, we need to reprocess all the affected existing media to make sure we can effectively serve them in the new way.

Brett plans to work on this. If you want to help, get in touch! This is bug #420.

Preparatory refactoring

Before I get started writing new code, I'd like to refactor the existing processing code. All of the media type processing.py files have chunks of code that look like this:

prepare an output file
with output file:
    process the media
    save the processed version to the file

After all this is done, we save changes to the MediaEntry. The processing bit is unique, but the code for actually creating the files stays more or less the same, and I think it could be factored out to common functions in the main processing module. As a side benefit, I think separating the file handling code will give us the opportunity to make it more robust, which should help us fix issues like #419.
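The common pattern above might be factored out along these lines. This is only a sketch: the helper name and the callback interface are illustrative guesses, not MediaGoblin's actual storage API. Writing to a temporary file and renaming it into place only on success is one way the extracted code could be made more robust against half-finished processing runs, which is the kind of failure behind issues like #419.

```python
# Hypothetical common helper for the main processing module.  The name
# store_processed_file and the `process` callback signature are
# assumptions for illustration, not MediaGoblin's real API.

import os
import tempfile


def store_processed_file(destination_path, process):
    """Run `process`, writing its output to destination_path atomically.

    `process` is a callable that receives a writable binary file object
    and writes the processed media into it.  Because we write to a
    temporary file first and rename only on success, a failed processing
    run never leaves a half-written file at the destination.
    """
    directory = os.path.dirname(destination_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as output:
            process(output)
    except Exception:
        # Clean up the partial temporary file, then let the caller
        # (e.g. the reprocessing scheduler) see the failure.
        os.remove(tmp_path)
        raise
    os.replace(tmp_path, destination_path)
```

Each media type's processing.py would then only supply the `process` callback, keeping the unique transcoding logic separate from the shared file handling.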

Reprocessing design

When should we try to reprocess?

If we're reprocessing media because previous attempts failed, we're likely to be more or less successful depending on why we failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.

TODO: We should collect known cases of when processing failed, what it looked like. That will help us write code to determine why processing failed, and whether or not it's worthwhile to retry.
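As a rough illustration of where that TODO could lead, here is a hypothetical triage function. The exception classes standing in for "low on resources" and "corrupt media" are placeholders; the real lists would come from the collected failure cases the TODO asks for.

```python
# Sketch only: which exceptions count as transient vs. permanent is
# exactly what the TODO above needs to establish from real failures.

TRANSIENT = (MemoryError, OSError)   # e.g. low memory, disk full
PERMANENT = (ValueError,)            # e.g. corrupt or unparseable media


def should_retry(exc):
    """Guess whether reprocessing could succeed after this failure."""
    if isinstance(exc, PERMANENT):
        return False   # corrupt media won't fix itself
    if isinstance(exc, TRANSIENT):
        return True    # resources may be free next time
    return True        # unknown failure: retry, cautiously
```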

When should we start reprocessing?

There are two forces pushing us in different directions here. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that little can change between attempts, we're just wasting computing resources to little effect. This could hurt performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.

As with scheduling code in Linux, there are a million different ways we could do this: no one system is going to be perfect for every site, and we're going to need feedback from lots of users telling us what's good and what's bad before we can make effective adjustments. With all that said, here are some ideas to consider for version 1:

  • Note when we last tried to process some media, and don't bother retrying more often than every X minutes. Alternatively, schedule the next time it's okay to retry the media, with some kind of algorithm like exponential backoff.
  • Whenever the processing queue is empty, if there's something worth retrying (based on the scheduling above), go ahead and retry it.
    • In order to make sure that reprocessing happens on busy sites, we should probably also give entries a time where we will force the item onto the queue if it hasn't already been retried.
    • Or maybe we should have separate queues for new media and reprocessing tasks, and alternate between them when they both have jobs.

In the database, it might be prudent to store information about the media's processing history (last time we tried processing, how many failures we've had), and then use that to determine future reprocessing times as needed. That incurs a teeny bit more overhead, but leaves us much freer to change the details of our scheduling algorithm in later versions of MediaGoblin.
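For example, the stored history could feed an exponential-backoff calculation like this. The five-minute base and one-day cap are made-up illustrative values, not settled policy; the point is that the schedule is derived from the stored fields, so the algorithm can change without a schema change.

```python
# Turn stored history (last attempt time, failure count) into a
# next-retry time.  Base delay and cap are assumed example values.

from datetime import datetime, timedelta


def next_retry_time(last_attempt, failure_count,
                    base=timedelta(minutes=5),
                    cap=timedelta(days=1)):
    """Retry after base * 2**(failures - 1), never waiting more than cap."""
    delay = base * (2 ** max(failure_count - 1, 0))
    return last_attempt + min(delay, cap)
```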

Brett is convincing himself that having two task queues, one for original processing requests and the other for reprocessing, is the best way to implement this. The two-queue approach makes it easy to tweak scheduling algorithms in the future: we can check whether either queue is empty, compare relative queue lengths, and so on, in order to decide how tasks should be prioritized.
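In miniature, the alternation idea could look something like the sketch below. MediaGoblin's real task handling runs through Celery, so this in-process class only illustrates the prioritization logic, not how it would be wired into the task system.

```python
# Illustrative sketch of the two-queue prioritization logic; class and
# attribute names are invented for this example.

from collections import deque


class TwoQueueScheduler:
    def __init__(self):
        self.new_media = deque()      # original processing requests
        self.reprocess = deque()      # retry / reprocessing requests
        self._serve_new_next = True   # alternate when both have work

    def next_task(self):
        """Pop the next task, alternating when both queues are non-empty."""
        if self.new_media and self.reprocess:
            queue = self.new_media if self._serve_new_next else self.reprocess
            self._serve_new_next = not self._serve_new_next
            return queue.popleft()
        if self.new_media:
            return self.new_media.popleft()
        if self.reprocess:
            return self.reprocess.popleft()
        return None
```

With both queues visible in one place, swapping alternation for, say, a 2:1 bias toward new media would be a one-line change.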

TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance. Get feedback about the version 1 scheme, and maybe get alternative proposals.

cwebber's vague thoughts

09:09 < paroneayea> I've been thinking vaguely about a few things related to 
                    that like
09:10 < paroneayea> "what if you don't have the original anymore?  Does it 
                    reprocess it into something more lossy?"
09:11 < paroneayea> "Should we set it up so that things can determine 
                    conditionally if they should be reprocessed?  Ie, if 
                    resolutions have changed, but this one was smaller than the 
                    new lowest resolution anyway?"
09:11 < paroneayea> I'm not sure what the answer to those are but I've only 
                    thought vaguely about them.

Reprocessing implementation

TODO: Brett needs to investigate the code to figure out which part is responsible for scheduling tasks like this, and start writing in more detail where these different pieces get implemented. (Feel free to give him hints here!)

mediagoblin/submit/views.py -- from there you will be led through:

  • mediagoblin/media_types/__init__.py
  • mediagoblin/media_types/*/__init__.py:MEDIA_MANAGER
  • mediagoblin/media_types/*/processing.py

--Joar 18:38, 31 March 2012 (EDT)

User visibility

After we have reprocessing code, logged-in users should be able to see where their entries stand in the queue: whether an entry is waiting to be processed, when it is scheduled for reprocessing, or whether processing failed completely. There are already some bugs about this (TODO: collect them here). The current panel would be a good starting point for publishing this information generally. There are also specific places where we could conditionally show useful information: for instance, mention on the media submission page that media might be slow to appear when processing queues are unusually large.