https://wiki.mediagoblin.org/api.php?action=feedcontributions&user=Brett&feedformat=atom GNU MediaGoblin Wiki - User contributions [en] 2024-03-29T13:15:08Z User contributions MediaWiki 1.39.5

https://wiki.mediagoblin.org/index.php?title=Feature_Ideas/Reprocessing&diff=658 Feature Ideas/Reprocessing 2012-04-11T02:59:33Z <p>Brett: note alternative approach to "check if finished" task</p>
<hr />
<div>== Rationale ==<br />
<br />
In MediaGoblin, processing refers to the act of transforming an original media file in various ways to make it suitable to serve. For example, with images, we prepare resized versions for thumbnail and gallery views. With video, we capture a thumbnail frame, and transcode a medium-sized version for embedded viewing.<br />
<br />
Normally, we process media as soon as we can after it's been uploaded to the site. Sometimes, we want to reprocess some media. There are a couple of reasons why this might happen:<br />
<br />
* <p>The original processing attempt failed. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!).</p><p>Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.</p><br />
<br />
* Something has changed on the site such that we ought to reprocess media that has already been processed. Maybe the administrator changed the size of thumbnail views, or in the future the MediaGoblin code will use a different audio codec. For an event like this, we need to reprocess all the affected existing media to make sure we can effectively serve them in the new way. These events should only take place when a site administrator requests it, and maybe when the site configuration changes to demand it.<br />
<br />
Brett plans to work on this. If you want to help, get in touch! This is [http://issues.mediagoblin.org/ticket/420 bug #420].<br />
<br />
== Reprocessing design ==<br />
<br />
=== When should we try to reprocess? ===<br />
<br />
If we're reprocessing media because previous attempts failed, we're likely to be more or less successful depending on ''why'' we failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.<br />
<br />
TODO: We should collect known cases of processing failures and what they looked like. That will help us write code to determine why processing failed, and whether or not it's worthwhile to retry.<br />
<br />
=== When should we start reprocessing? ===<br />
<br />
There are two forces pushing us in different directions on this. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that not much can change between different attempts, we're just wasting computing resources to little effect. This could hurt our general site performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.<br />
<br />
Like scheduling code in Linux, there are a million different ways we could approach this, and no one system is going to be perfect for every site. We should instead strive to give hosts the tools they need to easily configure MediaGoblin according to their needs and their desires.<br />
<br />
The first piece of this puzzle is to make full use of Celery's task routing capabilities. Each task should use an exchange that indicates at least:<br />
<br />
* the media type<br />
* whether this processing is for<br />
** a new upload<br />
** retry after failed processing<br />
** reprocessing at administrator request<br />
<br />
With this framework in place, a host has the capability to configure Celery with different worker pools for each of these exchanges depending on their needs and preferences.<br />
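For illustration, the exchange names could encode both dimensions above. Everything in this sketch is hypothetical -- MediaGoblin doesn't define these names today:<br />

```python
# Hypothetical exchange naming scheme -- none of these names exist in
# MediaGoblin; they just show how media type and processing kind could
# both be encoded for Celery routing.

MEDIA_TYPES = ("image", "video", "audio", "ascii")
PROCESSING_KINDS = ("new", "retry", "admin_reprocess")

def exchange_for(media_type, kind):
    """Build an exchange name from the media type and the reason for processing."""
    assert media_type in MEDIA_TYPES and kind in PROCESSING_KINDS
    return "processing.%s.%s" % (media_type, kind)

# A host could then bind dedicated worker pools to whichever queues they
# care about, e.g. a low-priority pool consuming only the "retry" queues.
CELERY_ROUTES = {
    exchange_for(mt, kind): {"queue": exchange_for(mt, kind)}
    for mt in MEDIA_TYPES
    for kind in PROCESSING_KINDS
}
```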
<br />
However, Celery doesn't handle scheduling of tasks outside the constraints of worker pools. It's up to us to decide, and write in the code, issues like how long to wait between reprocessing attempts and when to give up completely. (For version 1, I'm planning an exponential backoff algorithm with a maximum wait of 1 day. TODO: Should there be different configuration knobs for each media type? That's a lot more complexity, but it's pretty hard to argue against the idea that expectations for processing ASCII art should be different from processing video.) Key values should be stored in and read from the global MediaGoblin configuration.<br />
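A minimal sketch of the version-1 backoff idea; the base delay is a placeholder for whatever knob the MediaGoblin configuration ends up exposing:<br />

```python
# Exponential backoff capped at one day, as proposed for version 1.
# BASE_DELAY is a made-up default; in real code both values would be
# read from the global MediaGoblin configuration.

BASE_DELAY = 60            # seconds before the first retry (placeholder)
MAX_DELAY = 24 * 60 * 60   # never wait more than a day between attempts

def retry_delay(attempts):
    """Seconds to wait before retry number `attempts` (0-based)."""
    return min(BASE_DELAY * (2 ** attempts), MAX_DELAY)
```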
<br />
TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance out of the box. Get feedback about the version 1 scheme, and maybe get alternative proposals.<br />
<br />
=== cwebber's vague thoughts ===<br />
<br />
<nowiki>09:09 < paroneayea> I've been thinking vaguely about a few things related to <br />
that like<br />
09:10 < paroneayea> "what if you don't have the original anymore? Does it <br />
reprocess it into something more lossy?"<br />
09:11 < paroneayea> "Should we set it up so that things can determine <br />
conditionally if they should be reprocessed? Ie, if <br />
resolutions have changed, but this one was smaller than the <br />
new lowest resolution anyway?"<br />
09:11 < paroneayea> I'm not sure what the answer to those are but I've only <br />
thought vaguely about them.</nowiki><br />
<br />
== Reprocessing implementation ==<br />
<br />
=== Base Task class ===<br />
<br />
I think we can write a common subclass of Celery's Task that will serve as the base for all of our processing tasks. It would provide an <tt>on_failure</tt> method to decide whether or not a retry is appropriate and, if so, to rewrite the exchange (marking the task as a retry), calculate the exponential backoff time, and reschedule the task.<br />
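Roughly, the base class might look like the following. Celery's Task really does expose an <tt>on_failure(exc, task_id, args, kwargs, einfo)</tt> hook with this signature; the class below is a plain-Python stand-in so the sketch is self-contained, and the retry-policy values and helper names are invented:<br />

```python
# Stand-in for the proposed base class.  In real code this would subclass
# celery.Task and reschedule() would rewrite the exchange and call
# self.apply_async(...); here those parts are stubbed so the shape is clear.

class ProcessingTask(object):
    max_attempts = 10                     # hypothetical config knob
    base_delay, max_delay = 60, 86400     # exponential backoff, 1-day cap

    def is_retryable(self, exc):
        # Placeholder: real logic would classify the failure cause,
        # e.g. treating "media file is corrupt" as permanent.
        return not isinstance(exc, ValueError)

    def reschedule(self, kwargs, countdown):
        # Real code: re-route to the "retry" exchange, then
        # self.apply_async(kwargs=kwargs, countdown=countdown).
        raise NotImplementedError

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        attempts = kwargs.get("attempts", 0)
        if self.is_retryable(exc) and attempts < self.max_attempts:
            delay = min(self.base_delay * 2 ** attempts, self.max_delay)
            self.reschedule(dict(kwargs, attempts=attempts + 1), delay)
```

Concrete processing tasks would then inherit the retry behavior for free and only implement their own transformation.<br />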
<br />
This class could also serve as a unifying place to collect utility functions that many processing tasks need. There's a lot of file handling code that's repeated with minor variation throughout the processing tasks right now; that could be abstracted into methods of this class to reduce redundancy in the code.<br />
<br />
=== Splitting up tasks ===<br />
<br />
Right now, our processing tasks are monolithic beasts: one single task performs all of the transformations necessary for the media to be considered "processed." We could improve code readability and maintainability, site reliability, and possibly even performance by splitting tasks up appropriately.<br />
<br />
The basic idea here is that each processing task would, after successful completion, queue up a "check if finished" task, which would in turn do quick checks to see if all the necessary results of processing are in place. When it finds that they are, it marks the media entry as processed, and performs clean-up jobs like removing the original queued file, so that the media shows up in the gallery and so on.<br />
<br />
(Alternatively, the "check if finished" task could be more stateful, keeping track of which tasks fire it off, and then performing its own work when the last task reports in. This approach seems more fragile and error-prone, so I prefer an approach that checks whether the subtasks actually did their jobs, but that might not always be possible, so I'm making a note of this.)<br />
<br />
As an example: image processing includes four jobs: making a thumbnail, making a medium image, stashing the original file (with slight renaming as appropriate), and saving EXIF and GPS data to the database. These four tasks could each be run individually. They would all fire a "check if finished" task that checks whether the media entry has files stashed from all these tasks (in other words, it peeks at <tt>media_files_dict</tt>). When all the files are in place, it marks the entry as processed, and performs necessary cleanup.<br />
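Boiled down to a sketch, the image case might look like this; the entry is a plain dict here and the field names only loosely mirror MediaGoblin's MediaEntry:<br />

```python
# Sketch of the "check if finished" task for images: the entry counts as
# processed once every expected file has been stashed.  Field names are
# illustrative, not MediaGoblin's actual schema.

REQUIRED_IMAGE_FILES = {"thumb", "medium", "original"}

def check_if_finished(entry):
    """Mark the entry processed once every expected file is in place."""
    if REQUIRED_IMAGE_FILES <= set(entry["media_files"]):
        entry["state"] = "processed"
        # ...cleanup would happen here: delete the queued original, etc.
        return True
    return False
```

Because the check is idempotent, it doesn't matter which of the four tasks fires it last, or whether it fires several times.<br />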
<br />
We can potentially save a lot of work with this approach. Consider a video where transcoding succeeds but generating a thumbnail fails. By splitting tasks up, the resource-intensive transcoding will only run once, while we retry thumbnail generation appropriately.<br />
<br />
== User visibility ==<br />
<br />
After we have reprocessing code, logged-in users should be able to see where their entries stand in the queue: whether they're waiting to be processed, scheduled for reprocessing by such-and-such time, or have failed completely. There are already some bugs about this (TODO: collect them here). The current panel would be a good starting point for publishing this information generally. There are also specific places where we could conditionally show useful information: for instance, mention around the media submission page that the media might be slow to appear if processing queues are unusually large.<br />
<br />
This is a big enough job that it could probably justify its own feature page...</div>Brett

https://wiki.mediagoblin.org/index.php?title=Feature_Ideas/Reprocessing&diff=652 Feature Ideas/Reprocessing 2012-04-10T00:17:10Z <p>Brett: note when the second kind of reprocessing occurs</p>
<hr />
<div>== Rationale ==<br />
<br />
In MediaGoblin, processing refers to the act of transforming an original media file in various ways to make it suitable to serve. For example, with images, we prepare resized versions for thumbnail and gallery views. With video, we capture a thumbnail frame, and transcode a medium-sized version for embedded viewing.<br />
<br />
Normally, we process media as soon as we can after it's been uploaded to the site. Sometimes, we want to reprocess some media. There are a couple of reasons why this might happen:<br />
<br />
* <p>The original processing attempt failed. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!).</p><p>Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.</p><br />
<br />
* Something has changed on the site such that we ought to reprocess media that has already been processed. Maybe the administrator changed the size of thumbnail views, or in the future the MediaGoblin code will use a different audio codec. For an event like this, we need to reprocess all the affected existing media to make sure we can effectively serve them in the new way. These events should only take place when a site administrator requests it, and maybe when the site configuration changes to demand it.<br />
<br />
Brett plans to work on this. If you want to help, get in touch! This is [http://issues.mediagoblin.org/ticket/420 bug #420].<br />
<br />
== Preparatory refactoring ==<br />
<br />
Before I get started writing new code, I'd like to refactor the existing processing code. All of the media type processing.py files have chunks of code that look like this:<br />
<br />
<nowiki>prepare an output file<br />
with output file:<br />
process the media<br />
save the processed version to the file</nowiki><br />
<br />
After all this is done, we save changes to the MediaEntry. The processing bit is unique, but the code for actually creating the files stays more or less the same, and I think it could be factored out to common functions in the main processing module. As a side benefit, I think separating the file handling code will give us the opportunity to make it more robust, which should help us fix issues like [http://issues.mediagoblin.org/ticket/419 #419].<br />
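One possible shape for the factored-out helper: write to a temporary file and only move it into place on success, so a failed processing run leaves no partial output behind. The function name is invented; this is a sketch, not existing MediaGoblin code:<br />

```python
# Hypothetical common helper for the pattern above.  Each media type's
# processing code would call this instead of hand-rolling the file dance.

import contextlib
import os
import tempfile

@contextlib.contextmanager
def processed_file(final_path):
    """Yield a writable temp file; move it to final_path only on success."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            yield tmp
        os.rename(tmp_path, final_path)   # commit only if processing succeeded
    except Exception:
        os.unlink(tmp_path)               # otherwise leave no partial output
        raise
```

A processing task would then do its work inside the with block, e.g. <tt>with processed_file(thumb_path) as f: f.write(thumb_data)</tt>, which should also help with robustness issues like #419.<br />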
<br />
== Reprocessing design ==<br />
<br />
=== When should we try to reprocess? ===<br />
<br />
If we're reprocessing media because previous attempts failed, we're likely to be more or less successful depending on ''why'' we failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.<br />
<br />
TODO: We should collect known cases of when processing failed, what it looked like. That will help us write code to determine why processing failed, and whether or not it's worthwhile to retry.<br />
<br />
=== When should we start reprocessing? ===<br />
<br />
There are two forces pushing us in different directions on this. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that not much can change between different attempts, we're just wasting computing resources to little effect. This could hurt our performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.<br />
<br />
Like scheduling code in Linux, there are a million different ways we could do this, no one system is going to be perfect for every site, and we're going to need feedback from lots of users telling us what's good and what's bad before we can make effective adjustments. With all that said, here are some ideas to consider for version 1:<br />
<br />
* Note when we last tried to process some media, and don't bother retrying more often than every X minutes. Alternatively, schedule the next time it's okay to retry the media, with some kind of algorithm like exponential backoff. <br />
<br />
* Whenever the processing queue is empty, if there's something worth retrying (based on the scheduling above), go ahead and retry it. <br />
** In order to make sure that reprocessing happens on busy sites, we should probably also give entries a deadline after which we will force the item onto the queue if it hasn't already been retried.<br />
** Or maybe we should have separate queues for new media and reprocessing tasks, and alternate between them when they both have jobs.<br />
<br />
In the database, it might be prudent to store information about the media's processing history (last time we tried processing, how many failures we've had), and then use that to determine future reprocessing times as needed. That incurs a teeny bit more overhead, but leaves us much freer to change the details of our scheduling algorithm in later versions of MediaGoblin.<br />
<br />
Brett is convincing himself that having two task queues--one for original processing requests, and the other for reprocessing--is the best way to implement this. The two-queue approach makes it very easy to tweak scheduling algorithms in the future: it becomes very easy to see whether one or the other queue is empty, compare relative queue lengths, etc., in order to decide how tasks should be prioritized.<br />
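As a toy illustration of that idea (not anything implemented), a scheduler could alternate between the two queues whenever both have work, so neither new uploads nor reprocessing starves:<br />

```python
# Purely illustrative two-queue selection: prefer whichever queue has work,
# and alternate when both do.  MediaGoblin has no such scheduler today.

from collections import deque

def next_task(new_queue, retry_queue, last_was_new):
    """Pop the next task; returns (task, took_from_new_queue)."""
    if new_queue and (not retry_queue or not last_was_new):
        return new_queue.popleft(), True    # new-media queue's turn
    if retry_queue:
        return retry_queue.popleft(), False  # reprocessing queue's turn
    return None, last_was_new                # both queues empty
```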
<br />
TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance. Get feedback about the version 1 scheme, and maybe get alternative proposals.<br />
<br />
=== cwebber's vague thoughts ===<br />
<br />
<nowiki>09:09 < paroneayea> I've been thinking vaguely about a few things related to <br />
that like<br />
09:10 < paroneayea> "what if you don't have the original anymore? Does it <br />
reprocess it into something more lossy?"<br />
09:11 < paroneayea> "Should we set it up so that things can determine <br />
conditionally if they should be reprocessed? Ie, if <br />
resolutions have changed, but this one was smaller than the <br />
new lowest resolution anyway?"<br />
09:11 < paroneayea> I'm not sure what the answer to those are but I've only <br />
thought vaguely about them.</nowiki><br />
<br />
== Reprocessing implementation ==<br />
<br />
TODO: Brett needs to investigate the code to figure out which part is responsible for scheduling tasks like this, and start writing in more detail where these different pieces get implemented. (Feel free to give him hints here!)<br />
<br />
<blockquote><br />
mediagoblin/submit/views.py -- From there you will be led through<br />
<br />
* mediagoblin/media_types/__init__.py<br />
* mediagoblin/media_types/*/__init__.py:MEDIA_MANAGER<br />
* mediagoblin/media_types/*/processing.py<br />
--[[User:Joar|Joar]] 18:38, 31 March 2012 (EDT)<br />
</blockquote><br />
<br />
== User visibility ==<br />
<br />
After we have reprocessing code, logged-in users should be able to see where their entries stand in the queue: whether an entry is waiting to be processed, is scheduled for reprocessing by such-and-such a time, or has failed completely. There are already some bugs about this (TODO: collect them here). The current panel would be a good starting point for publishing this information generally. There are also specific places where we could conditionally show useful information: for instance, mentioning near the media submission page that media may be slow to appear when processing queues are unusually large.</div>
<div>== Rationale ==<br />
<br />
Sometimes, when we try to process new media, we'll fail. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!). Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.<br />
<br />
Brett plans to work on this. If you want to help, get in touch!<br />
<br />
== Preparatory refactoring ==<br />
<br />
Before I get started writing new code, I'd like to refactor the existing processing code. All of the media type processing.py files have chunks of code that look like this:<br />
<br />
<nowiki>prepare an output file<br />
with output file:<br />
process the media<br />
save the processed version to the file</nowiki><br />
<br />
After all this is done, we save changes to the MediaEntry. The processing bit is unique, but the code for actually creating the files stays more or less the same, and I think it could be factored out to common functions in the main processing module. As a side benefit, I think separating the file handling code will give us the opportunity to make it more robust, which should help us fix issues like [http://issues.mediagoblin.org/ticket/419 #419].<br />
<br />
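As a rough sketch (the class name and exact behavior are guesses, not the actual MediaGoblin API), such a shared helper might write to a temporary file and only move it into place when processing succeeds, so a failed run never leaves a half-written file behind:<br />
<br />
```python
import os
import tempfile

# Hypothetical shared file-handling helper for the media processors:
# processing writes to a temporary file; the result is committed to its
# final path only if no exception was raised.
class processed_file(object):
    def __init__(self, final_path):
        self.final_path = final_path

    def __enter__(self):
        fd, self.tmp_path = tempfile.mkstemp(
            dir=os.path.dirname(self.final_path) or '.')
        self.fileobj = os.fdopen(fd, 'wb')
        return self.fileobj

    def __exit__(self, exc_type, exc_value, traceback):
        self.fileobj.close()
        if exc_type is None:
            os.rename(self.tmp_path, self.final_path)  # commit the result
        else:
            os.remove(self.tmp_path)  # discard partial output on failure
        return False  # let any processing error propagate
```
<br />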
== Reprocessing design ==<br />
<br />
=== When should we try to reprocess? ===<br />
<br />
Reprocessing can be more or less helpful depending on why previous processing attempts failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.<br />
<br />
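To make the distinction concrete, a failure classifier might start out looking something like this (the exception types are stand-ins, not MediaGoblin's real error cases, which still need to be collected):<br />
<br />
```python
# Toy classification of processing failures: transient problems (the
# machine was low on resources) are worth retrying; permanent ones
# (corrupt media) are not, unless the code has changed since.
TRANSIENT_ERRORS = (MemoryError, IOError)   # e.g. low memory or disk
PERMANENT_ERRORS = (ValueError,)            # e.g. corrupt media file

def worth_retrying(error):
    """Guess whether a failed processing attempt is worth repeating."""
    if isinstance(error, PERMANENT_ERRORS):
        return False
    # Unknown failures get the benefit of the doubt, like transient ones.
    return True
```
<br />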
TODO: We should collect known cases of when processing failed, what it looked like. That will help us write code to determine why processing failed, and whether or not it's worthwhile to retry.<br />
<br />
=== When should we start reprocessing? ===<br />
<br />
There are two forces pushing us in different directions on this. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that not much can change between attempts, we're just wasting computing resources for little benefit. This could hurt our performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.<br />
<br />
Like scheduling code in Linux, there are a million different ways we could do this, no one system is going to be perfect for every site, and we're going to need feedback from lots of users telling us what's good and what's bad before we can make effective adjustments. With all that said, here are some ideas to consider for version 1:<br />
<br />
* Note when we last tried to process some media, and don't bother retrying more often than every X minutes. Alternatively, schedule the next time it's okay to retry the media, with some kind of algorithm like exponential backoff. <br />
<br />
* Whenever the processing queue is empty, if there's something worth retrying (based on the scheduling above), go ahead and retry it. To make sure that reprocessing happens on busy sites, we should probably also give each entry a deadline after which we force it onto the queue if it hasn't already been retried.<br />
<br />
TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance. Get feedback about the version 1 scheme, and maybe get alternative proposals.<br />
<br />
=== cwebber's vague thoughts ===<br />
<br />
<nowiki>09:09 < paroneayea> I've been thinking vaguely about a few things related to <br />
that like<br />
09:10 < paroneayea> "what if you don't have the original anymore? Does it <br />
reprocess it into something more lossy?"<br />
09:11 < paroneayea> "Should we set it up so that things can determine <br />
conditionally if they should be reprocessed? Ie, if <br />
resolutions have changed, but this one was smaller than the <br />
new lowest resolution anyway?"<br />
09:11 < paroneayea> I'm not sure what the answer to those are but I've only <br />
thought vaguely about them.</nowiki><br />
<br />
== Reprocessing implementation ==<br />
<br />
TODO: Brett needs to investigate the code to figure out which part is responsible for scheduling tasks like this, and start writing in more detail where these different pieces get implemented. (Feel free to give him hints here!)<br />
<br />
== User visibility ==<br />
<br />
After we have reprocessing code, logged in users should be able to see information about where their entries stand in the queue: it's going to be processed, it's going to be reprocessed by such-and-such time, it failed completely. There are already some bugs about this (TODO: collect them here).</div>Bretthttps://wiki.mediagoblin.org/index.php?title=Feature_Ideas/Reprocessing&diff=626Feature Ideas/Reprocessing2012-03-31T14:19:25Z<p>Brett: I'm taking point</p>
<hr />
<div>== Rationale ==<br />
<br />
Sometimes, when we try to process new media, we'll fail. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!). Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.<br />
<br />
Brett plans to work on this. If you want to help, get in touch!<br />
<br />
== Refactoring ==<br />
<br />
Before I get started writing new code, I'd like to refactor the existing processing code. All of the media type processing.py files have chunks of code that look like this:<br />
<br />
<nowiki>prepare an output file<br />
with output file:<br />
process the media<br />
save the processed version to the file</nowiki><br />
<br />
After all this is done, we save changes to the MediaEntry. The processing bit is unique, but the code for actually creating the files stays more or less the same, and I think it could be factored out to common functions in the main processing module. As a side benefit, I think separating the file handling code will give us the opportunity to make it more robust, which should help us fix issues like [http://issues.mediagoblin.org/ticket/419 #419].<br />
<br />
== Core reprocessing code ==<br />
<br />
=== When should we try to reprocess? ===<br />
<br />
Reprocessing can be more or less helpful depending on why previous processing attempts failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.<br />
<br />
TODO: We should collect known cases of when processing failed, what it looked like. That will help us write code to determine why processing failed, and whether or not it's worthwhile to retry.<br />
<br />
=== When should we start reprocessing? ===<br />
<br />
There are two forces pushing us in different directions on this. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that not much can change between attempts, we're just wasting computing resources for little benefit. This could hurt our performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.<br />
<br />
Like scheduling code in Linux, there are a million different ways we could do this, no one system is going to be perfect for every site, and we're going to need feedback from lots of users telling us what's good and what's bad before we can make effective adjustments. With all that said, here are some ideas to consider for version 1:<br />
<br />
* Note when we last tried to process some media, and don't bother retrying more often than every X minutes. Alternatively, schedule the next time it's okay to retry the media, with some kind of algorithm like exponential backoff. <br />
<br />
* Whenever the processing queue is empty, if there's something worth retrying (based on the scheduling above), go ahead and retry it. To make sure that reprocessing happens on busy sites, we should probably also give each entry a deadline after which we force it onto the queue if it hasn't already been retried.<br />
<br />
TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance. Get feedback about the version 1 scheme, and maybe get alternative proposals.<br />
<br />
=== How does this actually happen? ===<br />
<br />
TODO: Brett needs to investigate the code to figure out which part is responsible for scheduling tasks like this, and start writing in more detail where these different pieces get implemented. (Feel free to give him hints here!)<br />
<br />
=== cwebber's vague thoughts ===<br />
<br />
<nowiki>09:09 < paroneayea> I've been thinking vaguely about a few things related to <br />
that like<br />
09:10 < paroneayea> "what if you don't have the original anymore? Does it <br />
reprocess it into something more lossy?"<br />
09:11 < paroneayea> "Should we set it up so that things can determine <br />
conditionally if they should be reprocessed? Ie, if <br />
resolutions have changed, but this one was smaller than the <br />
new lowest resolution anyway?"<br />
09:11 < paroneayea> I'm not sure what the answer to those are but I've only <br />
thought vaguely about them.</nowiki><br />
<br />
== User visibility ==<br />
<br />
After we have reprocessing code, logged in users should be able to see information about where their entries stand in the queue: it's going to be processed, it's going to be reprocessed by such-and-such time, it failed completely. There are already some bugs about this (TODO: collect them here).</div>Bretthttps://wiki.mediagoblin.org/index.php?title=Feature_Ideas/Reprocessing&diff=625Feature Ideas/Reprocessing2012-03-31T14:18:26Z<p>Brett: first version</p>
<hr />
<div>== Rationale ==<br />
<br />
Sometimes, when we try to process new media, we'll fail. This could be for lots of reasons: maybe a transcoding process was killed by a crazed sysadmin, or the file is corrupted, or there might even be a bug in MediaGoblin (crazy, I know!). Right now, when this happens, the unprocessed media lives in the database forever, a zombie. Instead of that, we should periodically retry processing the media, when it makes sense. Maybe we'll have better luck next time; if we do, it'll make the user happy.<br />
<br />
== Refactoring ==<br />
<br />
Before I get started writing new code, I'd like to refactor the existing processing code. All of the media type processing.py files have chunks of code that look like this:<br />
<br />
<nowiki>prepare an output file<br />
with output file:<br />
process the media<br />
saved the processed version to the file</nowiki><br />
<br />
After all this is done, we save changes to the MediaEntry. The processing bit is unique, but the code for actually creating the files stays more or less the same, and I think it could be factored out to common functions in the main processing module. As a side benefit, I think separating the file handling code will give us the opportunity to make it more robust, which should help us fix issues like [http://issues.mediagoblin.org/ticket/419 #419].<br />
<br />
== Core reprocessing code ==<br />
<br />
=== When should we try to reprocess? ===<br />
<br />
Reprocessing can be more or less helpful depending on why previous processing attempts failed. If we failed because the machine was low on memory or disk at the time, reprocessing stands a good chance of succeeding. If we failed because the media is corrupt, reprocessing will never work unless some code has changed in the meantime.<br />
<br />
TODO: We should collect known cases of when processing failed, what it looked like. That will help us write code to determine why processing failed, and whether or not it's worthwhile to retry.<br />
<br />
=== When should we start reprocessing? ===<br />
<br />
There are two forces pushing us in different directions on this. On the one hand, the more often we retry, the sooner the user's media will appear on the site, which makes them happy. On the other hand, if we retry so often that not much can change between attempts, we're just wasting computing resources for little benefit. This could hurt our performance on deployments without resources to spare, like SheevaPlugs or Raspberry Pi systems.<br />
<br />
Like scheduling code in Linux, there are a million different ways we could do this, no one system is going to be perfect for every site, and we're going to need feedback from lots of users telling us what's good and what's bad before we can make effective adjustments. With all that said, here are some ideas to consider for version 1:<br />
<br />
* Note when we last tried to process some media, and don't bother retrying more often than every X minutes. Alternatively, schedule the next time it's okay to retry the media, with some kind of algorithm like exponential backoff. <br />
<br />
* Whenever the processing queue is empty, if there's something worth retrying (based on the scheduling above), go ahead and retry it. To make sure that reprocessing happens on busy sites, we should probably also give each entry a deadline after which we force it onto the queue if it hasn't already been retried.<br />
<br />
TODO: Discuss (at a meeting?) general priorities about how we want to balance "users see their media ASAP" vs. general site performance. Get feedback about the version 1 scheme, and maybe get alternative proposals.<br />
<br />
=== How does this actually happen? ===<br />
<br />
TODO: Brett needs to investigate the code to figure out which part is responsible for scheduling tasks like this, and start writing in more detail where these different pieces get implemented. (Feel free to give him hints here!)<br />
<br />
=== cwebber's vague thoughts ===<br />
<br />
<nowiki>09:09 < paroneayea> I've been thinking vaguely about a few things related to <br />
that like<br />
09:10 < paroneayea> "what if you don't have the original anymore? Does it <br />
reprocess it into something more lossy?"<br />
09:11 < paroneayea> "Should we set it up so that things can determine <br />
conditionally if they should be reprocessed? Ie, if <br />
resolutions have changed, but this one was smaller than the <br />
new lowest resolution anyway?"<br />
09:11 < paroneayea> I'm not sure what the answer to those are but I've only <br />
thought vaguely about them.</nowiki><br />
<br />
== User visibility ==<br />
<br />
After we have reprocessing code, logged in users should be able to see information about where their entries stand in the queue: it's going to be processed, it's going to be reprocessed by such-and-such time, it failed completely. There are already some bugs about this (TODO: collect them here).</div>Bretthttps://wiki.mediagoblin.org/index.php?title=Feature_Ideas&diff=624Feature Ideas2012-03-31T13:28:03Z<p>Brett: add link for new idea page for media reprocessing</p>
<hr />
<div>== Introduction ==<br />
<br />
There are many features that one can think of for MediaGoblin. Some should be implemented really soon, because they are needed right now. Other features would be nice to have, but are currently really hard to implement. And finally there are the Feature Ideas that can be classified as "brainstorming".<br />
<br />
This wiki page is mostly for long term feature ideas. This specifically means there are no promises that anything listed here will ever happen. It means nobody is currently working on this feature.<br />
<br />
If you have an idea for a new feature that is not listed here or in the Bug Tracker, please talk to some developers, or add it below in the "Yet Unsorted Ideas" section. If you really think your idea is extremely important and needs to be acted upon soon, you could file a bug.<br />
<br />
== The List ==<br />
If there is a bug (closed or open), please link to it.<br />
<br />
=== Yet Unsorted Ideas ===<br />
Put your new ideas here:<br />
* [[Feature_Ideas/Flickr_Import|flickr Import]]<br />
* [[User:Aleksejrs/ideas/federation|Two federation ideas]]<br />
* [[Feature_Ideas/Reprocessing|Retry media processing]]<br />
* Copy (some) metadata from the full‐size image into the smaller versions. If possible (according to metadata formats), add a note to them that they are not exactly the original.<br />
** <del>[http://issues.mediagoblin.org/ticket/381 #381]: exif data handling for users (about privacy)</del><br />
** <del>[http://issues.mediagoblin.org/ticket/284 #284]: Support "Orientation" EXIF tag</del><br />
* Display more info about the file on its page<br />
** Resolution of the original<br />
** File size of the original<br />
** Resolution of the scaled-down version<br />
** EXIF info?<br />
** For non-still-images:<br />
*** Duration<br />
*** Bitrate<br />
*** Format structure (e.g. “Ogg/Theora+Vorbis”)<br />
* Renaming of an account<br />
** by its user<br />
** by an admin<br />
* Account creation / activation, considering e-mail address<br />
** Expire inactive accounts<br />
** A method of account activation: ask the user to send e-mail message with a specified text from the address they entered, instead of GMG sending a message to that address (which could be somebody else’s).<br />
*** What about faked addresses? If that’s a serious problem for this, could it still be a requirement? Or is it so serious that it would bother malicious misusers least of all people? --[[User:Aleksejrs|Aleksejrs]] 15:17, 7 December 2011 (EST)<br />
** The above allows accepting multiple inactive accounts for the same e-mail address without bothering its owner and without making it difficult for him to register.<br />
*** But accounts could have the same name… --[[User:Aleksejrs|Aleksejrs]] 15:14, 7 December 2011 (EST)<br />
<br />
=== Security related ideas / Features ===<br />
* DONE: CSRF ([http://bugs.foocorp.net/issues/361 #361])<br />
* <code>X-Content-Type-Options: nosniff</code><br />
*: Served pages have the content-type set. And the browser should not be allowed to guess a different type. See: [https://bugzilla.mozilla.org/show_bug.cgi?id=471020 Firefox bug #471020]<br />
* "Content Security Policy" (CSP) might really be a good add-on to have. No one should rely solely on it, but it might make things a lot safer if other security guards fail.<br />
*: A simple allow 'self' policy might already improve things a lot.<br />
*: [https://developer.mozilla.org/en/Security/CSP/Introducing_Content_Security_Policy Link1] [https://developer.mozilla.org/en/Security/CSP/CSP_policy_directives#options Link2]<br />
* Possibly disallowing pages to be shown in frames.<br />
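One way to apply all of these ideas in one place would be a small piece of WSGI middleware; this is a hypothetical sketch with example header values, not a vetted policy or actual MediaGoblin code:<br />
<br />
```python
# Example security headers matching the ideas above; the CSP value in
# particular is a placeholder, not a recommended site policy.
SECURITY_HEADERS = [
    ('X-Content-Type-Options', 'nosniff'),
    ('Content-Security-Policy', "default-src 'self'"),
    ('X-Frame-Options', 'DENY'),  # disallow showing pages in frames
]

def add_security_headers(app):
    """Wrap a WSGI app so every response carries the headers above."""
    def wrapped(environ, start_response):
        def start(status, headers, exc_info=None):
            return start_response(status, headers + SECURITY_HEADERS,
                                  exc_info)
        return app(environ, start)
    return wrapped
```
<br />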
<br />
=== Long term things that ''might'' happen ===<br />
* "trans-tagging": Adding tags to other people's media [http://bugs.foocorp.net/issues/584 #584]<br />
** [[Many images usecase#Crowd tagging/captioning/commenting]]<br />
* Branding: There should be some simple settable options to "personalize" the instance without fiddling with templates<br />
** [http://bugs.foocorp.net/issues/613 #613]: Make the base of page titles customizable.</div>Bretthttps://wiki.mediagoblin.org/index.php?title=Feature_Ideas&diff=623Feature Ideas2012-03-31T13:25:35Z<p>Brett: note finished bugs about metadata display</p>
<hr />
<div>== Introduction ==<br />
<br />
There are many features that one can think of for MediaGoblin. Some should be implemented really soon, because they are needed right now. Other features would be nice to have, but are currently really hard to implement. And finally there are the Feature Ideas that can be classified as "brainstorming".<br />
<br />
This wiki page is mostly for long term feature ideas. This specifically means there are no promises that anything listed here will ever happen. It means nobody is currently working on this feature.<br />
<br />
If you have an idea for a new feature that is not listed here or in the Bug Tracker, please talk to some developers, or add it below in the "Yet Unsorted Ideas" section. If you really think your idea is extremely important and needs to be acted upon soon, you could file a bug.<br />
<br />
== The List ==<br />
If there is a bug (closed or open), please link to it.<br />
<br />
=== Yet Unsorted Ideas ===<br />
Put your new ideas here:<br />
* [[Feature_Ideas/Flickr_Import|flickr Import]]<br />
* [[User:Aleksejrs/ideas/federation|Two federation ideas]]<br />
* Copy (some) metadata from the full‐size image into the smaller versions. If possible (according to metadata formats), add a note to them that they are not exactly the original.<br />
** <del>[http://issues.mediagoblin.org/ticket/381 #381]: exif data handling for users (about privacy)</del><br />
** <del>[http://issues.mediagoblin.org/ticket/284 #284]: Support "Orientation" EXIF tag</del><br />
* Display more info about the file on its page<br />
** Resolution of the original<br />
** File size of the original<br />
** Resolution of the scaled-down version<br />
** EXIF info?<br />
** For non-still-images:<br />
*** Duration<br />
*** Bitrate<br />
*** Format structure (e.g. “Ogg/Theora+Vorbis”)<br />
* Renaming of an account<br />
** by its user<br />
** by an admin<br />
* Account creation / activation, considering e-mail address<br />
** Expire inactive accounts<br />
** A method of account activation: ask the user to send e-mail message with a specified text from the address they entered, instead of GMG sending a message to that address (which could be somebody else’s).<br />
*** What about faked addresses? If that’s a serious problem for this, could it still be a requirement? Or is it so serious that it would bother malicious misusers least of all people? --[[User:Aleksejrs|Aleksejrs]] 15:17, 7 December 2011 (EST)<br />
** The above allows accepting multiple inactive accounts for the same e-mail address without bothering its owner and without making it difficult for him to register.<br />
*** But accounts could have the same name… --[[User:Aleksejrs|Aleksejrs]] 15:14, 7 December 2011 (EST)<br />
<br />
=== Security related ideas / Features ===<br />
* DONE: CSRF ([http://bugs.foocorp.net/issues/361 #361])<br />
* <code>X-Content-Type-Options: nosniff</code><br />
*: Served pages have the content-type set. And the browser should not be allowed to guess a different type. See: [https://bugzilla.mozilla.org/show_bug.cgi?id=471020 Firefox bug #471020]<br />
* "Content Security Policy" (CSP) might really be a good add-on to have. No one should rely solely on it, but it might make things a lot safer if other security guards fail.<br />
*: A simple allow 'self' policy might already improve things a lot.<br />
*: [https://developer.mozilla.org/en/Security/CSP/Introducing_Content_Security_Policy Link1] [https://developer.mozilla.org/en/Security/CSP/CSP_policy_directives#options Link2]<br />
* Possibly disallowing pages to be shown in frames.<br />
<br />
=== Long term things that ''might'' happen ===<br />
* "trans-tagging": Adding tags to other people's media [http://bugs.foocorp.net/issues/584 #584]<br />
** [[Many images usecase#Crowd tagging/captioning/commenting]]<br />
* Branding: There should be some simple settable options to "personalize" the instance without fiddling with templates<br />
** [http://bugs.foocorp.net/issues/613 #613]: Make the base of page titles customizable.</div>Bretthttps://wiki.mediagoblin.org/index.php?title=Talk:HackingHowto&diff=622Talk:HackingHowto2012-03-31T13:18:47Z<p>Brett: note need for python-sqlalchemy* on Debian</p>
<hr />
<div>With the SQL transition underway, on my Debian stable system, I had to install the <tt>python-sqlalchemy</tt> and <tt>python-sqlalchemy-ext</tt> packages to make MediaGoblin go. IRC advised me to get them from backports, which I did, and that's worked out, although there are packages in stable too -- not sure if they're new enough. Either way, should instructions about this be added to the documentation now? Or is it a temporary transition thing? --[[User:Brett|Brett]] 09:18, 31 March 2012 (EDT)</div>Brett