Current Version: 0.79

January 13th, 2016

Google Summer of Code 2014

This page is no longer maintained. Please go to the new one.

This year we are a mentoring organization in Google's Summer of Code.

If after reading this you have questions, please check our FAQ.

The list of accepted students for all projects has been released by Google. We've been very lucky and landed 3 slots, which we assigned to 3 excellent students with excellent proposals and bug hunting skills.

Anshul, Ruslan and Willem will be working during the summer to make CCExtractor, already the open source reference tool for closed captioning, feature complete.

The rest of this page is no longer up to date, but we're going to leave it up as it might have value to those who applied but didn't get in, or plan to apply next year. Feel free to check it out (particularly the exercises).

By the way, if you didn't get accepted, you can still join us in the coding effort if you feel like enjoying a summer of coding.


Some of you solved the garbling issue with the now famous Casino video linked on our FAQ page. As we mentioned, this gives you lots of points, as it shows two of the skills we need: patience and analytical thinking. For this week, we have a new exercise. Please try to figure it out, even if you didn't solve the first one (by the way, if you didn't, there's still time; you can find it on the FAQ page). The effort made here will pay off (even if you don't solve it).

So here's the problem. We've seen a number of video files with a time discontinuity. For some unknown reason there's a jump in time in the middle of the file, causing the resulting transcript to get out of sync with the video. Goal: Find out why, and if possible solve it.

Example file (source video)
Example file (current transcript)
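
A possible first step for this exercise is to locate exactly where the timeline jumps. The following is a minimal sketch, assuming the presentation timestamps have already been pulled out of the file as integer ticks of the 90 kHz MPEG clock. The function name and the one-second threshold are hypothetical and are not part of CCExtractor; the sketch only finds the jumps, it does not explain why they happen.

```python
PTS_CLOCK_HZ = 90_000  # MPEG presentation timestamps tick at 90 kHz


def find_time_jumps(pts_values, max_gap_seconds=1.0):
    """Return (index, gap_in_seconds) for every timestamp that goes
    backwards or moves forward by more than max_gap_seconds relative to
    the previous one. The threshold is an assumption, tune it per file."""
    jumps = []
    for i in range(1, len(pts_values)):
        gap = (pts_values[i] - pts_values[i - 1]) / PTS_CLOCK_HZ
        if gap < 0 or gap > max_gap_seconds:
            jumps.append((i, gap))
    return jumps


if __name__ == "__main__":
    # Frames 3003 ticks apart (about 29.97 fps) with a 10 second jump
    # before the fourth timestamp.
    sample = [0, 3003, 6006, 906006, 909009]
    print(find_time_jumps(sample))  # prints [(3, 10.0)]
```

Once the positions are known, the transcript timing around each reported index can be compared against the video to see whether the captions or the video clock moved.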

Additionally, what else can you do to improve your chances of being accepted?
Here are a few sample exercises (please add your answers to your proposal and then email us so we can check them out):

1 - Write a small script (any language) that returns the names of all functions in the program (for the CVS version).
2 - The function names are a mess. How would you rename each function so the new names (including capitalization and so on) are both meaningful and consistent?
3 - Compile the program. There are a few (harmless) warning messages (the specific messages vary from compiler to compiler). What’s the best way to solve each of them for your specific compiler?
4 - Find as many subtitle standards from around the world as you can. CCExtractor currently supports CEA-608, CEA-708 and teletext. What other formats are there (both broadcast and physical media) and where would you start if we were to add support for each of them? Be specific.
5 - Are you able to craft an input file that makes CCExtractor crash? If yes, how would you fix CCExtractor so it doesn't crash on that file?
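
To give an idea of the scale expected for exercise 1, here is a minimal sketch of one possible approach in Python. It is a regular-expression heuristic, not a C parser: it assumes function definitions start at column 0 and have a simple parameter list (no function pointer parameters), and the names `FUNC_DEF`, `KEYWORDS` and `function_names` are made up for this illustration. A real parser or a tool such as ctags would be more robust.

```python
import re

# Heuristic: a definition starts at column 0 with a return type, followed
# by the function name, a parameter list without nested parentheses, and
# an opening brace (on the same line or the next one).
FUNC_DEF = re.compile(
    r"^[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;{}()]*\)\s*\{",
    re.MULTILINE,
)

# Control keywords that can look like a call followed by a brace.
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def function_names(source):
    """Return the names of the function definitions found in C source text."""
    return [
        match.group(1)
        for match in FUNC_DEF.finditer(source)
        if match.group(1) not in KEYWORDS
    ]


if __name__ == "__main__":
    sample = """
int add(int a, int b);

static int add(int a, int b)
{
    if (a > b) {
        return a;
    }
    return a + b;
}

void print_usage(void) {
}
"""
    print(function_names(sample))  # prints ['add', 'print_usage']
```

The prototype on the first line is skipped because it ends in a semicolon rather than a brace, and the indented `if` is skipped because the pattern only starts at column 0.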

What else?
1 - Be more specific in your proposal. By specific we mean detailed in your plans. For example, if your proposal is about doing multituner: What tuner (brand, model) works in your country? What parts of CCExtractor code will you need to change? How? Do you plan to start a process per tuner, use multithreading, maybe socket select? If multithreading, how will you avoid races? If multiprocess, how will you send updates to the GUI? There are many good ways to implement multituner, but each solution brings its own set of problems. Analyze the options and explain the one you prefer.
2 - Reconsider your timeline. We've seen some proposals in which a whole week is spent setting up the development environment, or two weeks on documentation, etc. That's not realistic. While we understand that you don't want to overpromise, if you are too conservative with your schedule it just looks like you will spend a few minutes a day on CCExtractor instead of 8 hours or so. This is a paid job, and must be taken as such.
3 - Even if you think there isn't time to add more tasks for the ideas below, think how you would implement most of them if there was time and explain.
4 - Write somewhere that you are OK doing things different than the ones you proposed. This is because if we get 2 slots we will pick the 2 best students, even if they proposed to do the same things. So we will split the work and you might end up doing something different than planned.
5 - Tell us about your planned absences. Vacations, going to a wedding, exams, whatever. We don't need details on what you will be doing when you are not around, but we need to know when you are not going to be around.
6 - Write in the proposal your planned work schedule (days of the week you will be working and times of the day). Remember to mention your local times. On days of the week: It should be 5 days, but it can be any 5. It's fine if you work on Sundays but not on Fridays, for example.
7 - Proofread your proposal. Grammar, spelling, etc. are important. We don't expect you to be Shakespeare, but not running a spell checker says a lot about you.
8 - If you applied to more GSOC orgs please let us know. You don't have to, but it will make our life easier when it's time to resolve duplicates in a couple of weeks.
9 - Lots of proposals include the GUI (often, just the GUI). We can't stress this enough: Just the GUI is not going to be enough, period. The GUI should take one week (or two, if you are learning while doing it) at the most. Keep in mind that you can just copy the UI from the Windows version (for which the source code is available). If you want to do the GUI and you have free time now, you can make a preliminary version and submit it. It will also give you a few points. Note: If it's good we would release it even if you are not accepted.
10 - GSOC is about code. If you set aside any time for documentation in your proposal, remove it. Code should be self-documented and that documentation is written while coding. So it doesn't make sense to put time aside just for documentation.
11 - Tell us a bit about your work environment. Will you be working from home, from a library...? How's the internet connection? Are there outages?

Below - the original ideas page and project explanation, for reference.

Yes! We've been accepted by Google. Of course, this is an incredible chance to give CCExtractor -already the open source reference implementation for closed captioning- a push that gets it everywhere where a closed captioning tool is needed.
This page is still the ideas page, but it will evolve quickly to provide as much information as possible on the tasks. If you are a student interested in video and/or subtitles, by all means apply to GSOC. It's an amazing chance to spend the summer programming on the most interesting open source projects while getting paid, and also to get a cool point on your resume.

Our motivation to apply for GSOC as an organization
As written in our GSOC application: If you've ever downloaded a .srt file to get subtitles for anything, most likely CCExtractor was used to produce it.

To be clear, while CCExtractor's job is to produce these transcript files from video streams, this year we need students interested in a number of more generic things, such as multithreading, networking... Having an interest in subtitles in particular, or in video encoding, is a plus, but not absolutely needed.

The goal of our GSOC application is to make the best possible closed captioning tools available for everyone.

While closed captioning used to be a niche area, now that everyone can create and distribute content, subtitles are more important than ever, yet most tools are proprietary. The new closed captioning rules approved by the FCC make a good closed captioning solution even more necessary. While expensive professional tools exist, there's no other open source tool that comes close to CCExtractor's features.

There's a lot of very interesting things to do, and more ideas are welcome. We encourage both students and CCExtractor users to submit their own suggestions.

GSOC related questions (both before and during the actual GSOC period) take the highest priority for CCExtractor's developers and mentors.

Applying as a student
Everything is done via GSOC's website (your proposal must be sent there, for example). First, start by reading their FAQ if you haven't done it already.
As the FAQ says, you are encouraged to get in touch with the organizations before sending your application. Make sure you contact us with ideas, suggestions, plan... ask questions about the code or anything else you'd like to know before applying.
Important - we are replying quickly (as soon as we see the email) to everyone, but asking things that are in this document or in GSOC's FAQ doesn't look too good :-)

Here are some of the tasks we would love to do over the summer. The difficulty is based on both the required functionality itself and how hard we think it will be to refactor CCExtractor to add it. It does not correlate to the number of hours it will take; some tasks are easier but possibly longer than others.

Real time uploading
Closed captions are used by a number of data aggregation companies as one of the sources they add to their data pool. They use the information to correlate media appearances with things like stock. While doing that for Twitter (to mention just one) is easy because there's even an API, it's a lot more difficult to do it for closed captions. We want CCExtractor to be able to upload data as it's produced so it can be used for real time analysis.
The task includes everything from defining the protocol to making the changes in CCExtractor and writing a reference implementation of the data receiver.

Worldwide repository
Building on the previous task, it makes sense to aggregate all available caption data. A problem with captioning is of course that you need to be able to receive the TV signal, which makes it impossible for any individual entity to capture data from all TV channels in the world. We want to allow everyone with the required hardware to contribute to a global effort; the task is to build a scalable system that is able to receive data from a large number of providers. Google's App Engine would be a good option to start building.
Difficulty: Hard

Finish CEA-708 support
CEA-708 (also known as EIA-708) is the "new" standard for closed captioning. While the specification has been around for some years and support for it is mandatory in the US for both TV receivers and stations, until very recently almost all stations have just converted their CEA-608 data to 708; this means that none of the 708 features have actually been used, and you still see many captions in all uppercase (to mention just one thing). This is starting to change though, so it makes sense for CCExtractor to fully implement a 708 decoder. Some work was done already, and you can actually see 708 output in CCExtractor in debug mode. But it needs to be completed by adding the actual export features.
Difficulty: Medium

DVB Subtitles
DVB subtitles are bitmap based subtitles used in Europe (there's also teletext there, which comes from the analog world). For now CCExtractor is unable to extract them. However, because specifications are freely available, because there are some open source implementations we can use as a reference (such as Project X) and because part of the work is common to everything else (i.e. we can use what we already have to get to the DVB data, we just need to process it) this shouldn't be extremely difficult.
Difficulty: Medium-Hard

Create an OSX GUI
CCExtractor is a console program, which is very convenient for a number of things, particularly those involving anything automatic. It also makes it harder for regular users to use on their desktops. For Windows we just have a nice GUI that calls the console program. We'd like to have a GUI written for OSX as well.
Difficulty: Easy

Create a Linux GUI
Same as for OSX.
Difficulty: Easy

CC insertion
CCExtractor is able to extract the subtitles from almost anything you throw at it: MPEG2, H264... in MP4, TS... doesn't matter if it's teletext or closed captions, or if the media is from Europe, Australia or North America. If the data is there, chances are CCExtractor can give it to you in a text file. However, what CCExtractor cannot yet do is insert data into existing files, i.e. add captions where there aren't any. The job here: Be able to take an uncaptioned media file and insert the CC data.
Difficulty: Very hard

New standards
CEA-608 is the old (analog) standard for closed captions. CEA-708 is the standard for digital TV. But what about the standard for internet media? CCExtractor has some basic support for TTML, but there are other emerging standards that we should support as well, both for input and for output.
Difficulty: Medium

Multi program
In the digital world, a number of programs are transmitted simultaneously (multiplexed) in a single channel. The tuner receives all those programs, and then the receiver filters the one the user wants to watch, discarding all the others. CCExtractor does this too: if a stream contains more than one program you have to pick one. The goal is to modify CCExtractor so it's able to process all programs in the stream at the same time, generating the transcript for all of them in one pass.
Difficulty: Medium

Multi channel
A number of TV devices (most famously HDHomeRun) come with more than one tuner (some models as many as 6), allowing the user to watch several channels at the same time. CCExtractor is able to receive the data from HDHomeRun directly (no need for intermediate files) but it only listens to one tuner. The job: Modify CCExtractor so it's able to listen to any number of tuners at the same time.
Difficulty: Medium

Test suite
We have a reasonably decent collection of samples of all kinds, from a number of sources. Often, fixing a problem that appears in just one sample breaks something else. We need to automate the tests so we can easily compare the output of different CCExtractor versions and get useful reports (which files changed and what).
Difficulty: Medium
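
For the test suite idea, the comparison step (which files changed and what) could look roughly like the sketch below. It assumes each CCExtractor version has already been run over the sample collection and has written its transcripts into its own directory; the function names and the directory layout are hypothetical, and a real test suite would also need to run the binaries and handle timeouts and crashes.

```python
import difflib
from pathlib import Path


def load_transcripts(directory):
    """Read every file in `directory` into a {file_name: list_of_lines} dict."""
    return {
        path.name: path.read_text(encoding="utf-8", errors="replace").splitlines()
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }


def compare_transcripts(old, new):
    """Return {file_name: unified_diff_lines} for every transcript that
    differs between the two runs. A file missing from one run is treated
    as empty on that side."""
    report = {}
    for name in sorted(set(old) | set(new)):
        old_lines = old.get(name, [])
        new_lines = new.get(name, [])
        if old_lines != new_lines:
            report[name] = list(difflib.unified_diff(
                old_lines, new_lines,
                fromfile="old/" + name, tofile="new/" + name, lineterm=""))
    return report
```

A report for two runs would then be built with `compare_transcripts(load_transcripts("out_old"), load_transcripts("out_new"))`, where both directory names are placeholders. Working on plain dictionaries keeps the comparison easy to test without touching the sample repository.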

On top of these CCExtractor-only features, there's some cross-project work we are considering.

ffmpeg integration
As you may know, CCExtractor has built-in parsers for everything. Zero dependencies. Quite convenient, but of course it means that except for ports we aren't building on the work of giants like ffmpeg. We would like to -optionally- build CCExtractor using ffmpeg's parsers instead of the internal ones. This will probably require some changes in ffmpeg itself (because as far as we know it doesn't allow getting the subtitle data even in raw format). Your job would be to make changes in both projects and have them added to the mainstream versions.
Difficulty: Probably very hard

Library-ize CCExtractor
CCExtractor has a quite robust CEA-608 decoder that could be used by any other program. However, in the way it's currently packed, such a program would need to copy and paste some of our init code, have some global variables... so as you can see, it's not really a library you can just embed into a 3rd party program. The job: Refactor it, and produce a reference program that builds on the refactored code to process a sample file.
Difficulty: Medium

Useful things to know about the mentors:

- The technical mentors are all part of the development team. So you get access to the people behind the project.
- We are active in our local developer communities such as GDG (Google Developer Groups).
- We attend developer events often, big (such as I/O) and small (such as local DevFests). If you see us at one, by all means introduce yourself.
- We also have a tight relationship with Google Spain, which often sponsors our activities.
- We are also Glass Explorers, always looking for ways to make Glass a part of our subtitle world.
- The "user" mentor (guiding new useful features in the "real world" and providing access to video streams from around the world) is in charge of UCLA's video archive.

Resources for CCExtractor work:
Of course, the source code of CCExtractor has always been available. So you can already look around.
Before you start working -if you really, really need it now get in touch- we will give you access to our video sample repository. This is a collection of videos that work well (and must continue to work well after any code changes) and videos that don't work for any reason and that we want to support.

The mentors (with links to LinkedIn; feel free to request a connection if you have a genuine interest in CCExtractor and GSOC, and mention it in the request):
Carlos Fernández Sanz, CCExtractor original developer and still maintainer. For everything related to CCExtractor internals or CC data formats.
Ander Martínez, works with Carlos in a number of projects. Go-to guy for Android, Glass, and wearable devices in general.
Andreu Ibañez, a GSOC veteran with a number of quite successful projects. A mentor of mentors :-)
David Liontooth, runs the CCExtractor operation at UCLA. He provides a real life complex environment for CCExtractor, with many TV stations from around the world being processed in real time.