Current Version: 0.79

January 13th, 2016

Google Summer of Code 2015

This page is no longer maintained. Please go to the new one.

Talk about Summer of Code at Juan Carlos I University, March 5, with successful mentors from past years. It's in Spanish but we'll be happy to answer questions in English or talk afterwards. G+ event

We're in again! If you are here as a student considering applying with us, please read on. If after reading this page and the FAQ page you have questions of any kind, please contact Carlos at

Getting accepted - best practices (general).


This will apply to most projects. Details specific to CCExtractor come later.
1) There are several pages on getting accepted to different GSOC projects. Make sure you read them.
2) Get in touch early. While you can only submit your proposal in a specific window you can contact organizations and mentors at any time.
3) If you are already doing something for the project you want to join, by all means let the other developers check it out. Don't "save the best" for GSOC! There are always things to do, so if you can impress the selection committee (whoever they are in the project you want to join) make sure you do it when it counts.
4) Proposals are not exams. You are not expected to just write the best one you can and hope for the best. Include your organization in preparing the proposal. Submit a draft, request feedback, improve and so on. Everyone wants the best possible proposals from the best possible students. Mentors will be happy to steer you in the right direction.
5) Don't overpromise. Don't underpromise. Be realistic with what you are able to do with 3 months of your time working full time.

Getting accepted - sample exercises (CCExtractor).


While your proposal is the document that describes what you plan to do, and its contents are what you (mostly) commit to, the best way to get on the short list is to show what you can do before mentors start selecting proposals. After all, a proposal is a document and everything compiles and runs on paper :-)
Last year we proposed a few exercises to students. A couple of them were moderately hard (even though they required patience and analytic skills, not knowledge of CCExtractor's internals). Three students solved them and those students got all slots. Of course their proposals were good too (also a requirement), but their determination to solve our challenges is what made a difference.
This year we have a few more exercises. They are optional, but solving them would earn you major points.

1 - Write a small script (any language) that returns the names of all functions in the program (for the github version).
2 - Compile CCExtractor. There are a few (harmless) warning messages (the specific messages vary from compiler to compiler). What’s the best way to solve each of them for your specific compiler?
3 - Find as many subtitle standards from around the world as you can. CCExtractor currently supports CEA-608, CEA-708, teletext and DVB. What other formats are there (both broadcast and physical media) and where would you start if we were to add support for each of them? Be specific.
4 - Are you able to craft an input file that makes CCExtractor crash? If yes, how would you fix CCExtractor so it doesn't crash on that file?
5 - The following known issues have been around for a bit. Are you able to solve any of them? (Note: always use the current code in github, don't start with the last stable version.) Garbled output (1). Garbled output (2). MP4 detection. Garbled output (3).

Getting accepted - Your proposal.

1 - Be more specific in your proposal. By specific we mean detailed in your plans. For example, if your proposal is about doing multituner [this is one of the ideas below]: What tuner (brand, model) works in your country? What parts of CCExtractor code will you need to change? How? Do you plan to start a process per tuner, use multithreading, maybe socket select? If multithreading, how will you avoid races? If multiprocess, how will you send updates to the GUI? There are many good ways to implement multituner, but each solution brings its own set of problems. Analyze the options and explain the one you prefer.
2 - Reconsider your timeline. We've seen some proposals in which a whole week is spent setting up the development environment. Or two weeks on documentation, etc. That's not realistic. While we understand that you don't want to overpromise, if you are too conservative in your schedule it just looks like you will spend a few minutes a day on CCExtractor instead of 8 hours or so. This is a paid job, and must be taken as such.
3 - Even if you think there isn't time to add more tasks for the ideas below, think how you would implement most of them if there was time and explain.
4 - Write somewhere that you are OK doing things different than the ones you proposed. This is because if we get 2 slots we will pick the 2 best students, even if they proposed to do the same things. So we will split the work and you might end up doing something different than planned.
5 - Tell us about your planned absences. Vacations, going to a wedding, exams, whatever. We don't need details on what you will be doing when you are not around, but we need to know when you are not going to be around.
6 - Write in the proposal your planned work schedule (days of the week you will be working and times of the day). Remember to mention your local times. On days of the week: It should be 5 days, but it can be any 5. It's fine if you work on Sundays but not on Fridays, for example.
7 - Proofread your proposal. Grammar, spelling, etc. are important. We don't expect you to be Shakespeare but not running a spell checker says a lot about you.
8 - If you applied to more GSOC orgs please let us know. You don't have to, but it will make our life easier when it's time to resolve duplicates.
9 - Last year, lots of proposals included the OSX/Linux GUI (often, just the GUI). We can't stress this enough: Just the GUI is not going to be enough, period. The GUI should take one week, or at the most two if you are learning while doing it. Keep in mind that you can just copy the UI from the Windows version (for which the source code is available). If you want to do the GUI and you have free time now, you can make a preliminary version and submit it. It will also give you a few points. Note: If it's good we would release it even if you are not accepted.
10 - GSOC is about code. If you set aside any time for documentation in your proposal, remove it. Code should be self-documented and that documentation is written while coding. So it doesn't make sense to put time aside just for documentation.
11 - Tell us a bit about your work environment. Will you be working from home, from a library...? How's the internet connection? Are there outages?
12 - Make sure you read all the general GSOC documentation. GSOC's methodology is well tested. It works.
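
Item 1 above asks how a multituner proposal would handle several tuners: a process per tuner, threads, or a socket select loop. As a rough illustration of the last option only, here is a minimal Python sketch that waits on several sockets from a single thread. Everything in it is an assumption made for the example: the local socket pairs merely stand in for tuner feeds, the names (read_ready, demo) are hypothetical, and none of this is CCExtractor code.

```python
import selectors
import socket


def read_ready(sel, timeout=1.0):
    """Return (tuner_id, payload) for every registered socket that has data."""
    chunks = []
    for key, _events in sel.select(timeout):
        payload = key.fileobj.recv(65535)
        chunks.append((key.data, payload))
    return chunks


def demo(tuner_count=2):
    """Simulate several tuner feeds and read them all from one thread."""
    sel = selectors.DefaultSelector()
    # Each socket pair is a stand-in for one tuner: we write on one end
    # and the "extractor" reads on the other.
    pairs = [socket.socketpair() for _ in range(tuner_count)]
    for tuner_id, (reader, _writer) in enumerate(pairs):
        reader.setblocking(False)
        sel.register(reader, selectors.EVENT_READ, data=tuner_id)

    for tuner_id, (_reader, writer) in enumerate(pairs):
        writer.sendall(b"tuner-%d" % tuner_id)

    received = {}
    for _ in range(10):  # bounded, so the demo cannot hang forever
        for tuner_id, payload in read_ready(sel):
            received[tuner_id] = received.get(tuner_id, b"") + payload
        if len(received) == tuner_count:
            break

    sel.close()
    for reader, writer in pairs:
        reader.close()
        writer.close()
    return received


if __name__ == "__main__":
    print(demo())
```

A single select loop like this avoids shared-state races because only one thread touches the data, but one slow consumer can stall every tuner. A proposal should weigh that against the process-per-tuner and multithreaded options.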


THE IDEAS

Here are some of the tasks we would love to do over the summer. The difficulty is based on both the required functionality itself and how hard we think it will be to refactor CCExtractor to add it. It does not correlate to the number of hours it will take; some tasks are easier but possibly longer than others.

By the way, these are our ideas. You may have different ones. Summer of Code is mostly about the students; if you prefer to do something else you can. Just propose it. Feel free to get in touch if you want early feedback on your plans. Doing something different than what we suggest doesn't lower your chances of getting in and may raise the chance that you are highly motivated and enjoy the summer.

Title: Real time uploading
Description: Closed captions are used by a number of data aggregation companies as one of the sources they add to their data pool. They use the information to correlate appearances on media to things like stock. While doing that for Twitter (to mention just one) is easy because there's even an API, it's a lot more difficult to do it for closed captions. We want CCExtractor to be able to upload data as it's produced so it can be used for real time analysis. A working proof of concept was done last year by Ruslan.
Expected outcome: Building on Ruslan's work, complete this job by adding real time uploading to all protocols.
Difficulty: Medium

Title: Worldwide repository
Description: Building on the previous task, it makes sense to aggregate all available caption data. A problem with captioning is of course that you need to be able to receive the TV signal, which makes it impossible for any individual entity to capture data from all TV channels in the world. We want to allow everyone with the required hardware to be able to contribute to a global effort; the task is to build a scalable system that is able to receive data from a large number of providers. Google's App Engine would be a good option to start building. Note: A proof of concept exists, using LAMP.
Expected outcome: Prepare and deploy a complete system that scales not just in theory (for example, create load simulators). We will spend the organization funds on renting test machines.
Difficulty: Medium

Title: Finish CEA-708 support
Description: CEA-708 (also known as EIA-708) is the "new" standard for closed captioning. While the specification has been around for some years and support for it is mandatory in the US for both TV receivers and stations, until very recently almost all stations have simply converted their CEA-608 data to 708. This means many 708 features are rarely used; for instance, you still often see captions in all uppercase. This is starting to change though, so it makes sense for CCExtractor to fully implement a 708 decoder. Some work was done already, and you can actually see 708 output in CCExtractor in debug mode. But it needs to be completed by adding the actual export features.
Expected outcome: We have hundreds of samples with 708 subtitles. The goal is to generate perfect transcripts for all of them, or when that's not possible (for example because there are overlapping windows that cannot be represented in a plain text file) detect and report the problem.
Difficulty: Medium

Title: Create an OSX GUI
Description: CCExtractor is a console program, which is very convenient for a number of things, particularly those involving anything automatic. It also makes it harder for regular users to use on their desktops. For Windows we have a nice GUI that simply calls the console program. We'd like to have a GUI written for OSX as well.
Expected outcome: An OSX GUI that is as easy to use as the Windows counterpart.
Difficulty: Easy

Title: Create a Linux GUI
Description: Same as for OSX.
Expected outcome: A Linux GUI that is as easy to use as the Windows counterpart.
Difficulty: Easy

Title: CC insertion
Description: CCExtractor is able to extract the subtitles from almost anything you throw at it: MPEG2, H264... in MP4, TS... It doesn't matter if it's teletext or closed captions, or if the media is from Europe, Australia or North America. If the data is there, chances are CCExtractor can give it to you in a text file. However, what CCExtractor cannot yet do is insert data into existing files, i.e. add captions where there aren't any. The job here: Be able to take an uncaptioned media file and insert the CC data.
Expected outcome: Ability to take an input video file and a timed transcript file, and output the same video with closed captioning embedded.
Difficulty: Very hard

Title: New standards
Description: CEA-608 is the old (analog) standard for closed captions. CEA-708 is the standard for digital TV. But what about the standard for internet media? CCExtractor has some basic support for TTML, but there are other emerging standards that we should support as well, both for input and for output.
Expected outcome: Added support for the most promising new standards.
Difficulty: Medium

Title: Multi program
Description: In the digital world, a number of programs are transmitted simultaneously (multiplexed) in a single channel. The tuner receives all those programs, and then the receiver filters the one the user wants to watch, discarding all the others. CCExtractor does this too: if a stream contains more than one program you have to pick one. The goal is to modify CCExtractor so it's able to process all programs in the stream at the same time, generating the transcript for all of them in one pass.
Expected outcome: Launching just one CCExtractor (you can spawn processes of course), read a stream with more than one program and generate transcripts for all the programs at the same time.
Difficulty: Medium

Title: Multi channel
Description: A number of TV devices (most famously HDHomeRun) come with more than one tuner (some models as many as 6), allowing the user to watch several channels at the same time. CCExtractor is able to receive the data from HDHomeRun directly (no need for intermediate files) but it only listens to one tuner. The job: Modify CCExtractor so it's able to listen to any number of tuners at the same time.
Expected outcome: Launching just one CCExtractor (you can spawn processes of course), process more than one stream at the same time.
Difficulty: Medium

Title: ISDB support
Description: ISDB is a standard used in Brazil, Japan and some South/Central American countries. The job: Complete CCExtractor's support (currently basic).
Expected outcome: Complete support for languages with Latin alphabets and at least a proof of concept for Japanese.
Difficulty: Medium

Title: DTMB support
Description: DTMB is a standard used in China. The job: Add support in CCExtractor.
Expected outcome: Basic Chinese support.
Difficulty: Hard

Title: Automatic translation
Description: Use Google Translate APIs to translate transcripts in real time.
Expected outcome: Add a new parameter -translate $language. If present, use Google Translate to deliver the translated version.
Difficulty: Easy for a proof of concept, medium for good results
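
To make the "Multi program" idea above a little more concrete: in an MPEG transport stream, the list of programs is announced in the Program Association Table (PAT), carried in 188-byte packets on PID 0. The following Python sketch lists the programs found in one PAT packet. It is only an illustration under simplifying assumptions: the PAT fits in a single packet, the CRC is not verified, the function names (parse_pat_packet, build_demo_pat) are hypothetical, and the sample packet is hand-built for the demo. It is not how CCExtractor does it.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47
PAT_PID = 0x0000


def parse_pat_packet(packet):
    """Return {program_number: pmt_pid} for a single-packet PAT, else {}."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        return {}
    pid = ((packet[1] & 0x1F) << 8) | packet[2]
    starts_section = bool(packet[1] & 0x40)
    adaptation_control = (packet[3] >> 4) & 0x03
    if pid != PAT_PID or not starts_section or not (adaptation_control & 0x01):
        return {}

    offset = 4
    if adaptation_control & 0x02:          # skip the adaptation field
        offset += 1 + packet[4]
    if offset >= TS_PACKET_SIZE:
        return {}
    offset += 1 + packet[offset]           # skip pointer_field
    if offset + 8 > TS_PACKET_SIZE or packet[offset] != 0x00:
        return {}                          # table_id 0x00 identifies a PAT

    section_length = ((packet[offset + 1] & 0x0F) << 8) | packet[offset + 2]
    # The program loop ends 4 bytes before the section end (CRC32).
    loop_end = min(offset + 3 + section_length - 4, TS_PACKET_SIZE)
    programs = {}
    for pos in range(offset + 8, loop_end - 3, 4):
        program_number = (packet[pos] << 8) | packet[pos + 1]
        pmt_pid = ((packet[pos + 2] & 0x1F) << 8) | packet[pos + 3]
        if program_number != 0:            # 0 points at the network PID
            programs[program_number] = pmt_pid
    return programs


def build_demo_pat():
    """Hand-built PAT packet announcing two programs (demo data only)."""
    header = bytes([0x47, 0x40, 0x00, 0x10])   # sync, section start, PID 0
    section = bytes([
        0x00,                      # pointer_field
        0x00,                      # table_id (PAT)
        0xB0, 0x11,                # section_length = 17
        0x00, 0x01,                # transport_stream_id
        0xC1,                      # version / current_next_indicator
        0x00, 0x00,                # section_number, last_section_number
        0x00, 0x01, 0xE1, 0x00,    # program 1 -> PMT PID 0x100
        0x00, 0x02, 0xE2, 0x00,    # program 2 -> PMT PID 0x200
        0x00, 0x00, 0x00, 0x00,    # CRC placeholder (not checked here)
    ])
    body = header + section
    return body + b"\xff" * (TS_PACKET_SIZE - len(body))


if __name__ == "__main__":
    print(parse_pat_packet(build_demo_pat()))
```

Knowing every program and its PMT PID up front is what would let a single pass route packets to one decoder per program instead of asking the user to pick one.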

On top of these CCExtractor-only features, there's some cross-project work we are considering.

Title: Red Hen
Description: Red Hen is an amazing effort by a number of universities around the world. They record hundreds of TV programs every day and analyze communications in verbal and non-verbal ways. For verbal communication they use CCExtractor. So there are a number of collaboration opportunities with Red Hen.
Difficulty: Varies

Title: Handbrake
Description: Handbrake is one of the best video transcoders. Its closed captioning stuff is a slightly modified static version of CCExtractor. Since last year we made a CCExtractor library, we can help Handbrake by having it use our library so they can easily keep pace with us.
Difficulty: Medium

Title: ffmpeg
Description: Last year we modified our code so we could use ffmpeg's code to parse video files instead of using our internal decoders. ffmpeg however lacks decent closed captioning support. Using our library we could give back a bit by doing the integration work.
Difficulty: Medium

THE MENTORS

Useful things to know about the mentors:

- The technical mentors are all part of the development team. So you get access to the people behind the project.
- The non-technical mentors use CCExtractor as a key part of their architecture. They run the largest sites with CCExtractor such as UCLA's video archive.
- Everyone takes mentorship very seriously. You will have access to everyone in a Skype group in which everyone hangs out, email, chat, etc. Note that in return for mentors being available to you every day you are expected to make use of the open channel to mentors very often.
- We are active in our local developer communities such as GDG (Google Developer Groups).
- We often attend developer events, big (such as I/O) and small (such as local DevFests). If you see us at one, by all means introduce yourself.
- We also have a tight relationship with Google Spain which often sponsors our activities.
- We are also Glass Explorers always looking for ways to make Glass a part of our subtitle world.

The mentors (with links to LinkedIn; feel free to request a connection if you have a genuine interest in CCExtractor and GSOC, and mention it in the request)

Veteran mentors (everyone repeats from last year)
Carlos Fernández Sanz, CCExtractor original developer and still maintainer. For everything related to CCExtractor internals or CC data formats.
Ander Martínez, works with Carlos in a number of projects. Go-to guy for Android, Glass, and wearable devices in general.
Andreu Ibañez, a GSOC veteran with a number of quite successful projects. A mentor of mentors :-)
David Liontooth, runs the CCExtractor operation at UCLA. He provides a real life complex environment for CCExtractor, with many TV stations from around the world being processed in real time.
José Miguel Martínez, an expert among other things in the Google App Engine ecosystem.
Candela F. Torrent, a biologist. You'd say she's an unlikely CCExtractor match, but she moonlights in the subtitle world as a top notch synchronizer. A demanding end user.

Resources for CCExtractor work:
Of course, the source code of CCExtractor has always been available. So you can already look around.
Before you start working we will give you access to our video sample repository (if you really, really need it now, get in touch). This is a collection of videos that work well (and must continue to work well after any code changes) and videos that don't work for any reason and that we want to support.