CSC - ContextSubCorrector - by epere4

DESCRIPTION:
This program attempts to correct subtitle errors produced by OCR
extraction. It corrects four kinds of errors:
  * Spaces inside numbers.
  * Spaces before certain closing symbols.
  * Spaces after some opening symbols.
  * Confusion between I (upper case i) and l (lower case L).

DETAILS:
  * Spaces inside numbers:
    Sometimes the small gap between two digits is read as a space, so
    the output file contains a number split into two or more parts.
    This happens a lot with the digit 1 (one). Some examples:
      "1999" can be misread and written as "1 999".
      "1,323" can be misread and written as "1 ,323".
      "3.432" can be misread and written as "3 .43 2".
    The correction is based on the fact that a digit cannot be followed
    by a space and then by another digit (at least not in normal
    language), nor by a space and then a comma or a dot.

  * Spaces before certain closing symbols:
    Some symbols must not have spaces before them. The program deletes
    all the spaces before the following symbols:
      Closing question mark (?)
      Closing exclamation mark (!)
      Comma (,)
      Dot (.)
      Colon (:)

  * Spaces after some opening symbols:
    Some other symbols cannot have spaces after them. The program
    deletes the spaces that come after the following symbols:
      Opening question mark (¿) (English doesn't have this, but other
      languages do)
      Opening exclamation mark (¡) (idem)

  * Confusion between I (upper case i) and l (lower case L):
    Many DVD subtitles are rendered in Arial, where the letters 'l' and
    'I' look very similar. This program takes advantage of the fact
    that there are some positions in a word where an 'l' can appear but
    an 'I' can't, and the other way around.
    These are the rules the program follows to detect where there is an
    'l' that should be an 'I', and vice versa:
      mIm  --> mlm
      MIm  --> Mlm
      mIM  --> mlM
      MIM      Valid
      _IM      Valid
      _Im      Valid
      mI_  --> ml_
      MI_      Valid
      MlM  --> MIM
      mlM  --> mIM   (weird case, we don't correct it)
      Mlm      Valid
      mlm      Valid
      _lM  --> _IM
      _lm      Valid
      ml_      Valid
      Ml_      Valid
      MMl_ --> MMI_
      _l_  --> _I_
    Where:
      m: lower case letter
      M: upper case letter
      _: space
      I: letter I
      l: letter l

KNOWN ISSUES:
There may be times when digits have been deliberately written with
spaces between them for some special reason. The program cannot guess
that, so it will join them anyway.
There are some cases where, without a dictionary, it is impossible to
determine whether there should be an 'l' or an 'I' (e.g. at the
beginning of a word when the rest of the word is in lower case).
"Internet" and "lamp" are both valid words, but the OCR could just as
well produce "lnternet" and "Iamp", and the program would not be able
to correct either one.

SOLUTIONS:
CSC gives you a log file that lists every line that has been changed,
together with its line number, so you can manually fix the ones that
were wrongly corrected.
Furthermore, if you don't like what CSC has done to your file, you can
restore it from the backup copy that CSC makes for you.
In any case, after correcting the subtitle with CSC you should run it
through some other program (like Word) that can correct using
dictionaries. You could have used that program in the first place, but
CSC does most of the boring, repetitive work, so the whole process
takes you less time.

HOW TO USE IT:
The program is a console application, so it works from the command
line, but it also works if you just drag and drop the subtitle file
onto the program file.
Or even better: if you use the installer of this program, you will be
able to right-click on a file and get an option in the context menu
that corrects the subtitle.
Usage on the command line is very simple:
    CSC.exe [subtitle file]

LICENCE:
This program is absolutely free and can be used or distributed in any
way that you want. The only thing I ask is that you quote the official
site and/or the author.
The author of CSC is not responsible for any harm that the program may
cause. The use that the user makes of this program is his/her absolute
responsibility.

SOURCES:
Ask for them and I will give them to you :-)

CONTACT:
Any comments or suggestions are welcome at epere4 [at] gmx . net
Visit my website http://home.no/epere4
Visit Doom9 for the most comprehensive guides for DVD backup
(http://www.doom9.org/)
Visit Doom9 in Spanish for a Spanish translation of Doom9's site
(http://spanish.doom9.org/)

FUTURE DEVELOPMENT:
  - Don't know. What do you suggest?

CHANGELOG:
Version 0.2 beta - June 20th, 2003
  * The name: the program is now called ContextSubCorrector, and not
    NumSubCorrector.
  * l/I correction.
  * Deletes spaces before any of the following symbols:
    '?', ',', '.', ':', '!'
  * Deletes spaces that come after any of the following symbols:
    '¿', '¡'
  * A small improvement in the presentation of the log file.
  * Dual language: the program now also comes in Spanish, and not only
    in English. You can choose the language from the installer.

Version 0.1 beta - June 16th, 2003
  * First release. Only corrects spaces between numbers.
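
EXAMPLE SKETCH:
The four corrections described under DETAILS can be illustrated with a
few regular expressions. What follows is only a minimal Python sketch
written from the rules in this file; it is not the source code of CSC,
and every name in it (fix_spacing, fix_i_and_l, correct_line and the
pattern constants) is made up for the example. It also assumes that
"_" in the l/I table can mean the start or end of a line as well as a
space, which this file does not state.

```python
import re

# Rule 1: spaces after a digit that are followed by a digit, comma or dot.
SPACES_IN_NUMBER = re.compile(r"(?<=\d) +(?=[\d,.])")
# Rule 2: spaces before the closing symbols ? ! , . :
SPACES_BEFORE_CLOSING = re.compile(r" +(?=[?!,.:])")
# Rule 3: spaces after the opening symbols ¿ and ¡
SPACES_AFTER_OPENING = re.compile(r"(?<=[¿¡]) +")

# Rule 4: the l/I table. Assumption: "_" is a space or a line boundary.
# The "weird case" mlM is deliberately left out, as in the table.
I_OR_L = re.compile(
    r"(?P<i2l>(?<=[a-z])I(?=[A-Za-z ]|$)"  # mIm, mIM, mI_  -> l
    r"|(?<=[A-Z])I(?=[a-z]))"              # MIm            -> l
    r"|(?P<l2i>(?<=[A-Z])l(?=[A-Z])"       # MlM            -> I
    r"|(?<![^ ])l(?=[A-Z])"                # _lM            -> I
    r"|(?<=[A-Z][A-Z])l(?= |$)"            # MMl_           -> I
    r"|(?<![^ ])l(?= |$))"                 # _l_            -> I
)


def fix_spacing(line):
    """Apply the three spacing rules, in the order they are listed."""
    line = SPACES_IN_NUMBER.sub("", line)
    line = SPACES_BEFORE_CLOSING.sub("", line)
    return SPACES_AFTER_OPENING.sub("", line)


def fix_i_and_l(line):
    """Swap 'I' and 'l' where the table says the letter cannot be right.

    All matches are found on the original text in a single pass, so one
    replacement never changes the context seen by the next one.
    """
    return I_OR_L.sub(lambda m: "l" if m.group("i2l") else "I", line)


def correct_line(line):
    """Run the spacing rules first, then the l/I rules."""
    return fix_i_and_l(fix_spacing(line))


if __name__ == "__main__":
    print(fix_spacing("3 .43 2"))                    # 3.432
    print(fix_spacing("¿ Qué hora es ?"))            # ¿Qué hora es?
    print(fix_i_and_l("HeIIo"))                      # Hello
    print(fix_i_and_l("FBl agent"))                  # FBI agent
    print(correct_line("l saw 1 0 cars , really !")) # I saw 10 cars, really!
```

Like CSC itself, the sketch has no dictionary, so it shares the
limitations listed under KNOWN ISSUES: digits that were meant to be
separated get joined, and a word such as "Iamp" is left untouched.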