github.com/moses-smt/mosesdecoder.git
Age | Commit message | Author
2022-05-08 | nonbreaking_prefix.tdt: add "Nu" for "Numeru" | Raphaël Merx
    E.g. "Dekretu-Lei Nu. 18/2022" -> "Decree Law No. 18/2022"
2022-01-21 | Modify a comment on usage in the scriptswk0627
2021-03-13 | Add tokenisation support for the Tetun language | Raphael Merx
2020-08-03 | Allow Arabic letters to begin a fa sentence | Kenneth Heafield
2020-07-31 | adding rules for Catalan | Cristina España i Bonet
    special characters within words and contractions closer to French than to English
2020-06-30 | escape ampersands | Barry Haddow
2020-06-02 | Merge pull request #221 from HjalmarrSv/master | Hieu Hoang
    Added some for sv
2020-05-23 | Update nonbreaking_prefix.sv | HjalmarrSv
    Added Å, Ä and Ö, which are not unusual initials in names, e.g. Åke, Ärling, Östen. Added some new entries, mostly variations on the existing ones. Both a dot after each letter (or pair) and a dot only after the last letter are accepted forms. A couple of decades ago there had to be a space after the dot, which explains the third form. The file for sv is much more useful with these few additions, although it is still far from complete. Removed: G (occurred twice). One item in this list is also a word, even when case is kept: tom. If all words are in lower case, then tex, mao and tom (again) may be confused with names, and iaf, etc with named entities.
2020-03-19 | sentence splitter -k option to keep line boundaries | Kenneth Heafield
2020-03-19 | Add Pashto ؟ as a sentence splitting character | Kenneth Heafield
2020-02-26 | flag to turn off sentence splitter from emitting <P> | William Waites
2020-02-20 | Revert "line buffering for tokeniser and truecaser" | Kenneth Heafield
    This reverts commit 691717c42569fc94b9454d5ac862041684465654.
2020-02-17 | line buffering for tokeniser and truecaser | William Waites
2020-01-06 | Proper spacing [alvations-patch-2] | alvations
2019-12-17 | Modernized | HjalmarrSv
    I wanted to properly parse the links on https://dumps.wikimedia.org/mirrors.html when the page is copied as text. My proposed changes do the job. Basically, I replaced the + at the end of line 5 with *(\/)? The pipe symbol could lead to crashes, which is why I broke line 5 up into three lines. After reading various posts, I suggest not using the pipe (|).
2019-12-16 | attempt to handle Korean better; only consider horizontal space in final split | Barry Haddow
2019-12-09 | split word on any type of space | Barry Haddow
2019-11-25 | Single quotes should be escaped as single quotes. [alvations-patch-normalization] | alvations
2019-11-08 | 2 letter codes | Barry Haddow
2019-11-08 | support for several Indic languages | Barry Haddow
2019-11-05 | initial hi non-breaking prefixes | Barry Haddow
2019-11-05 | list items | Barry Haddow
2019-11-05 | rupees | Barry Haddow
2019-11-05 | fix abbrev rule | Barry Haddow
2019-11-01 | devanagari fix | Barry Haddow
2019-10-31 | reorganise indic support | Barry Haddow
2019-10-31 | use block notation for indic scripts | Barry Haddow
2019-10-31 | fix nbp | Barry Haddow
2019-10-28 | full cjk test | Barry Haddow
2019-10-28 | Merge branch 'master' of github.com:moses-smt/mosesdecoder | Barry Haddow
2019-10-14 | Update replace-unicode-punctuation.perl | Kevin Canwen Xu
2019-10-01 | Undoing 05788925812f0d3265e355565cbb1701a0ad7510 | alvations
    Causes abbreviations to not split when ending with a full stop. E.g.
    > The restructuring of IBM was essential to enable it organisationally to take up the responsibilities entrusted in the role with the recent changes in the policy and legislations, revised charter of function of IBM and the new activities and initiatives undertaken by IBM. IBM is also engaged in handholding the States for auction of mineral blocks for greater transparency in allocation of mineral concessions.
2019-09-30 | debug | Barry Haddow
2019-09-30 | revert 05788925 | Barry Haddow
2019-09-30 | enable custom non breaking prefixes | Barry Haddow
2019-09-30 | Merge branch 'master' of github.com:moses-smt/mosesdecoder | Barry Haddow
2019-09-30 | do not add spaces in cjk | Barry Haddow
2019-09-23 | Enable use strict pragma | titsuki
2019-09-04 | The dot before an acronym should be optional. [alvations-patch-regexes] | alvations
2019-07-10 | Support for Urdu in sentence splitter | Achim Ruopp
2019-04-26 | escape angle brackets | Matt Post
    The script doesn't escape angle brackets which can result in bad SGML / XML output. This fixes that, although ideally, this should be implemented with a proper parser and dumper.
2019-02-27 | Fix non-ASCII lowercasing | Joel Barry
2019-01-04 | Revert "use ucfirst instead of defined uppercase function" | Hieu Hoang
    This reverts commit dfbb17e549d4cb4ece452c7224ae47a590b7a4da.
2019-01-03 | Merge pull request #207 from alvations/patch-truecaser | Hieu Hoang
    Reverting split_xml()
2019-01-03 | Reverting split_xml() | alvations
2018-12-30 | consistent output | Hieu Hoang
2018-12-20 | use ucfirst instead of defined uppercase function | alvations
2018-12-20 | split_xml should be consistent for training and using | alvations
2018-12-10 | increase cores to 16. For bitextor azure pipeline | Hieu Hoang
2018-12-08 | ems config for moses2 | Hieu Hoang
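
The 2019-04-26 "escape angle brackets" and 2020-06-30 "escape ampersands" entries in this log both concern keeping script output valid as SGML / XML. The following is only a minimal sketch of that kind of escaping, written in Python rather than the repository's Perl. The function name escape_sgml is hypothetical and the code is not taken from either commit.

```python
def escape_sgml(text: str) -> str:
    """Escape the three characters that most often break SGML / XML text.

    Hypothetical helper for illustration; not the code from the commits.
    """
    # The ampersand is replaced first. If it were replaced last, the "&"
    # introduced by "&lt;" and "&gt;" would be escaped a second time.
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )


if __name__ == "__main__":
    print(escape_sgml("AT&T <b>rocks</b>"))
    # prints: AT&amp;T &lt;b&gt;rocks&lt;/b&gt;
```

As the 2019-04-26 message itself notes, a proper parser and dumper would be the more robust choice; plain string replacement like this only covers the simple case.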