author | ZJaume <jzaragoza@prompsit.com> | 2022-07-27 14:28:15 +0300 |
---|---|---|
committer | ZJaume <jzaragoza@prompsit.com> | 2022-07-27 14:28:15 +0300 |
commit | 54a67245c43832325707c16e3ca521e429bc72e3 (patch) | |
tree | 477aec396391e94fd211b4deed9c36d8c5207ca9 | |
parent | eaf7036f1f4a4d4813dce939553d0304f83bbbee (diff) |
Update CHANGELOG
* Reformat to keepchangelog.com style
* Remove old Bicleaner changelogs
* Add new changes to the unreleased section
-rw-r--r-- | CHANGELOG.md | 108 |
1 files changed, 19 insertions, 89 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5702013..28ee746 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,21 +1,30 @@
-Unreleased:
-* Update to Hardrules 2.0
-  * Rules can be parametrized with `--rules_config config.yaml`
-  * Some rules have been refactored with better names.
-  * `--run_all_rules` mode to run each rule instead of stoppping at first discard
-  * Language identification with [FastSpell](https://github.com/mbanon/fastspell)
-* Huge memory improvements during training.
+## Unreleased:
+### Added
 * Hide Tensorflow and Transformers logging messages in executable scripts.
-* Update HF Transformers, no longer needed single GPU for prediction.
+* Redirect Keras prediction progress bar to stderr.
+* Huge memory improvements during training.
+### Changed
+* Update to Hardrules 2.3
+  * Rules can be parametrized with `--rules_config config.yaml`
+  * Some rules have been refactored with better names.
+  * `--run_all_rules` mode to run each rule instead of stoppping at first discard
+  * Language identification with [FastSpell](https://github.com/mbanon/fastspell)
+  * Easier installation! Now KenLM comes pre-compiled.
+* Now BICLEANER\_AI\_THREADS environment variable controls the number of threads.
+* Update HF Transformers.
+* Update TensorFlow minimum version.
+* Set inter/intra\_op parallelism to 0 by default.
+* Add citation info to README.
+### Fixed
 * Avoid generating empty sentences in omit noise.
 * Restore capital letters at the beggining of the sentennce in frequency noise.
 * Fix loading lite models in other other Python versions than 3.8.
 * Other minor fixes.
 
-Bicleaner AI 1.0.1:
+## Bicleaner AI 1.0.1:
 * Update hardrules to 1.2: adds score only mode.
 
-Bicleaner AI 1.0:
+## Bicleaner AI 1.0:
 * Bicleaner train changes:
   * Separate most of the training logic in the BaseModel class.
   * Re-factor synthetic noise build function.
@@ -27,82 +36,3 @@ Bicleaner AI 1.0:
 * Bicleaner classify changes:
   * Change old classifier by new neural models.
 * Move hardrules into a separate package.
-
-Bicleaner 0.15:
-* Bicleaner train changes:
-  * Qmax bug fixing.
-  * Classifier training uses the number of processes given by argument.
-* Bicleaner classify changes:
-  * Refactored classifier scripts: code cleaning and remove lot of duplicated code.
-  * Buffered tokenization: improve speed of external tokenization tokenizing blocks of lines instead of line by line.
-
-Bicleaner 0.14:
-* Bicleaner hardrules changes:
-  * New rule: filter out sentences containing gluedWordsLikeThis.
-  * Rule change: Relaxed c_different_language rule for similar languages.
-  * New rule: filter out porn sentences using FastText classifier.
-  * Parameters changed: `-s/--source_lang` and `-t/--target_lang` are no longer mandatory (if a metadata .yaml file is provided)
-* Bicleaner train changes:
-  * Default classifier is now `extra_trees`
-  * New parameters: `-f` and `-F`, source and target word frequency dictionaries.
-  * New qmax features:
-    * `qmax_nosmooth_nolimit_freq`: removes OOV smoothing, word limits and weights each target word with its monolingual probability using the word frequency dictionary.
-    * `qmax_nosmooth_nolimit_cummulated_prob_zipf_freq`: uses accumulated probability instead of maximum and splits the score into quartiles based on word frequencies.
-  * Added more bilingual dictionary coverage features, splitting them into quartiles based on monolingual word frequencies.
-  * Added new noise function that synthesizes negative samples cutting sentences and replacing words (this is not used by default, needs more testing).
-  * Changed classifier training behavior and use grid search.
-  * Removed `bicleaner_train_lite.py`
-  * Removed parameters: `-g` (`--good_examples`) and `-w` (`--wrong_examples`):
-    * Now, training automatically uses one half of the input file for good examples and the other half to synthesize wrong examples.
-    * Of this partitions, 90% will be used for training and the remaining 10% for testing.
-  * New parameter: `--relative_paths` allows to save model files paths relative instead of absolute (useful for training distributable models)
-  * Changed logging info messages, now more informative.
-* Other
-  * Now using [sacremoses](https://github.com/alvations/sacremoses) instead of [mosestokenizer](https://github.com/luismsgomes/mosestokenizer)
-  * New script: `./utils/download-pack.sh` allows to download language packs for a given language pair.
-
-
-Bicleaner 0.13:
-* Bicleaner hardrules changes:
-  * Rule change: Relaxed c_minimal_length to accept 3-word sentences
-  * New feature: LM filtering (moved from Bicleaner Classify)
-  * New parameter: `--disable_lm_filter`, `--metadata` and `--lm_threshold`, to support LM filtering
-* Bicleaner training changes:
-  * New parameter: Features relying on language identification can be disabled with flag `--disable_lang_ident` (this will be outputed in the .yaml file and used by Bicleaner clasifier)
-  * New feature: Debug mode now gives information on random forest feature importances
-  * Parameter change: --noisy_examples_file_sl and --noisy_examples_file_tl are now optional
-  * Parameter change: input now must be more than 10K sentences long
-  * Removed INFO messages when processes starting/ending (except when debugging)
-* Bicleaner classifier changes:
-  * `--disable_lang_ident` flag is now read from the .yaml file
-  * Removed feature: LM filtering (moved to Bicleaner Hardrules)
-  * New parameter: `--disable_lm_filter`
-  * Removed parameters: `--keep_lm_result`, `--threshold`
-* Other:
-  * Updated requirements
-
-
-
-Bicleaner 0.12:
-* Bicleaner hardrules changes:
-  * New rule: c_identical_wo_punct to reject sentences only different in punctuation (and it's case insensitive)
-  * New rule: Sentences containing "Re:" are rejected
-  * Rule change: c_minimal_length now rejects sentences with both sides <= 3 words (instead of only one)
-  * Rule change: c_identical and c_identical_wo_digits now is case insensitive
-  * Rule change: Breadcrumbs rule now split into c_no_breadcrumbs1 and c_no_breadcrumbs2
-  * Rule change: Breadcrumbs2 now includes character "·" in the rejected characters
-  * Rule change: c_length now compares byte length ratio (will avoid rejecting valid sentences due to length ratio when comparing languages with different alphabets)
-  * Changed behaviour for `--annotated_output` argument in hardrules. See README.md for more information.
-  * New parameter: `--disable_lang_ident` flag to avoid applying rules that need to identify the language
-* Bicleaner classify changes:
-  * Now using only 3 decimal places for Bicleaner score and LM score
-  * Removed INFO messages when processes starting/ending (except when debugging)
-  * New parameter: '--disable_hardrules' flag to avoid applying hardrules
-  * New parameter: '--disable_lang_ident' flag to avoid applying rules that need to identify the language
-  * New parameter: '--score_only' flag to output only Bicleaner scores (proposed by [@kirefu](https://github.com/kirefu))
-* Bicleaner features changes:
-  * Fixed bug when probability in prob_dict is 0 (issue [#19](https://github.com/bitextor/bicleaner/issues/19))
-* Other:
-  * Fixed sklearn version to 0.19.1
-  * Added utilities for training: `shuffle.py` and `dict_pruner.py`
-  * Updated instalation guides in readme
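
Note on the two threading entries in the Changed section: the changelog says that `BICLEANER_AI_THREADS` controls the number of threads and that inter/intra-op parallelism defaults to 0. It does not show how the variable is read. The sketch below is one plausible reading, not Bicleaner AI's actual code: the helper name `resolve_thread_count`, the fallback to 0 on a missing or invalid value, and the rejection of negative numbers are all assumptions made for illustration. The value 0 follows TensorFlow's convention that the runtime picks the thread count itself.

```python
import os


def resolve_thread_count(env=None, var="BICLEANER_AI_THREADS", default=0):
    """Return the thread count requested through the environment.

    Hypothetical helper: falls back to `default` (0, meaning "let the
    runtime decide") when the variable is unset, empty, not an integer,
    or negative.
    """
    env = os.environ if env is None else env
    raw = env.get(var, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value >= 0 else default


if __name__ == "__main__":
    threads = resolve_thread_count()
    print(f"inter/intra-op threads: {threads}")
    # With TensorFlow installed, the value could be applied before any op runs:
    #   import tensorflow as tf
    #   tf.config.threading.set_inter_op_parallelism_threads(threads)
    #   tf.config.threading.set_intra_op_parallelism_threads(threads)
```

Passing the environment as an argument keeps the parsing testable without touching `os.environ`.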