Update CHANGELOG

* Reformat to keepchangelog.com style
* Remove old Bicleaner changelogs
* Add new changes to the unreleased section
author: ZJaume <jzaragoza@prompsit.com> 2022-07-27 14:28:15 +0300
committer: ZJaume <jzaragoza@prompsit.com> 2022-07-27 14:28:15 +0300
commit: 54a67245c43832325707c16e3ca521e429bc72e3 (patch)
tree: 477aec396391e94fd211b4deed9c36d8c5207ca9
parent: eaf7036f1f4a4d4813dce939553d0304f83bbbee (diff)
1 files changed, 19 insertions, 89 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5702013..28ee746 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,21 +1,30 @@
-Unreleased:
-* Update to Hardrules 2.0
-    * Rules can be parametrized with `--rules_config config.yaml`
-    * Some rules have been refactored with better names.
-    * `--run_all_rules` mode to run each rule instead of stoppping at first discard
-    * Language identification with [FastSpell](https://github.com/mbanon/fastspell)
-* Huge memory improvements during training.
+## Unreleased:
+### Added
 * Hide Tensorflow and Transformers logging messages in executable scripts.
-* Update HF Transformers, no longer needed single GPU for prediction.
+* Redirect Keras prediction progress bar to stderr.
+* Huge memory improvements during training.
+### Changed
+* Update to Hardrules 2.3
+  * Rules can be parametrized with `--rules_config config.yaml`
+  * Some rules have been refactored with better names.
+  * `--run_all_rules` mode to run each rule instead of stoppping at first discard
+  * Language identification with [FastSpell](https://github.com/mbanon/fastspell)
+  * Easier installation! Now KenLM comes pre-compiled.
+* Now BICLEANER\_AI\_THREADS environment variable controls the number of threads.
+* Update HF Transformers.
+* Update TensorFlow minimum version.
+* Set inter/intra\_op parallelism to 0 by default.
+* Add citation info to README.
+### Fixed
 * Avoid generating empty sentences in omit noise.
 * Restore capital letters at the beggining of the sentennce in frequency noise.
 * Fix loading lite models in other other Python versions than 3.8.
 * Other minor fixes.
 
-Bicleaner AI 1.0.1:
+## Bicleaner AI 1.0.1:
 * Update hardrules to 1.2: adds score only mode.
 
-Bicleaner AI 1.0:
+## Bicleaner AI 1.0:
 * Bicleaner train changes:
   * Separate most of the training logic in the BaseModel class.
   * Re-factor synthetic noise build function.
@@ -27,82 +36,3 @@ Bicleaner AI 1.0:
 * Bicleaner classify changes:
   * Change old classifier by new neural models.
   * Move hardrules into a separate package.
-
-Bicleaner 0.15:
-* Bicleaner train changes:
-  * Qmax bug fixing.
-  * Classifier training uses the number of processes given by argument.
-* Bicleaner classify changes:
-  * Refactored classifier scripts: code cleaning and remove lot of duplicated code.
-  * Buffered tokenization: improve speed of external tokenization tokenizing blocks of lines instead of line by line.
-
-Bicleaner 0.14: 
-* Bicleaner hardrules changes:
-  * New rule: filter out sentences containing gluedWordsLikeThis.
-  * Rule change: Relaxed c_different_language rule for similar languages.
-  * New rule: filter out porn sentences using FastText classifier.
-  * Parameters changed: `-s/--source_lang` and `-t/--target_lang` are no longer mandatory (if a metadata .yaml file is provided)
-* Bicleaner train changes:
-  * Default classifier is now `extra_trees`
-  * New parameters: `-f`  and `-F`, source and target word frequency dictionaries.
-  * New qmax features:
-    * `qmax_nosmooth_nolimit_freq`: removes OOV smoothing, word limits and weights each target word with its monolingual probability using the word frequency dictionary.
-    * `qmax_nosmooth_nolimit_cummulated_prob_zipf_freq`: uses accumulated probability instead of maximum and splits the score into quartiles based on word frequencies.
-  * Added more bilingual dictionary coverage features, splitting them into quartiles based on monolingual word frequencies.
-  * Added new noise function that synthesizes negative samples cutting sentences and replacing words (this is not used by default, needs more testing).
-  * Changed classifier training behavior and use grid search.
-  * Removed `bicleaner_train_lite.py`
-  * Removed parameters: `-g` (`--good_examples`) and `-w` (`--wrong_examples`):
-    * Now, training automatically uses one half of the input file for good examples and the other half to synthesize wrong examples.
-    * Of this partitions, 90% will be used for training and the remaining 10% for testing.
-  * New parameter: `--relative_paths` allows to save model files paths relative instead of absolute  (useful for training distributable models)
-  * Changed logging info messages, now more informative.
-* Other
-   * Now using [sacremoses](https://github.com/alvations/sacremoses) instead of [mosestokenizer](https://github.com/luismsgomes/mosestokenizer)
-   * New script:  `./utils/download-pack.sh` allows to download language packs for a given language pair.
-
-
-Bicleaner 0.13:
-* Bicleaner hardrules changes:
-  * Rule change: Relaxed c_minimal_length to accept 3-word sentences	
-  * New feature: LM filtering (moved from Bicleaner Classify)
-  * New parameter: `--disable_lm_filter`, `--metadata` and `--lm_threshold`, to support LM filtering
-* Bicleaner training changes: 
-  * New parameter: Features relying on language identification can be disabled with flag `--disable_lang_ident` (this will be outputed in the .yaml file and used by Bicleaner clasifier)
-  * New feature: Debug mode now gives information on random forest feature importances
-  * Parameter change: --noisy_examples_file_sl and --noisy_examples_file_tl are now optional
-  * Parameter change: input now must be more than 10K sentences long
-  * Removed INFO messages when processes starting/ending (except when debugging)
-* Bicleaner classifier changes:
-  * `--disable_lang_ident` flag is now read from the .yaml file
-  * Removed feature: LM filtering (moved to Bicleaner Hardrules)
-  * New parameter: `--disable_lm_filter`
-  * Removed parameters: `--keep_lm_result`,  `--threshold`
-* Other:
-  * Updated requirements
-  
-  
-  
-Bicleaner 0.12:
-* Bicleaner hardrules changes:
-  * New rule: c_identical_wo_punct to reject sentences only different in punctuation (and it's case insensitive)
-  * New rule:  Sentences containing "Re:" are rejected
-  * Rule change: c_minimal_length now rejects sentences with both sides <= 3 words (instead of only one)
-  * Rule change: c_identical and c_identical_wo_digits now is case insensitive
-  * Rule change: Breadcrumbs rule now split into c_no_breadcrumbs1 and c_no_breadcrumbs2
-  * Rule change: Breadcrumbs2 now includes character "·" in the rejected characters
-  * Rule change: c_length now compares byte length ratio (will avoid rejecting valid sentences due to length ratio when comparing languages with different alphabets)
-  * Changed behaviour for `--annotated_output` argument in hardrules. See README.md for more information.
-  * New parameter: `--disable_lang_ident` flag to avoid applying rules that need to identify the language
-* Bicleaner classify changes:  
-  * Now using only 3 decimal places for Bicleaner score and LM score
-  * Removed INFO messages when processes starting/ending (except when debugging)
-  * New parameter: '--disable_hardrules' flag to avoid applying hardrules
-  * New parameter: '--disable_lang_ident' flag to avoid applying rules that need to identify the language
-  * New parameter: '--score_only' flag to output only Bicleaner scores (proposed by [@kirefu](https://github.com/kirefu))
-* Bicleaner features changes:
-  * Fixed bug when probability in prob_dict is 0 (issue [#19](https://github.com/bitextor/bicleaner/issues/19))
-* Other:
-  * Fixed sklearn version to 0.19.1
-  * Added utilities for training: `shuffle.py` and `dict_pruner.py`
-  * Updated instalation guides in readme