author | TharinduDR <rhtdranasinghe@gmail.com> | 2021-05-24 19:23:59 +0300 |
---|---|---|
committer | TharinduDR <rhtdranasinghe@gmail.com> | 2021-05-24 19:23:59 +0300 |
commit | 4fbb3aea38c20246c2bb7f33160940aad7aae782 | |
tree | 31fbc85867a03b1741ba58b42037d32f16909479 | |
parent | c048d7107e0ba2db503cf54360059fcc3350241a | |
057: Code Refactoring - Siamese Architectures
-rw-r--r-- | README.md | 17 |
-rw-r--r-- | docs/architectures/sentence_level_architectures.md | 66 |
-rw-r--r-- | docs/index.md | 16 |
3 files changed, 38 insertions, 61 deletions
@@ -25,11 +25,24 @@ With TransQuest, we have opensourced our research in translation quality estimat
 ## Resources
+* [COLING Presentation](https://youtu.be/WVgitropUyE) done on December, 2020.
 - [Research Seminar](https://youtu.be/xbsbHUVVF3s) done on 1st of October 2020 in [RGCL](http://rgcl.wlv.ac.uk/2020/09/24/research-seminar/) and the [slides](https://www.slideshare.net/TharinduRanasinghe1/transquest-238713809).
+
 ## Citations
-If you are using the package, please consider citing this paper which is accepted to [COLING 2020](https://coling2020.org/)
+If you are using the word-level architecture, please consider citing this paper which is accepted to [ACL 2021](https://2021.aclweb.org/).
+
+```bash
+@InProceedings{ranasinghe2021,
+author = {Ranasinghe, Tharindu and Orasan, Constantin and Mitkov, Ruslan},
+title = {An Exploratory Analysis of Multilingual Word Level Quality Estimation with Cross-Lingual Transformers},
+booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
+year = {2021}
+}
+```
+
+If you are using the sentence-level architectures, please consider citing these papers which were presented in [COLING 2020](https://coling2020.org/) and in [WMT 2020](http://www.statmt.org/wmt20/) at EMNLP 2020.
 ```bash
 @InProceedings{transquest:2020a,
@@ -39,8 +52,6 @@ booktitle = {Proceedings of the 28th International Conference on Computational L
 year = {2020}
 }
 ```
-
-If you are using the task specific fine tuning, please consider citing this which is accepted to [WMT 2020](http://www.statmt.org/wmt20/) at EMNLP 2020.
 ```bash
 @InProceedings{transquest:2020b,
diff --git a/docs/architectures/sentence_level_architectures.md b/docs/architectures/sentence_level_architectures.md
index 0dc8f86..77bd1dd 100644
--- a/docs/architectures/sentence_level_architectures.md
+++ b/docs/architectures/sentence_level_architectures.md
@@ -57,71 +57,27 @@ Then the output of all the word embeddings goes through a mean pooling layer. Af
 ### Minimal Start for a SiameseTransQuest Model
-First save your train/dev pandas dataframes to csv files in a single folder. We refer the path to that folder as "path" in the code below. You have to provide the indices of source, target and quality labels when reading with the QEDataReader class.
-
+Initiate and train the model like in the following code. train_df and eval_df are the pandas dataframes prepared with the instructions in Data Preparation section.
 ```python
-from transquest.algo.sentence_level.siamesetransquest import LoggingHandler, SentencesDataset, \
- SiameseTransQuestModel
-from transquest.algo.sentence_level.siamesetransquest import models, losses
-from transquest.algo.sentence_level.siamesetransquest.evaluation import EmbeddingSimilarityEvaluator
-from transquest.algo.sentence_level.siamesetransquest.readers import QEDataReader
-from torch.utils.data import DataLoader
-import math
-
-qe_reader = QEDataReader(path, s1_col_idx=0, s2_col_idx=1,
- score_col_idx=2,
- normalize_scores=False, min_score=0, max_score=1, header=True)
-
-word_embedding_model = models.Transformer("xlm-roberta-large", max_seq_length=siamesetransquest_config[
- 'max_seq_length'])
-
-pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
- pooling_mode_mean_tokens=True,
- pooling_mode_cls_token=False,
- pooling_mode_max_tokens=False)
-
-model = SiameseTransQuestModel(modules=[word_embedding_model, pooling_model])
-train_data = SentencesDataset(qe_reader.get_examples('train.tsv'), model)
-train_dataloader = DataLoader(train_data, shuffle=True,
- batch_size=siamesetransquest_config['train_batch_size'])
-train_loss = losses.CosineSimilarityLoss(model=model)
-
-eval_data = SentencesDataset(examples=qe_reader.get_examples('eval_df.tsv'), model=model)
-eval_dataloader = DataLoader(eval_data, shuffle=False,
- batch_size=siamesetransquest_config['train_batch_size'])
-evaluator = EmbeddingSimilarityEvaluator(eval_dataloader)
-
-warmup_steps = math.ceil(
- len(train_data) * siamesetransquest_config["num_train_epochs"] / siamese_transformer_config[
- 'train_batch_size'] * 0.1)
-
-
-model.fit(train_objectives=[(train_dataloader, train_loss)],
- evaluator=evaluator,
- epochs=siamesetransquest_config['num_train_epochs'],
- evaluation_steps=100,
- optimizer_params={'lr': siamesetransquest_config["learning_rate"],
- 'eps': siamesetransquest_config["adam_epsilon"],
- 'correct_bias': False},
- warmup_steps=warmup_steps,
- output_path=siamesetransquest_config['best_model_dir'])
+from transquest.algo.sentence_level.siamesetransquest.run_model import SiameseTransQuestModel
+model = SiameseTransQuestModel(MODEL_NAME, args=siamesetransquest_config)
+model.train_model(train_df, eval_df)
 ```
-An example siamese_transformer_config is available [here.](https://github.com/TharinduDR/TransQuest/blob/master/examples/wmt_2020/ro_en/siamese_transformer_config.py). The best model will be saved to the path specified in the "best_model_dir" in siamesetransquest_config. Then you can load it and do the predictions like this.
+An example siamese_transformer_config is available [here.](https://github.com/TharinduDR/TransQuest/blob/master/examples/sentence_level/wmt_2020/ro_en/siamesetransquest_config.py). The best model will be saved to the path specified in the "best_model_dir" in siamesetransquest_config. Then you can load it and do the predictions like this.
 ```python
-test_data = SentencesDataset(examples=qe_reader.get_examples("test.tsv", test_file=True), model=model)
- test_dataloader = DataLoader(test_data, shuffle=False, batch_size=8)
- evaluator = EmbeddingSimilarityEvaluator(test_dataloader)
+from transquest.algo.sentence_level.siamesetransquest.run_model import SiameseTransQuestModel
+
+model = SiameseTransQuestModel(siamesetransquest_config['best_model_dir'])
- model.evaluate(evaluator,
- result_path=os.path.join(siamesetransquest_config['cache_dir'], "test_result.txt"),
- verbose=False)
+predictions, raw_outputs = model.predict([[source, target]])
+print(predictions)
 ```
-You will find the predictions in the test_result.txt file in the siamesetransquest_config['cache_dir'] folder.
+Predictions are the predicted quality scores.
 !!! tip
     Now that you know about the architectures in TransQuest, check how we can apply it in WMT QE shared tasks [here.](https://tharindudr.github.io/TransQuest/examples/sentence_level_examples/)
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index 530342f..1b8ed27 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -26,11 +26,23 @@ With TransQuest, we have opensourced our research in translation quality estimat
 5. **[Contact](https://tharindudr.github.io/TransQuest/contact/)** - Contact us for any issues with TransQuest
 ## Resources
+* [COLING Presentation](https://youtu.be/WVgitropUyE) done on December, 2020.
 - [Research Seminar](https://youtu.be/xbsbHUVVF3s) done on 1st of October 2020 in [RGCL](http://rgcl.wlv.ac.uk/2020/09/24/research-seminar/) and the [slides](https://www.slideshare.net/TharinduRanasinghe1/transquest-238713809).
 ## Citations
-If you are using the package, please consider citing this paper which is accepted to [COLING 2020](https://coling2020.org/)
+If you are using the word-level architecture, please consider citing this paper which is accepted to [ACL 2021](https://2021.aclweb.org/).
+
+```bash
+@InProceedings{ranasinghe2021,
+author = {Ranasinghe, Tharindu and Orasan, Constantin and Mitkov, Ruslan},
+title = {An Exploratory Analysis of Multilingual Word Level Quality Estimation with Cross-Lingual Transformers},
+booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
+year = {2021}
+}
+```
+
+If you are using the sentence-level architectures, please consider citing these papers which were presented in [COLING 2020](https://coling2020.org/) and in [WMT 2020](http://www.statmt.org/wmt20/) at EMNLP 2020.
 ```bash
 @InProceedings{transquest:2020a,
@@ -40,8 +52,6 @@ booktitle = {Proceedings of the 28th International Conference on Computational L
 year = {2020}
 }
 ```
-
-If you are using the task specific fine tuning, please consider citing this which is accepted to [WMT 2020](http://www.statmt.org/wmt20/) at EMNLP 2020.
 ```bash
 @InProceedings{transquest:2020b,
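
Note on the training script removed from sentence_level_architectures.md in this commit: it builds each sentence vector with `models.Pooling(..., pooling_mode_mean_tokens=True)` and trains with `losses.CosineSimilarityLoss`. A minimal plain-Python sketch of what those two steps compute may help when reading the diff. The helper names `mean_pool` and `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not TransQuest code. A real model would take its token embeddings from the transformer and would also mask padding tokens, which this sketch leaves out.

```python
import math
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average token vectors dimension by dimension into one sentence vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / n_tokens for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy "token embeddings" standing in for transformer outputs.
source_tokens = [[1.0, 0.0, 1.0], [3.0, 2.0, 1.0]]
target_tokens = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

# Each side of the siamese pair is pooled independently.
source_vec = mean_pool(source_tokens)
target_vec = mean_pool(target_tokens)

# The similarity of the two pooled vectors plays the role of the quality score.
score = cosine_similarity(source_vec, target_vec)
print(round(score, 6))  # close to 1.0: both sentences pool to the same vector
```

Both toy sentences pool to the same vector here, so the similarity sits at the top of the cosine range; unrelated (orthogonal) vectors would score 0.0.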