{ "info": { "author": "Alexander Kukushkin", "author_email": "alex@alexkuk.ru", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3" ], "description": "\n\n![CI](https://github.com/natasha/razdel/workflows/CI/badge.svg) [![codecov](https://codecov.io/gh/natasha/razdel/branch/master/graph/badge.svg)](https://codecov.io/gh/natasha/razdel)\n\n`razdel` \u2014 a rule-based system for Russian sentence and word tokenization.\n\n## Usage\n\n```python\n>>> from razdel import tokenize\n\n>>> tokens = list(tokenize('\u041a\u0440\u0443\u0436\u043a\u0430-\u0442\u0435\u0440\u043c\u043e\u0441 \u043d\u0430 0.5\u043b (50/64 \u0441\u043c\u00b3, 516;...)'))\n>>> tokens\n[Substring(0, 13, '\u041a\u0440\u0443\u0436\u043a\u0430-\u0442\u0435\u0440\u043c\u043e\u0441'),\n Substring(14, 16, '\u043d\u0430'),\n Substring(17, 20, '0.5'),\n Substring(20, 21, '\u043b'),\n Substring(22, 23, '(')\n ...]\n\n>>> [_.text for _ in tokens]\n['\u041a\u0440\u0443\u0436\u043a\u0430-\u0442\u0435\u0440\u043c\u043e\u0441', '\u043d\u0430', '0.5', '\u043b', '(', '50/64', '\u0441\u043c\u00b3', ',', '516', ';', '...', ')']\n```\n\n```python\n>>> from razdel import sentenize\n\n>>> text = '''\n... - \"\u0422\u0430\u043a \u0432 \u0447\u0435\u043c \u0436\u0435 \u0434\u0435\u043b\u043e?\" - \"\u041d\u0435 \u0440\u0430-\u0434\u0443-\u044e\u0442\".\n... \u0418 \u0442. \u0434. \u0438 \u0442. \u043f. \u0412 \u043e\u0431\u0449\u0435\u043c, \u0432\u0441\u044f \u0433\u0430\u0437\u0435\u0442\u0430\n... '''\n\n>>> list(sentenize(text))\n[Substring(1, 23, '- \"\u0422\u0430\u043a \u0432 \u0447\u0435\u043c \u0436\u0435 \u0434\u0435\u043b\u043e?\"'),\n Substring(24, 40, '- \"\u041d\u0435 \u0440\u0430-\u0434\u0443-\u044e\u0442\".'),\n Substring(41, 56, '\u0418 \u0442. \u0434. \u0438 \u0442. 
\u043f.'),\n Substring(57, 76, '\u0412 \u043e\u0431\u0449\u0435\u043c, \u0432\u0441\u044f \u0433\u0430\u0437\u0435\u0442\u0430')]\n```\n\n## Installation\n\n`razdel` supports Python 3.5+ and PyPy 3.\n\n```bash\n$ pip install razdel\n```\n\n## Quality, performance\n\nUnfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `\u00ab\u041a\u0430\u043a \u0436\u0435 \u0442\u0430\u043a?! \u0417\u0430\u0445\u0430\u0440...\u00bb \u2014 \u0432\u043e\u0441\u043a\u043b\u0438\u043a\u043d\u0443\u0442 \u041f\u0440\u043e\u043d\u0438\u043d.` into three sentences `[\"\u00ab\u041a\u0430\u043a \u0436\u0435 \u0442\u0430\u043a?!\", \"\u0417\u0430\u0445\u0430\u0440...\u00bb\", \"\u2014 \u0432\u043e\u0441\u043a\u043b\u0438\u043a\u043d\u0443\u0442 \u041f\u0440\u043e\u043d\u0438\u043d.\"]`, while `razdel` splits it into two: `[\"\u00ab\u041a\u0430\u043a \u0436\u0435 \u0442\u0430\u043a?!\", \"\u0417\u0430\u0445\u0430\u0440...\u00bb \u2014 \u0432\u043e\u0441\u043a\u043b\u0438\u043a\u043d\u0443\u0442 \u041f\u0440\u043e\u043d\u0438\u043d.\"]`. What would be the correct way to tokenize `\u0442.\u0435.`? One may split it into `\u0442.|\u0435.`; `razdel` splits it into `\u0442|.|\u0435|.`.\n\n`razdel` tries to mimic the segmentation of these 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, and `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains such as social media, scientific articles and legal documents.\n\nWe measure the absolute number of errors. There are a lot of trivial cases in the tokenization task. For example, the text `\u0447\u0443\u0442\u044c-\u0447\u0443\u0442\u044c?!` is non-trivial: one may split it into `\u0447\u0443\u0442\u044c|-|\u0447\u0443\u0442\u044c|?|!` while the correct tokenization is `\u0447\u0443\u0442\u044c-\u0447\u0443\u0442\u044c|?!`. Such examples are rare. 
The vast majority of cases are trivial; for example, the text `\u0432 5 \u0447\u0430\u0441\u043e\u0432 ...` is correctly tokenized even via Python native `str.split` into `\u0432| |5| |\u0447\u0430\u0441\u043e\u0432| |...`. Due to the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors.\n\n`errors` \u2014 number of errors. For example, if the etalon (reference) segmentation is `\u0447\u0442\u043e-\u0442\u043e|?` and the prediction is `\u0447\u0442\u043e|-|\u0442\u043e?`, then the number of errors is 3: 1 for the missing split `\u0442\u043e?` + 2 for the extra splits `\u0447\u0442\u043e|-|\u0442\u043e`.\n\n`time` \u2014 total seconds taken.\n\n`spacy_tokenize`, `aatimofeev` and others are defined in naeval/segment/models.py. Tables are computed in segment/main.ipynb.\n\n### Tokens\n\n
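| | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---|---|---|---|---|---|---|---|
| `re.findall(\\w+\\|\\d+\\|\\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
| spacy | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
| nltk.word_tokenize | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
| mystem | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
| mosestokenizer | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
| segtok.word_tokenize | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
| aatimofeev/spacy_russian_tokenizer | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
| koziev/rutokenizer | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
| razdel.tokenize | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |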
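
The `errors` column can be illustrated with a small sketch. It assumes that errors are counted as the differences between two sets of split positions (missing splits plus extra splits), which reproduces the worked example above; `split_positions` and `count_errors` are hypothetical helper names used only for illustration, not part of `razdel` or `naeval`, and the actual evaluation code may differ.

```python
def split_positions(parts):
    # Character offsets at which a segmentation places a boundary.
    # The end of the last part is the end of the text, not a split.
    positions = set()
    offset = 0
    for part in parts[:-1]:
        offset += len(part)
        positions.add(offset)
    return positions


def count_errors(etalon, prediction):
    # Both segmentations must cover exactly the same text.
    assert ''.join(etalon) == ''.join(prediction)
    expected = split_positions(etalon)
    predicted = split_positions(prediction)
    missing = expected - predicted  # splits the prediction failed to make
    extra = predicted - expected    # splits the prediction should not have made
    return len(missing) + len(extra)


# Example from the text: 1 missing split + 2 extra splits.
etalon = ['\u0447\u0442\u043e-\u0442\u043e', '?']
prediction = ['\u0447\u0442\u043e', '-', '\u0442\u043e?']
print(count_errors(etalon, prediction))  # prints 3
```

Counting boundaries rather than whole tokens means that a single wrong split is charged once, however long the affected tokens are.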
\n\n### Sentences\n\n
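| | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---|---|---|---|---|---|---|---|
| `re.split([.?!\u2026])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
| segtok.split_single | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
| mosestokenizer | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
| nltk.sent_tokenize | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
| deeppavlov/rusenttokenize | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
| razdel.sentenize | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |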
\n\n\n## Support\n\n- Chat \u2014 https://telegram.me/natural_language_processing\n- Issues \u2014 https://github.com/natasha/razdel/issues\n\n## Development\n\nTest:\n\n```bash\npip install -e .\npip install -r requirements/ci.txt\nmake test\nmake int # 2000 integration tests\n```\n\nPackage:\n\n```bash\nmake version\ngit push\ngit push --tags\n\nmake clean wheel upload\n```\n\n`mystem` errors on `syntag`:\n\n```bash\n# see naeval/data\ncat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less\n```\n\nNon-trivial token tests:\n\n```bash\npv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt\npv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt\n```\n\nUpdate integration tests:\n\n```bash\ncd razdel/tests/data/\npv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt\n```\n\n`razdel` and `moses` diff:\n\n```bash\ncat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less\n```\n\n`razdel` performance:\n\n```bash\ncat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/natasha/razdel", "keywords": "nlp,natural language processing,russian,token,sentence,tokenize", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "razdel", "package_url": "https://pypi.org/project/razdel/", "platform": "", "project_url": "https://pypi.org/project/razdel/", "project_urls": { "Homepage": "https://github.com/natasha/razdel" }, "release_url": "https://pypi.org/project/razdel/0.5.0/", "requires_dist": null, "requires_python": "", "summary": "Splits russian text into tokens, sentences, section. 
Rule-based", "version": "0.5.0", "yanked": false, "yanked_reason": null }, "last_serial": 6886665, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "31b6b952555d51aa57b51bf1508ba6bf", "sha256": "d7252dd4070228a8badf7673fb251a45e4c666641c523c787ff9cec1aedf27a5" }, "downloads": -1, "filename": "razdel-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "31b6b952555d51aa57b51bf1508ba6bf", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 20155, "upload_time": "2018-11-10T19:04:39", "upload_time_iso_8601": "2018-11-10T19:04:39.986316Z", "url": "https://files.pythonhosted.org/packages/4a/df/fc809393ce7398ae6f7a175f121a0853e5d7add8bc6fb33c9c7aff7d86ec/razdel-0.1.0-py2.py3-none-any.whl", "yanked": false, "yanked_reason": null } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "a63b29ec7ed398d88c87aab0dc422e32", "sha256": "fcb632549610386558a1bea2a91985006cdbe4da350c9dfa9ef6d2a7e2cecfb6" }, "downloads": -1, "filename": "razdel-0.2.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "a63b29ec7ed398d88c87aab0dc422e32", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 20886, "upload_time": "2018-11-19T07:30:23", "upload_time_iso_8601": "2018-11-19T07:30:23.911359Z", "url": "https://files.pythonhosted.org/packages/7a/aa/087c57a1974f9295ea6b2aaead0fac002bdb1d648bfaa5a1977b852a9871/razdel-0.2.0-py2.py3-none-any.whl", "yanked": false, "yanked_reason": null } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "154a8603fc5db8fb98d483e61bc6a3ca", "sha256": "c94ccf688a212df409359b69c9258f6b3e7e091ea8c6212576454797a2b2f0c3" }, "downloads": -1, "filename": "razdel-0.3.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "154a8603fc5db8fb98d483e61bc6a3ca", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 20943, "upload_time": "2018-11-26T12:25:52", "upload_time_iso_8601": "2018-11-26T12:25:52.352908Z", "url": 
"https://files.pythonhosted.org/packages/94/c2/742bc726aad693c964b051481f0cb71937556885e25201fab1c7f8fae0b8/razdel-0.3.0-py2.py3-none-any.whl", "yanked": false, "yanked_reason": null } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "b5d3669e10b257cbfee33c43b4bab18b", "sha256": "7464ee93b1e68c4ff60a10faf7065e3ffe5c5aabba0a86c2027b52e97f6e30a3" }, "downloads": -1, "filename": "razdel-0.4.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b5d3669e10b257cbfee33c43b4bab18b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 20977, "upload_time": "2019-06-16T11:50:01", "upload_time_iso_8601": "2019-06-16T11:50:01.289174Z", "url": "https://files.pythonhosted.org/packages/cf/f0/664eb27854d7de7c3605b5cd2a155cf069143fb00902ac479325bf1a98b7/razdel-0.4.0-py2.py3-none-any.whl", "yanked": false, "yanked_reason": null } ], "0.5.0": [ { "comment_text": "", "digests": { "md5": "0c7a610b55ce5fd47c204e5978010b33", "sha256": "76f59691c3216b47d32fef6274c18c12d61f602f1444b7ef4b135b03801f6d37" }, "downloads": -1, "filename": "razdel-0.5.0-py3-none-any.whl", "has_sig": false, "md5_digest": "0c7a610b55ce5fd47c204e5978010b33", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21149, "upload_time": "2020-03-26T04:27:52", "upload_time_iso_8601": "2020-03-26T04:27:52.591609Z", "url": "https://files.pythonhosted.org/packages/15/2c/664223a3924aa6e70479f7d37220b3a658765b9cfe760b4af7ffdc50d38f/razdel-0.5.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "638852a3b703aaa57927e1e40a1a74dc", "sha256": "4334c0fdfe34d4e888cf0ed854968c9df14f0a547df909a77f4634f9ffe626e6" }, "downloads": -1, "filename": "razdel-0.5.0.tar.gz", "has_sig": false, "md5_digest": "638852a3b703aaa57927e1e40a1a74dc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19248, "upload_time": "2020-03-26T04:27:54", "upload_time_iso_8601": 
"2020-03-26T04:27:54.331031Z", "url": "https://files.pythonhosted.org/packages/70/ea/0151ae55bd26699487e668a865ef43e49409025c7464569beffe1a5789f0/razdel-0.5.0.tar.gz", "yanked": false, "yanked_reason": null } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "0c7a610b55ce5fd47c204e5978010b33", "sha256": "76f59691c3216b47d32fef6274c18c12d61f602f1444b7ef4b135b03801f6d37" }, "downloads": -1, "filename": "razdel-0.5.0-py3-none-any.whl", "has_sig": false, "md5_digest": "0c7a610b55ce5fd47c204e5978010b33", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21149, "upload_time": "2020-03-26T04:27:52", "upload_time_iso_8601": "2020-03-26T04:27:52.591609Z", "url": "https://files.pythonhosted.org/packages/15/2c/664223a3924aa6e70479f7d37220b3a658765b9cfe760b4af7ffdc50d38f/razdel-0.5.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "638852a3b703aaa57927e1e40a1a74dc", "sha256": "4334c0fdfe34d4e888cf0ed854968c9df14f0a547df909a77f4634f9ffe626e6" }, "downloads": -1, "filename": "razdel-0.5.0.tar.gz", "has_sig": false, "md5_digest": "638852a3b703aaa57927e1e40a1a74dc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19248, "upload_time": "2020-03-26T04:27:54", "upload_time_iso_8601": "2020-03-26T04:27:54.331031Z", "url": "https://files.pythonhosted.org/packages/70/ea/0151ae55bd26699487e668a865ef43e49409025c7464569beffe1a5789f0/razdel-0.5.0.tar.gz", "yanked": false, "yanked_reason": null } ], "vulnerabilities": [] }