author | Lars-Dominik Braun <lars@6xq.net> | 2019-11-08 16:06:37 +0100 |
---|---|---|
committer | Lars-Dominik Braun <lars@6xq.net> | 2019-11-08 21:34:15 +0100 |
commit | 38c9ed5b042ae488ee12287bf8c19457189889aa (patch) | |
tree | d4f49039eec711aa7c9ee21c691f46bc89316e48 /doc | |
parent | e31d8731531b41a909bfe33ddc134de07f0a7bab (diff) | |
Add OpenSubtitles corpus
See issue #5.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/index.html | 6 |
1 files changed, 4 insertions, 2 deletions
diff --git a/doc/index.html b/doc/index.html
index 19151b0..a390ddf 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -132,6 +132,8 @@
 Arabic</a>, another Arabic-language news site</li>
 <li>116,754 documents from the
 <a href="https://conferences.unite.un.org/UNCorpus/en/DownloadOverview">United Nations Parallel Corpus v1.0</a></li>
+<li>subtitles from 94,093 movies based on a
+<a href="http://opus.nlpl.eu/OpenSubtitles-v2018.php">2018 OpenSubtitles dump</a></li>
 <li>1,709 ebooks from
 <a href="https://www.hindawi.org/books">hindawi.org</a></li>
 <li>and a plain-text copy of the Quran from <a
@@ -140,8 +142,8 @@
 </ul>
 <p>
 summing up to roughly
-825 million words or
-5.5 billion characters. <!-- == combined button presses -->
+1.2 billion words or
+7.6 billion characters. <!-- == combined button presses -->
 <!-- -->
 The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be considered the most
 frequently used letters in the Arabic language.
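As a rough sanity check on the updated totals, the following sketch computes how much the word and character counts grew. It uses only the rounded figures quoted in the diff, and it assumes the whole increase comes from the newly added OpenSubtitles corpus, which the commit does not state explicitly. The names `OLD`, `NEW`, and `growth` are illustrative and not part of the lulua code base.

```python
# Rounded totals quoted in the diff; the real counts are only approximate.
OLD = {"words": 825_000_000, "chars": 5_500_000_000}
NEW = {"words": 1_200_000_000, "chars": 7_600_000_000}


def growth(old, new):
    """Return the per-key difference between two total dictionaries."""
    return {key: new[key] - old[key] for key in old}


added = growth(OLD, NEW)

# Assumption: the whole increase is attributable to the subtitle corpus.
chars_per_added_word = added["chars"] / added["words"]

print(f"added words:      {added['words']:,}")
print(f"added characters: {added['chars']:,}")
print(f"characters per added word: {chars_per_added_word:.1f}")
```

Because the inputs are rounded to two significant digits, the resulting ratio is only indicative.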