From 38c9ed5b042ae488ee12287bf8c19457189889aa Mon Sep 17 00:00:00 2001
From: Lars-Dominik Braun
Date: Fri, 8 Nov 2019 16:06:37 +0100
Subject: Add OpenSubtitles corpus

See issue #5.
---
 doc/index.html | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'doc')

diff --git a/doc/index.html b/doc/index.html
index 19151b0..a390ddf 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -132,6 +132,8 @@ Arabic, another Arabic-language news site
 • 116,754 documents from the United Nations Parallel Corpus v1.0
+• subtitles from 94,093 movies based on a
+  2018 OpenSubtitles dump
 • 1,709 ebooks from hindawi.org
 • and a plain-text copy of the Quran from

 summing up to roughly
-825 million words or
-5.5 billion characters.
+1.2 billion words or
+7.6 billion characters.
 The plot below shows ا ل ي م و ن can be considered the most frequently used letters in the Arabic language.
--
cgit v1.2.3