author | Lars-Dominik Braun <lars@6xq.net> | 2019-11-10 09:44:35 +0100 |
---|---|---|
committer | Lars-Dominik Braun <lars@6xq.net> | 2019-11-16 10:17:26 +0100 |
commit | 14daa5644598836fd6321038c6b0a496c7874374 (patch) | |
tree | 2309443ffec0acea7b8eda095dba17436c0deb0a /doc/index.html | |
parent | 38c9ed5b042ae488ee12287bf8c19457189889aa (diff) | |
download | lulua-14daa5644598836fd6321038c6b0a496c7874374.tar.gz lulua-14daa5644598836fd6321038c6b0a496c7874374.tar.bz2 lulua-14daa5644598836fd6321038c6b0a496c7874374.zip |
doc: Auto-generate corpus table
Diffstat (limited to 'doc/index.html')
-rw-r--r-- | doc/index.html | 33 |
1 files changed, 5 insertions, 28 deletions
diff --git a/doc/index.html b/doc/index.html
index a390ddf..e930892 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -7,9 +7,8 @@
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono|IBM+Plex+Sans:100,400&display=swap" rel="stylesheet">
- <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/base-min.css" crossorigin="anonymous">
- <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/grids-min.css" crossorigin="anonymous">
- <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/grids-responsive-min.css" crossorigin="anonymous">
+ <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/pure-min.css" integrity="sha384-oAOxQR6DkCoMliIh8yFnu25d7Eq/PHS21PClpwjOTeU2jRSq11vu66rf90/cZr47" crossorigin="anonymous">
+ <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/grids-responsive-min.css">
  <script src="https://cdn.pydata.org/bokeh/release/bokeh-1.3.4.min.js"></script>
  <link rel="stylesheet" href="style.css">
  </head>
@@ -119,32 +118,10 @@
  <!-- -->
  The corpus used for the following analysis consists of
  </p>
- <ul>
- <li><a href="https://dumps.wikimedia.org/arwiki/20190701/">a
- dump</a> of the <a href="https://ar.wikipedia.org/">Arabic
- Wikipedia</a> as of July 2019, extracted using
- <a href="https://github.com/attardi/wikiextractor/tree/3162bb6c3c9ebd2d15be507aa11d6fa818a454ac">wikiextractor</a>
- containing 857,386 articles</li>
- <li>547,110 articles from
- <a href="https://www.aljazeera.net/">aljazeera.net</a>, an
- Arabic-language news site</li>
- <li>149,901 articles from <a href="http://www.bbc.com/arabic">BBC
- Arabic</a>, another Arabic-language news site</li>
- <li>116,754 documents from the
- <a href="https://conferences.unite.un.org/UNCorpus/en/DownloadOverview">United Nations Parallel Corpus v1.0</a></li>
- <li>subtitles from 94,093 movies based on a
- <a href="http://opus.nlpl.eu/OpenSubtitles-v2018.php">2018 OpenSubtitles dump</a></li>
- <li>1,709 ebooks from <a
- href="https://www.hindawi.org/books">hindawi.org</a></li>
- <li>and a plain-text copy of the Quran from <a
- href="http://tanzil.net/docs/download">tanzil.net</a> using the
- options Simple Enhanced and Text (for inclusion of diacritics)</li>
- </ul>
+
+ #include "corpus.html"
+
  <p>
- summing up to roughly
- 1.2 billion words or
- 7.6 billion characters. <!-- == combined button presses -->
- <!-- -->
  The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be
  considered the most frequently used letters in the Arabic language.
  <!-- -->