{
"Name": "CIDAR",
"Subsets": [],
"Link": "https://hf.co/datasets/arbml/CIDAR",
"HF Link": "https://hf.co/datasets/arbml/CIDAR",
"License": "CC BY-NC 4.0",
"Year": 2024,
"Language": "ar",
"Dialect": "Modern Standard Arabic",
"Domain": [
"commentary",
"LLM"
],
"Form": "text",
"Collection Style": [
"crawling",
"LLM generated",
"manual curation"
],
"Description": "CIDAR contains 10,000 instructions and their outputs. The dataset was created by selecting around 9,109 samples from the AlpaGasus dataset and translating them to Arabic using ChatGPT. In addition, around 891 Arabic grammar instructions from the website Ask the Teacher were appended. All 10,000 samples were reviewed by around 12 reviewers.",
"Volume": 10000,
"Unit": "sentences",
"Ethical Risks": "Low",
"Provider": [
"ARBML"
],
"Derived From": [
"AlpaGasus"
],
"Paper Title": "CIDAR: Culturally Relevant Instruction Dataset For Arabic",
"Paper Link": "https://arxiv.org/pdf/2402.03177",
"Script": "Arab",
"Tokenized": "No",
"Host": "HuggingFace",
"Access": "Free",
"Cost": "",
"Test Split": "No",
"Tasks": [
"instruction tuning",
"question answering"
],
"Venue Title": "arXiv",
"Citations": 0,
"Venue Type": "preprint",
"Venue Name": "",
"Authors": [
"Zaid Alyafeai",
"Khalid Almubarak",
"Ahmed Ashraf",
"Deema Alnuhait",
"Saied Alshahrani",
"Gubran A. Q. Abdulrahman",
"Gamil Ahmed",
"Qais Gawah",
"Zead Saleh",
"Mustafa Ghaleb",
"Yousef Ali",
"Maged S. Al-Shaibani"
],
"Affiliations": [],
"Abstract": "Instruction tuning has emerged as a prominent methodology for teaching Large Language Models (LLMs) to follow instructions. However, current instruction datasets predominantly cater to English or are derived from English-dominated LLMs, resulting in inherent biases toward Western culture. This bias significantly impacts the linguistic structures of non-English languages such as Arabic, which has a distinct grammar reflective of the diverse cultures across the Arab region. This paper addresses this limitation by introducing CIDAR, the first open Arabic instruction-tuning dataset culturally-aligned by human reviewers. CIDAR contains 10,000 instruction and output pairs that represent the Arab region. We discuss the cultural relevance of CIDAR via the analysis and comparison to other models fine-tuned on other datasets. Our experiments show that CIDAR can help enrich research efforts in aligning LLMs with the Arabic culture. All the code is avail",
"Added By": "Zaid Alyafeai",
"annotations_from_paper": {
"Name": 1,
"Subsets": 1,
"HF Link": 1,
"Link": 1,
"License": 0,
"Year": 1,
"Language": 1,
"Dialect": 1,
"Domain": 1,
"Form": 1,
"Collection Style": 1,
"Description": 1,
"Volume": 1,
"Unit": 1,
"Ethical Risks": 1,
"Provider": 1,
"Derived From": 1,
"Paper Title": 1,
"Paper Link": 1,
"Script": 1,
"Tokenized": 1,
"Host": 1,
"Access": 1,
"Cost": 1,
"Test Split": 1,
"Tasks": 1,
"Venue Title": 1,
"Citations": 0,
"Venue Type": 1,
"Venue Name": 1,
"Authors": 1,
"Affiliations": 1,
"Abstract": 1,
"Added By": 0
}
}