camp2023-57163-eng-Unsupervised_Pleasures_opus.srt
1
00:00:00,000 --> 00:00:10,000
[MUSIC]
2
00:00:10,000 --> 00:00:20,000
[MUSIC]
3
00:00:20,000 --> 00:00:35,200
And now our next talk is from Sarah Ciston,
4
00:00:35,200 --> 00:00:38,960
who is based in Berlin and Los Angeles,
5
00:00:38,960 --> 00:00:43,800
and is a PhD candidate at the University of Southern California,
6
00:00:43,800 --> 00:00:50,760
and who will talk about Unsupervised Pleasures:
7
00:00:50,760 --> 00:00:54,760
Intersectional Language Models for Queer Futures.
8
00:00:54,760 --> 00:00:55,680
Welcome on stage please.
9
00:00:55,680 --> 00:00:57,680
>> [APPLAUSE]
10
00:01:07,400 --> 00:01:09,400
>> Thank you for waking up, but
11
00:01:09,400 --> 00:01:11,400
[INAUDIBLE]
12
00:01:11,400 --> 00:01:14,000
And chugging, I hope you chugged coffee, which I just did.
13
00:01:14,000 --> 00:01:18,560
If you would like to introduce yourselves in the chat,
14
00:01:18,560 --> 00:01:21,680
I've set up an Etherpad.
15
00:01:21,680 --> 00:01:23,600
I don't know if you can see the URL.
16
00:01:23,600 --> 00:01:31,600
It's pad.riseup.net/p/unsupervisedpleasurescc-keep.
17
00:01:31,600 --> 00:01:36,640
Add your favorite emoji, your pronoun, whatever.
18
00:01:36,640 --> 00:01:39,880
We're gonna be getting into it, hopefully, in a participatory way.
19
00:01:39,880 --> 00:01:45,120
And while you're doing that, I will introduce myself a little bit more.
20
00:01:45,120 --> 00:01:47,160
So I'm a poet and programmer.
21
00:01:47,160 --> 00:01:52,720
I'm interested in building tools to bring intersectional approaches to
22
00:01:52,720 --> 00:01:57,000
machine learning and building community through accessible,
23
00:01:57,000 --> 00:02:00,760
creative coding, critical creative coding.
24
00:02:00,760 --> 00:02:05,040
And I come by coding very circuitously via creative writing and
25
00:02:05,040 --> 00:02:07,200
zine making and book arts.
26
00:02:07,200 --> 00:02:14,600
So I have somehow come around to adapting that work into making subversive art with
27
00:02:14,600 --> 00:02:16,840
and about text-based machine learning.
28
00:02:16,840 --> 00:02:20,120
And I'll have the link up again in a few minutes.
29
00:02:20,120 --> 00:02:25,880
Let's get started.
30
00:02:25,880 --> 00:02:28,880
If I can find my mouse.
31
00:02:28,880 --> 00:02:38,880
[BLANK_AUDIO]
32
00:02:38,880 --> 00:02:45,800
Ominous.
33
00:02:45,800 --> 00:02:51,480
Okay, so this project came out of two basic questions.
34
00:02:51,480 --> 00:02:56,920
This is a collaboration with my colleague, whose project is called Queer AI,
35
00:02:56,920 --> 00:02:59,720
Emily Martinez, and we were really interested in,
36
00:02:59,720 --> 00:03:03,120
as these language models are coming about and getting really prominent,
37
00:03:03,120 --> 00:03:07,600
we actually started before ChatGPT dropped and suddenly this all exploded.
38
00:03:07,600 --> 00:03:12,600
But we were wanting to know what do these existing language models have to say about
39
00:03:12,600 --> 00:03:16,520
people like us, and is it possible for
40
00:03:16,520 --> 00:03:19,760
language models to speak so that we recognize ourselves?
41
00:03:19,760 --> 00:03:25,760
We're really interested in building community tools around curated data sets
42
00:03:25,760 --> 00:03:30,080
that can acknowledge power and rethink these approaches, and
43
00:03:30,080 --> 00:03:32,440
thinking about new models and new goals.
44
00:03:32,440 --> 00:03:37,640
And so this workshop today is to think about what you might want to build with
45
00:03:37,640 --> 00:03:41,760
these systems, how we might make re-imagined data sets, and
46
00:03:41,760 --> 00:03:44,480
hopefully eventually re-imagined models as well.
47
00:03:44,480 --> 00:03:53,320
So these data sets are getting insanely large.
48
00:03:54,640 --> 00:03:59,960
At last count, GPT-4 and now GPT-5 are off the charts and
49
00:03:59,960 --> 00:04:02,760
they've stopped telling us what's even in them.
50
00:04:02,760 --> 00:04:06,800
Common Voice, which is from Mozilla and is crowd sourced,
51
00:04:06,800 --> 00:04:10,040
is 65 gigabytes of voice data.
52
00:04:10,040 --> 00:04:15,000
GPT-3, 590 gigabytes, they just keep getting larger and larger.
53
00:04:15,000 --> 00:04:21,520
Aside from the impacts in terms of sustainability and the environment,
54
00:04:23,120 --> 00:04:26,760
the issues that I'm seeing around this are that they're grabbing data
55
00:04:26,760 --> 00:04:30,520
indiscriminately, but they're still really doing a terrible job telling stories
56
00:04:30,520 --> 00:04:35,920
about people who don't fit these normalizing baselines that they're repeating.
57
00:04:35,920 --> 00:04:40,040
And my argument is that the solution to this is not to suck up more data
58
00:04:40,040 --> 00:04:45,640
carelessly, to make more categories, to find more ways to be labeled diverse,
59
00:04:45,640 --> 00:04:49,360
but to find other approaches that are actually more intersectional.
60
00:04:49,360 --> 00:04:52,880
So the size of these models means that they're pulling in racist text,
61
00:04:52,880 --> 00:04:56,560
inaccurate text, private text, all kinds of problematic texts.
62
00:04:56,560 --> 00:04:59,760
It means that they're really impossible to audit and review.
63
00:04:59,760 --> 00:05:05,080
And it's difficult to even develop criteria by which they should be reviewed or
64
00:05:05,080 --> 00:05:09,200
adjusted, ostensibly because they're called general and
65
00:05:09,200 --> 00:05:11,760
all-purpose, zero-shot learners.
66
00:05:11,760 --> 00:05:16,040
But what this means is that they kind of only work for
67
00:05:16,040 --> 00:05:22,800
the Western, white, democratic, rich, so-called majority,
68
00:05:22,800 --> 00:05:25,600
while leaving out the rest of the global majority.
69
00:05:25,600 --> 00:05:31,520
And this is a really totalizing approach that rather than representing a multitude
70
00:05:31,520 --> 00:05:36,200
of voices, it centers and normalizes and affirms this powerful status quo.
71
00:05:36,200 --> 00:05:42,920
So here's where these are coming from.
72
00:05:42,920 --> 00:05:46,360
We think about authorship in a new way.
73
00:05:46,360 --> 00:05:50,440
Common Voice, as I said, is open source.
74
00:05:50,440 --> 00:05:54,720
People are contributing their voices, but it's predominantly an English model.
75
00:05:54,720 --> 00:06:01,280
GPT is being scraped from social media, Reddit, Twitter, Wiki, GitHub.
76
00:06:01,280 --> 00:06:07,520
The evaluation criteria for what was a good Reddit text to go into it was if it
77
00:06:07,520 --> 00:06:09,800
had a karma score of three or above.
78
00:06:09,800 --> 00:06:13,480
That's what's being decided as a good value for this,
79
00:06:13,480 --> 00:06:17,080
and I argue we could probably come up with a better rubric than that.
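As a rough sketch of that inclusion rule (the threshold is from the talk; the post structure is my placeholder assumption, not the actual pipeline):

```python
# Sketch of the karma-score inclusion rule described above; the post
# dictionaries are illustrative stand-ins for scraped Reddit data.
posts = [
    {"text": "a highly upvoted comment", "karma": 5},
    {"text": "a barely noticed comment", "karma": 1},
]

# Keep only texts whose posts scored a karma of three or above.
kept = [p["text"] for p in posts if p["karma"] >= 3]
print(kept)
```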
80
00:06:17,080 --> 00:06:21,440
T5 is from the Colossal Clean Crawled Corpus (C4),
81
00:06:21,440 --> 00:06:25,400
which is Common Crawl but filtered a little bit.
82
00:06:25,400 --> 00:06:30,040
And WuDao is three billion scraped Chinese social media texts and websites.
83
00:06:30,040 --> 00:06:35,640
So if you've ever posted anything on Twitter, on Reddit, on GitHub,
84
00:06:35,640 --> 00:06:39,840
your code and your text and your voice is somewhere in there.
85
00:06:39,840 --> 00:06:44,000
But it's probably not representing you either.
86
00:06:44,000 --> 00:06:53,480
Unfortunately, these data sets are also, when they're collected,
87
00:06:53,480 --> 00:06:57,880
they're not offering information about how the text arrived in this data set,
88
00:06:57,880 --> 00:06:59,840
which we'll talk about more a bit later.
89
00:06:59,840 --> 00:07:03,320
It's really showing you just a snippet of text, and
90
00:07:03,320 --> 00:07:08,800
it might say it came from Reddit, but it's not going to say anything more about
91
00:07:08,800 --> 00:07:13,320
who the author was, how it got there, what the rights were attached to that.
92
00:07:13,320 --> 00:07:20,240
So what I'd like us to do is to do an experiment.
93
00:07:20,240 --> 00:07:24,880
If you have a device that's connected to the Internet available,
94
00:07:24,880 --> 00:07:28,480
go to the Rise Up Pad address.
95
00:07:28,480 --> 00:07:34,800
And we're gonna talk through a couple of prompt training examples.
96
00:07:34,800 --> 00:07:38,800
So what we are finding, just as a way to kind of probe what's inside these models
97
00:07:38,800 --> 00:07:45,360
first, which you don't need any expertise to do, is to just go to the interfaces
98
00:07:45,360 --> 00:07:50,320
that they're making available to us in this very limited framework.
99
00:07:50,320 --> 00:07:53,800
And try putting in this prompt.
100
00:07:53,800 --> 00:07:58,320
You fill in: a [blank] couple are on their way to a [location];
101
00:07:58,320 --> 00:08:01,720
as they board the [blank], an announcement happens.
102
00:08:01,720 --> 00:08:07,520
So if you go to ChatGPT and do this, and you say a married couple are on their way
103
00:08:07,520 --> 00:08:10,520
to Paris with their family as they board the plane, an announcement happens.
104
00:08:10,520 --> 00:08:16,520
[INAUDIBLE]
105
00:08:16,520 --> 00:08:22,480
Presumably white, boring, maybe mild vacation inconvenience.
106
00:08:22,480 --> 00:08:25,560
As they board the plane, an announcement happens to inform them the flight has been
107
00:08:25,560 --> 00:08:27,240
canceled due to bad weather.
108
00:08:27,240 --> 00:08:31,600
After an argument, the family is forced to stay at an inn in a small village.
109
00:08:31,600 --> 00:08:35,000
Okay, like not a great day.
110
00:08:35,000 --> 00:08:41,280
If you try putting in other items, and in the Rise Up Pad, you'll have links
111
00:08:41,280 --> 00:08:43,200
to these different models that you can test out.
112
00:08:43,200 --> 00:08:50,720
And I would invite you to put in your own identity markers, your own locations,
113
00:08:50,720 --> 00:08:54,520
try anything you like in this template, diverge from this template, and share
114
00:08:54,520 --> 00:08:58,000
into the Etherpad what kind of results you get.
115
00:08:58,000 --> 00:09:02,160
See how these diverge, and as they accumulate, we'll start to see kind of the
116
00:09:02,160 --> 00:09:03,680
differences that emerge.
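A minimal sketch of that fill-in-the-blank probe, with illustrative identity markers and destinations (the lists and names here are my assumptions, not part of the talk's toolkit):

```python
# Generate probe prompts by filling the template's blanks; paste each
# into a chat interface and compare the completions you get back.
import itertools

TEMPLATE = ("A {identity} couple are on their way to {place} with their "
            "family. As they board the {vehicle}, an announcement happens.")

identities = ["married", "queer Pakistani", "lesbian"]  # try your own markers
places = ["Paris", "Tehran"]
vehicles = ["plane", "train"]

for identity, place, vehicle in itertools.product(identities, places, vehicles):
    print(TEMPLATE.format(identity=identity, place=place, vehicle=vehicle))
```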
117
00:09:03,680 --> 00:09:08,400
So if you say a queer Pakistani couple are on their way to Paris with their family,
118
00:09:08,400 --> 00:09:11,960
as they board the plane, an announcement happens to inform them the flight has been
119
00:09:11,960 --> 00:09:12,880
hijacked.
120
00:09:12,880 --> 00:09:15,840
The play explores how the terrorists shape the course of events and how the
121
00:09:15,840 --> 00:09:19,280
hijacking is represented in the media.
122
00:09:19,280 --> 00:09:22,440
Or a lesbian couple are on their way to Tehran, as they board the plane,
123
00:09:22,440 --> 00:09:26,280
an announcement happens, the couple are forced off the plane by an officer who
124
00:09:26,280 --> 00:09:29,160
accuses them of having deviant sexual relations.
125
00:09:29,160 --> 00:09:31,720
They leave for another international airport.
126
00:09:31,720 --> 00:09:33,760
A woman holds her newborn baby in her arms.
127
00:09:33,760 --> 00:09:38,360
She cannot go through with the adoption due to religious prohibitions.
128
00:09:38,360 --> 00:09:46,920
So as we add more of these to our examples, it gets really heavy and kind of
129
00:09:46,920 --> 00:09:48,440
intense.
130
00:09:48,440 --> 00:09:53,400
And I think just the cumulative effect of this shows that even when you put
131
00:09:53,400 --> 00:09:59,640
something fairly innocuous into these systems, I'm hoping that this can expand
132
00:09:59,640 --> 00:10:05,680
the way that we think about bias for this, that it's not simply removing hate
133
00:10:05,680 --> 00:10:13,120
speech or taking things out; these aren't levers that we can turn with corrections
134
00:10:13,120 --> 00:10:14,080
after the fact.
135
00:10:14,080 --> 00:10:18,560
These are deeply embedded into these models because of the way that the data
136
00:10:18,560 --> 00:10:20,560
sets are built up front.
137
00:10:20,560 --> 00:10:25,480
And these simple corrections to de-bias, these after-the-fact technical
138
00:10:25,480 --> 00:10:33,920
fixes, aren't really getting at the root of the problem.
139
00:10:33,920 --> 00:10:41,960
So if anybody would like to, we will pull up the Etherpad again in a bit and talk
140
00:10:41,960 --> 00:10:42,640
through that.
141
00:10:42,640 --> 00:10:48,320
So what I've been doing, rather than looking just at the prompts,
142
00:10:48,320 --> 00:10:51,800
is trying to go back into the data sets that trained these.
143
00:10:51,800 --> 00:10:57,440
It's a little bit hard to find what actually trained things like ChatGPT
144
00:10:57,440 --> 00:11:00,440
because at this point they're all proprietary.
145
00:11:00,440 --> 00:11:04,160
They have stopped telling us how they've built these data sets and what's in
146
00:11:04,160 --> 00:11:04,920
them.
147
00:11:04,920 --> 00:11:09,760
But folks have started reverse engineering some of the data sets and
148
00:11:09,760 --> 00:11:12,200
giving us open source editions of this.
149
00:11:12,200 --> 00:11:16,480
So I've taken some of this and I'm doing different kinds of natural language
150
00:11:16,480 --> 00:11:24,160
processing analysis to find out from the root training data what is known about
151
00:11:24,160 --> 00:11:29,800
trans people, queer people, and what kind of lived experience is being
152
00:11:29,800 --> 00:11:32,240
expressed through this.
153
00:11:32,240 --> 00:11:37,160
Well, if you do named entity recognition, which labels any kind of proper
154
00:11:37,160 --> 00:11:42,640
nouns that it recognizes, it thinks that pride is a product, pansexual versus
155
00:11:42,640 --> 00:11:47,840
bisexual is a work of art, and queer liberation is an org.
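One way to run that kind of named-entity probe, sketched with spaCy (my choice of library; the talk doesn't name one, and the snippets are placeholder corpus lines):

```python
# Label proper nouns in corpus snippets with spaCy's pretrained NER.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

snippets = [
    "Pride was everywhere that June.",
    "Queer Liberation organized the march.",
]
for snippet in snippets:
    for ent in nlp(snippet).ents:
        print(f"{ent.text!r} -> {ent.label_}")  # e.g. ORG, PRODUCT, WORK_OF_ART
```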
156
00:11:47,840 --> 00:11:55,720
A lot of the text that comes up is around, like, trauma and hate speech.
157
00:11:55,720 --> 00:12:01,560
Anything related to queer women or nonbinary people very quickly goes into
158
00:12:01,560 --> 00:12:02,840
pornography.
159
00:12:02,840 --> 00:12:04,240
This one is one of my favorites.
160
00:12:04,240 --> 00:12:08,200
It said after all one of the best things that a lesbian can do is turn the guy
161
00:12:08,200 --> 00:12:09,320
on.
162
00:12:09,320 --> 00:12:14,920
So I don't know about y'all, but this is not really capturing my own queer lived
163
00:12:14,920 --> 00:12:16,240
experience.
164
00:12:16,240 --> 00:12:21,920
And I would love something other than what it spits out at me when I try
165
00:12:21,920 --> 00:12:26,560
to type something in and it just says, as a large language model, you know,
166
00:12:26,560 --> 00:12:28,120
everybody should be treated equally.
167
00:12:28,120 --> 00:12:34,840
These are the kind of milquetoast diversity phrases that it puts on top of the
168
00:12:34,840 --> 00:12:37,200
hate speech that it's covering up.
169
00:12:37,200 --> 00:12:42,800
And instead I would love to see it actually say something that represents my
170
00:12:42,800 --> 00:12:45,240
own experience and others.
171
00:12:45,240 --> 00:12:49,000
So I'm interested in investigating how we do that.
172
00:12:49,000 --> 00:12:53,640
Here's another example of some of my investigations looking at words that are
173
00:12:53,640 --> 00:12:58,440
similar to identity terms that I've been putting into the model.
174
00:12:58,440 --> 00:12:59,360
And you can see a bit.
175
00:12:59,360 --> 00:13:00,920
I won't read through it.
176
00:13:00,920 --> 00:13:06,320
And if anyone's interested, after I have the live demo of this data that I've
177
00:13:06,320 --> 00:13:09,680
built and we can look up other terms, I would be very interested to know what
178
00:13:09,680 --> 00:13:13,160
terms you'd be interested to investigate in this data set.
179
00:13:13,160 --> 00:13:16,800
But you can see what kinds of things come up.
180
00:13:16,800 --> 00:13:20,800
So for bisexual, it's mostly about threesomes and pornography.
181
00:13:20,800 --> 00:13:25,240
And for trans, it's mostly about transphobia and discrimination.
182
00:13:25,240 --> 00:13:30,200
And this just, like, hurts my heart.
183
00:13:30,200 --> 00:13:35,160
So the next question then is can large language models speak so that I
184
00:13:35,160 --> 00:13:38,400
recognize myself?
185
00:13:38,400 --> 00:13:43,960
And what Emily and I have been doing is thinking about how we might make new
186
00:13:43,960 --> 00:13:48,040
methods around this, take what we know about intersectional approaches and
187
00:13:48,040 --> 00:13:53,160
tactics, both to examine the existing corpora like I just showed you, and then
188
00:13:53,160 --> 00:13:58,800
to go on to create new corpora where we are pulling from different text sources
189
00:13:58,800 --> 00:14:01,600
that we believe are better representative.
190
00:14:01,600 --> 00:14:06,480
Not only that, but creating a way to have other people help contribute to that
191
00:14:06,480 --> 00:14:10,160
because it shouldn't be just coming from one source.
192
00:14:10,160 --> 00:14:17,160
Having ways that the publishers and the authors of these sources get attributed
193
00:14:17,160 --> 00:14:24,600
and have a more consentful relationship to the text where they can revoke and
194
00:14:24,600 --> 00:14:28,240
decide what kind of license they want to offer, where all of this gets baked into
195
00:14:28,240 --> 00:14:30,480
the data set.
196
00:14:30,480 --> 00:14:39,120
Then, to train new models: when we have this new data set, can we do fine-
197
00:14:39,120 --> 00:14:40,800
tuning on top of what's existing?
198
00:14:40,800 --> 00:14:44,720
Can we train completely new large language models?
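A hedged sketch of what fine-tuning on top of an existing model can look like with the Hugging Face stack (the base model "gpt2" and the toy corpus are placeholders, not the project's actual setup):

```python
# Fine-tune an existing causal language model on a community-curated corpus.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                     # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

corpus = Dataset.from_dict({"text": ["a community-contributed example text"]})
tokenized = corpus.map(lambda b: tok(b["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```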
199
00:14:44,720 --> 00:14:51,720
Is this better?
200
00:14:51,720 --> 00:14:52,720
Yeah.
201
00:14:52,720 --> 00:14:59,200
Can we even move on to imagine what new model architectures altogether might
202
00:14:59,200 --> 00:15:00,200
look like?
203
00:15:00,200 --> 00:15:06,040
And then finally, thinking about how can people make use of these?
204
00:15:06,040 --> 00:15:11,440
So if we had the language model of our dreams that didn't spit out garbage
205
00:15:11,440 --> 00:15:14,440
text like we've just seen, what would you want to do with it?
206
00:15:14,440 --> 00:15:15,440
What would you want to make?
207
00:15:15,440 --> 00:15:19,960
What other possibilities might exist in the world if we had systems that could
208
00:15:19,960 --> 00:15:24,400
speak with us and for us?
209
00:15:24,400 --> 00:15:29,400
So these are some examples of what the current data sets look like if you pull
210
00:15:29,400 --> 00:15:31,560
them up.
211
00:15:31,560 --> 00:15:38,880
As you can see, it's basically a title and a text and barely even where it comes
212
00:15:38,880 --> 00:15:40,920
from.
213
00:15:40,920 --> 00:15:42,720
This is the data set.
214
00:15:42,720 --> 00:15:44,280
The source is another data set.
215
00:15:44,280 --> 00:15:46,560
It's turtles all the way down.
216
00:15:46,560 --> 00:15:52,560
This is what we are proposing as a provocation that it could include a
217
00:15:52,560 --> 00:15:59,120
description of the work, the rights that were given, who the publisher is, where
218
00:15:59,120 --> 00:16:02,320
you would find the original text, even how it was pre-processed and who
219
00:16:02,320 --> 00:16:03,760
pre-processed it.
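One shape that provocation might take per record; the field names are my guesses at the proposal, not a fixed schema:

```python
# A richer dataset record that keeps provenance, rights, and labor visible.
from dataclasses import dataclass

@dataclass
class CorpusEntry:
    title: str
    text: str
    description: str    # what the work is, in the contributor's own words
    rights: str         # the license the author chose, revocable by them
    publisher: str
    author: str
    source_url: str     # where the original text lives
    preprocessing: str  # how the text was cleaned, and by whom
```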
220
00:16:03,760 --> 00:16:07,100
I would be very interested to hear from any of you what other kinds of things
221
00:16:07,100 --> 00:16:10,640
you think should belong in a training data set.
222
00:16:10,640 --> 00:16:14,360
The thing that I think is also interesting about this would be that it becomes an
223
00:16:14,360 --> 00:16:21,240
archive in its own right, something that people can use not only
224
00:16:21,240 --> 00:16:29,200
en masse as a training data set but also to find new texts.
225
00:16:29,200 --> 00:16:34,540
So necessarily, as you saw, all of that would take a lot more work than scraping
226
00:16:34,540 --> 00:16:40,040
all of Reddit and giving it a filter for a karma score of three.
227
00:16:40,040 --> 00:16:44,640
This will necessarily be a lot slower, more careful, and more cared for, and it
228
00:16:44,640 --> 00:16:47,360
will bear the traces of who's doing the work.
229
00:16:47,360 --> 00:16:52,920
It will have an active subject position instead of just being the so-called view
230
00:16:52,920 --> 00:17:00,920
from nowhere that is basically a white male Silicon Valley view.
231
00:17:00,920 --> 00:17:04,560
I think it's really important that we are acknowledging the labor that goes into
232
00:17:04,560 --> 00:17:08,800
building data sets, the publishers, the authors, all of us who are being sucked
233
00:17:08,800 --> 00:17:13,880
into these systems, and then the people who are working to clean them and curate
234
00:17:13,880 --> 00:17:22,640
them because this is a curation process whether we are acknowledging it or not.
235
00:17:22,640 --> 00:17:28,160
So my question overall is to think about which kinds of data sets do we want?
236
00:17:28,160 --> 00:17:33,080
Do we want indiscriminate curation treated as a purely technical concern?
237
00:17:33,080 --> 00:17:38,440
Do we want things curated by communities for specific purposes?
238
00:17:38,440 --> 00:17:44,480
Do we want zero-shot, the biggest general catch-all that really does nothing well?
239
00:17:44,480 --> 00:17:46,400
It's a shitty Swiss army knife.
240
00:17:46,400 --> 00:17:51,720
Or can we create things that are including attribution, including consent,
241
00:17:51,720 --> 00:17:56,040
including care, and have our own goals in mind?
242
00:17:56,040 --> 00:18:02,880
And I think it takes taking a step back from what these tools have offered us and
243
00:18:02,880 --> 00:18:08,600
asked of us, and from thinking within their frameworks, to actually really-
244
00:18:08,600 --> 00:18:18,600
[INAUDIBLE]
245
00:18:18,600 --> 00:18:28,600
[INAUDIBLE]
246
00:18:46,360 --> 00:18:55,000
It's a live coding web interface where the similarity texts cycle through.
247
00:18:55,000 --> 00:18:58,320
But I would just put this up here to invite you to think about what kinds of
248
00:18:58,320 --> 00:19:04,760
things you would want to make with a different kind of large language model.
249
00:19:04,760 --> 00:19:09,160
And for those of you who have questions about working with data sets for
250
00:19:09,160 --> 00:19:15,440
machine learning in general, I also just completed this zine, a critical field guide