How to raise an AI model in today's world

sezwho

Member
SoSH Member
Jul 20, 2005
2,028
Isle of Plum
Hello World, coming up on 20 years here, but I think this is my first thread, so be gentle : )

I work in a world where LLMs and what they're trained on are important, and a lead developer shared this reference: https://arxiv.org/abs/2306.13141

Some of you still closer to the metal (I lost my developer fastball years ago) may be better able to parse it, but what I've extracted is that the broader the sample you draw from (publicly available data), the more biased/hateful your AI becomes.

This was interesting enough on its own (to me?) that I thought some of you would be interested, but I was also curious if anyone could help suss out the why. Maybe it's an 'organic' process and people are just ass-hats? Maybe people are mezza mezza but become monsters when on the internet (leader in clubhouse : ). It's also got me thinking: are hate groups organizing to apply AI on the internet to produce hate content?
 

slamminsammya

Member
SoSH Member
Jul 31, 2006
9,511
San Francisco
I'd guess this is because the more often a model sees something, the more likely it is to "learn" it, which will be true regardless of the type or size of the model. Long tails are actually not so hard to learn if you have a sufficiently large number of examples, even when those represent a small portion of the training data. But I am not an LLM expert.
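To put a toy example behind that intuition (invented corpus and numbers, nothing to do with the paper's data): a simple count-based bigram model will learn a nasty phrase that makes up only 2% of the corpus, simply because the absolute count is high.

from collections import Counter, defaultdict

# Toy sketch: a count-based bigram "model" learns whatever it sees often
# enough in absolute terms, even if it's a tiny fraction of the corpus.
benign = ["the cat sat on the mat"] * 9_800
nasty = ["group x is awful"] * 200          # 2% of the corpus
corpus = benign + nasty

bigrams = defaultdict(Counter)
for sentence in corpus:
    toks = sentence.split()
    for a, b in zip(toks, toks[1:]):
        bigrams[a][b] += 1

# After 200 sightings of "group x", the model reliably continues the phrase:
print(bigrams["group"].most_common())  # [('x', 200)]
print(bigrams["x"].most_common())      # [('is', 200)]
print(bigrams["is"].most_common())     # [('awful', 200)]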
 

singaporesoxfan

Well-Known Member
Lifetime Member
SoSH Member
Jul 21, 2004
11,890
Washington, DC
We're used to thinking, in many contexts, that the broader a sample you draw from, the more representative the sample becomes of the population you're trying to sample. But I don't know if that underlying assumption is true of large training datasets from the Internet: it might be that the most prolific sources of the data are dominated by people who hold these biases.
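A quick simulation of that point (all rates invented): if a small group of prolific posters produces most of the text, drawing a bigger and bigger sample just converges to the post-weighted bias rate, not the population rate.

import random

random.seed(1)

population_bias_rate = 0.10      # 10% of people hold the biased view...
prolific_bias_rate = 0.30        # ...but 30% of the prolific posters do,
prolific_share_of_posts = 0.90   # and prolific posters write 90% of the posts.

def draw_post():
    from_prolific = random.random() < prolific_share_of_posts
    rate = prolific_bias_rate if from_prolific else population_bias_rate
    return random.random() < rate  # True = a biased post

for n in (1_000, 100_000, 1_000_000):
    biased = sum(draw_post() for _ in range(n))
    print(f"sample size {n:>9,}: biased share = {biased / n:.3f}")
# Converges to ~0.28 -- nowhere near the 0.10 population rate, and a bigger
# sample doesn't fix it.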
 

AlNipper49

Huge Member
Dope
SoSH Member
Apr 3, 2001
44,931
Mtigawi
Training of AI is a bigger thing than actual AI. Right now the AI is trained by humans, which mostly takes the I out of AI.

AI right now is basically google on steroids. It’ll become way better sometime in the future.
 

pokey_reese

Member
SoSH Member
Jun 25, 2008
16,321
Boston, MA
Having finally had time to read through the paper, there are a lot of issues that I would take with both their methodology and conclusions, to be honest. It's not that I think there is no validity to their thesis (though I did, off the bat, have at least some skepticism there), but given that it is based on a sample-of-a-sample-of-a-sample, and that their sampling isn't bootstrapped with replacement, it feels like a lot of room is left for a repeated attempt to fail to reproduce their results even on a basic level.
 

sezwho

Member
SoSH Member
Jul 20, 2005
2,028
Isle of Plum
Thanks to those who took the time to reply.

Ultimately, I suppose some amount of GIGO (garbage in, garbage out) is going to apply, just faster with AI.

@pokey_reese I’ll do a little googling and try to bootstrap my own knowledge a bit : ) to better understand the implications of your observations on how they sampled.
 

pokey_reese

Member
SoSH Member
Jun 25, 2008
16,321
Boston, MA
So, to expand a little on my thoughts in no particular order, since you at-mentioned me (and thus asked for more words, no matter how much you may regret it):

- Generalizability is key to most of these LLMs, and more importantly it is a specific function of the training process regardless of the flavor or topic. What that means is that, with few exceptions, adding more observations of a minority class (non-dominant data points in terms of whatever you are trying to predict) will not cause the model to suddenly start predicting those rare things more often (if you are doing it right, at least), unless they were underrepresented in your original training set.

- As far as the bootstrapping thing I mentioned, this is related to the fact that most modeling is based on samples of data, and what gets pulled into your sample can bias the outcome substantially. A note here: this concept of sampling is important and easy to misunderstand in this context, because they are dealing with text-image pairs, which have more variance between observations, and a wider range of possible outcomes, than most models see. All of which is to say, if you just pick random data points you have a better chance of ending up with a sample that isn't representative of the observations you didn't select. A common solution is to instead take many repeated samples and aggregate the estimates, also known as bootstrapping, in order to reduce sample bias and generalize better to out-of-sample data points (there's a toy sketch at the end of this post).

- This is important because they are actually evaluating multiple things here, one of which is the prevalence of 'hateful/targeted/aggressive' text strings in their sample, which is going to be, by nature, subject to a great deal of random variation (without even digging into any possible issues with the method that they are using to label those). They do acknowledge the potential issue of sampling per file, and address it in section 8, but simply establish that the per-file metrics match those of the overall distribution, not that their sampling method from those files avoids bias.

- Moreover, they don't actually seem to be saying that scaling itself inherently leads to this particular phenomenon, but simply stating an observation about these two particular data sets (LAION-400M and LAION-2B-en), which are specific samples from the CommonCrawl data set. It isn't clear whether the effect they are studying is caused by scaling or by some factor unique to this one public data set, which they also note early on isn't being used to train the most high-profile AI models (which tend to have private, curated training sets).

There is a lot more to dig into here; this really only covers up through the 'data set audit' without getting into the 'model audit.' It still gives us a lot to think about, but for me personally it gives off vibes of having come up with a hypothesis and then worked to support it, rather than a neutral observation and inquiry. I'm not a blanket AI apologist, and I think issues of bias and racism in machine learning are real and serious, but this particular paper just doesn't quite get there for me, at least relative to the title/abstract.
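For the bootstrapping point above, here's the toy sketch I promised (invented labels and prevalence, not the paper's actual pipeline or thresholds): resample one labeled chunk with replacement many times and see how much the 'hateful' prevalence estimate moves around.

import random

random.seed(2)

# Pretend these are labels for one sampled chunk: 1 = flagged as hateful.
sample = [1] * 30 + [0] * 9_970   # 0.3% prevalence in this particular chunk

def bootstrap_prevalence(data, n_resamples=1_000):
    estimates = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))  # sample WITH replacement
        estimates.append(sum(resample) / len(resample))
    return sorted(estimates)

est = bootstrap_prevalence(sample)
lo, hi = est[25], est[975]        # rough 95% interval from 1,000 resamples
print(f"point estimate:          {sum(sample) / len(sample):.4f}")
print(f"~95% bootstrap interval: [{lo:.4f}, {hi:.4f}]")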
 

sezwho

Member
SoSH Member
Jul 20, 2005
2,028
Isle of Plum
Thank you for taking the deeper dive: I regret nothing! I've read it a couple of times and it's starting to make sense.


Training of AI is a bigger thing than actual AI. Right now the AI is trained by humans, which mostly takes the I out of AI.

AI right now is basically google on steroids. It’ll become way better sometime in the future.
I like this analogy, which is interesting because Google search is itself basically AI on steroids (or at least deployed at massive scale), no?

Relative to the bolded part, I'm also curious about the need for handoff between these LLMs, which are really good at detecting general context, and a specific agent trained to the task, then back to the LLM for your next task, and so on. Like, an LLM is good at knowing you need to file a car insurance claim for an accident and can gather some basics, but a dedicated/trained bot needs to be engaged to process the claim most efficiently. Not every step can/will be an LLM for some time, I think.
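Roughly the shape I have in mind, as a hand-wavy sketch (every name here is made up; call_llm stands in for whatever general-purpose model you'd actually call, and the claims bot is a plain, narrowly scoped function):

def call_llm(prompt: str) -> str:
    # Placeholder: imagine this hits a hosted LLM and returns its text output.
    if "accident" in prompt.lower() or "claim" in prompt.lower():
        return "auto_claim"
    return "general_question"

def claims_bot(message: str) -> str:
    # Dedicated, narrowly scoped agent: collects structured fields, nothing else.
    return "Claims intake started. Please provide policy number, date, and location."

def route(message: str) -> str:
    intent = call_llm(f"Classify the user's intent: {message}")
    if intent == "auto_claim":
        return claims_bot(message)   # hand off to the specialist bot
    # Otherwise stay with the general-purpose LLM for this turn.
    return call_llm(f"Answer the user helpfully: {message}")

print(route("I was in an accident and need to file a claim"))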

The VUX podcast from Andrei Papancea, CEO and Co-Founder of NLX, does a better job walking through this piece…
 

AlNipper49

Huge Member
Dope
SoSH Member
Apr 3, 2001
44,931
Mtigawi
I think that what we see now is pretty typical of the technology lifecycle curve. We have a basic model, let's call it an Ask Jeeves model. The only folks using it are those who absolutely need to because of business need (sinking ship), or startup types needing to differentiate a product and willing to take on early adopter risk. As real companies and entities *really* start to use it, I think you'll see a ton more go into training LLMs and the actual LLMs themselves. Right now they're really just a neural network on steroids. As you say, you'll have APIs/comms layers between LLMs, and for larger implementations you'll have specialized LLMs working on specific things. Eventually you'll get the Googles coming out and it'll evolve quickly beyond that, mostly because its development will be aided by AI itself.

I'm not really shitting on it; it's pretty disruptive for certain things already. If you're a 60-year-old programmer cutting and pasting COBOL from the same part of the brain every day into the same legacy system, then your job will pretty much be gone by the time you're ready to retire. If you're a copywriter who is good with words but doesn't do much beyond that, you're fucked. The prime downside is that it is a technology that lends itself more to monopolization than most things. There is very, very, very little near/mid-term risk of a SkyNet thing happening in any fashion.
 

pokey_reese

Member
SoSH Member
Jun 25, 2008
16,321
Boston, MA
I was skeptical at first, but count me among those who have started to be convinced that it is even a bit more disruptive than that, at least in the dev space. Anecdotal obviously, but I have basically 'outsourced' a number of projects to ChatGPT this year that would have previously been handled by a paid intern or entry level programmer, and gotten better results faster than I think I would have with a new human hire.

But that's admittedly with an experienced person driving the bus. What has really convinced me, significantly more than my own experience, is watching several people at my company who have lived and died by Excel, but never had any coding experience, use ChatGPT to help write automation scripts to do portions of their work more efficiently. The good thing is, they aren't replacing themselves; they're freeing up hours per week for other work. But realistically, without this technology we probably would have hired one or two more people this year than we did, as a company. Not saying it is good or bad, but it's definitely happening.
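For flavor, the scripts in question look something like this rough sketch (hypothetical folder, file, and column names): merge a folder of weekly spreadsheets into one summary instead of copy-pasting tabs by hand.

from pathlib import Path

import pandas as pd

frames = []
for path in Path("weekly_reports").glob("*.xlsx"):
    df = pd.read_excel(path)
    df["source_file"] = path.name   # keep track of where each row came from
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
summary = combined.groupby("region", as_index=False)["sales"].sum()
summary.to_excel("sales_summary.xlsx", index=False)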

As you note, I work at a startup, but it's at about 50 people, not 5, so this isn't just a bootstrapping thing.
 

sezwho

Member
SoSH Member
Jul 20, 2005
2,028
Isle of Plum
I’m echoing this as well. I have highly skilled peers working at much larger companies that also have chatgpt write drivers and implement smaller applications on occasion. It’s much faster than my friends, Needed some tuning (maybe even resulting from incomplete design instructions?) but compiled first time. Code writing code is not a place I thought we’d be right now, and I worked on several DARPA projects in years past where code generation was a goal (or significant component of project).
 

axx

Member
SoSH Member
Jul 16, 2005
8,141
All I can think about is the Practice rant when anyone mentions AI.

(You know who could use some Practice... the Patriots)
 

TomRicardo

rusty cohlebone
Lifetime Member
SoSH Member
Feb 6, 2006
20,721
Row 14
Training of AI is a bigger thing than actual AI. Right now the AI is trained by humans, which mostly takes the I out of AI.

AI right now is basically google on steroids. It’ll become way better sometime in the future.
I would say it is like a digital intern. I don't think it will get much better any time soon since people have never treated data hygiene as important.