How to raise an AI model in today's world

sezwho

Member
SoSH Member
Jul 20, 2005
2,028
Isle of Plum
Hello World, coming up on 20 years here, but I think this is my first thread, so be gentle : )

I work in a world where LLMs and what they're trained on are important, and a lead developer shared this reference: https://arxiv.org/abs/2306.13141

Some of you still closer to the metal (I lost my developer fastball years ago) may be better able to parse it, but what I've extracted is that the broader the sample you draw from (publicly available data), the more biased/hateful your AI becomes.

This was interesting enough on its own (to me?) that I thought some of you would be interested, but I was also curious if anyone could help suss out the why. Maybe it's an 'organic' process and people are just ass-hats? Maybe people are mezza mezza but become monsters when on the internet (leader in clubhouse : ). It's also got me thinking: are hate groups organizing to apply AI on the internet to produce hate content?
 

slamminsammya

Member
SoSH Member
Jul 31, 2006
9,511
San Francisco
I'd guess this is because the more often a model sees something, the more likely it is to "learn" it, which will be true regardless of the type or size of the model. Long tails are actually not so hard to learn if you have a sufficiently large number of examples, even when those represent a small portion of the training data. But I am not an LLM expert.
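To put a toy example behind that intuition (invented corpus and numbers, nothing to do with the paper's data): a simple count-based bigram model will learn a nasty phrase that makes up only 2% of the corpus, simply because the absolute count is high.

from collections import Counter, defaultdict

# Toy sketch: a count-based bigram "model" learns whatever it sees often
# enough in absolute terms, even if it's a tiny fraction of the corpus.
benign = ["the cat sat on the mat"] * 9_800
nasty = ["group x is awful"] * 200          # 2% of the corpus
corpus = benign + nasty

bigrams = defaultdict(Counter)
for sentence in corpus:
    toks = sentence.split()
    for a, b in zip(toks, toks[1:]):
        bigrams[a][b] += 1

# After 200 sightings of "group x", the model reliably continues the phrase:
print(bigrams["group"].most_common())  # [('x', 200)]
print(bigrams["x"].most_common())      # [('is', 200)]
print(bigrams["is"].most_common())     # [('awful', 200)]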
 

singaporesoxfan

Well-Known Member
Lifetime Member
SoSH Member
Jul 21, 2004
11,890
Washington, DC
We're used to thinking, in many contexts, that the broader a sample you draw from, the more representative the sample becomes of the population you're trying to sample. But I don't know if that underlying assumption is true of large training datasets from the Internet: it might be that the most prolific sources of the data are dominated by people who hold these biases.
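A quick simulation of that point (all rates invented): if a small group of prolific posters produces most of the text, drawing a bigger and bigger sample just converges to the post-weighted bias rate, not the population rate.

import random

random.seed(1)

population_bias_rate = 0.10      # 10% of people hold the biased view...
prolific_bias_rate = 0.30        # ...but 30% of the prolific posters do,
prolific_share_of_posts = 0.90   # and prolific posters write 90% of the posts.

def draw_post():
    from_prolific = random.random() < prolific_share_of_posts
    rate = prolific_bias_rate if from_prolific else population_bias_rate
    return random.random() < rate  # True = a biased post

for n in (1_000, 100_000, 1_000_000):
    biased = sum(draw_post() for _ in range(n))
    print(f"sample size {n:>9,}: biased share = {biased / n:.3f}")
# Converges to ~0.28 -- nowhere near the 0.10 population rate, and a bigger
# sample doesn't fix it.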
 

AlNipper49

Huge Member
Dope
SoSH Member
Apr 3, 2001
44,931
Mtigawi
Training of AI is a bigger thing than actual AI. Right now the AI is trained by humans, which mostly takes the I out of AI.

AI right now is basically google on steroids. It’ll become way better sometime in the future.
 

pokey_reese

Member
SoSH Member
Jun 25, 2008
16,321
Boston, MA
Having finally had time to read through the paper, there are a lot of issues that I would take with both their methodology and conclusions, to be honest. It's not that I think there is no validity to their thesis (though I did, off the bat, have at least some skepticism there), but given that it is based on a sample-of-a-sample-of-a-sample, and that their sampling isn't bootstrapped with replacement, it feels like a lot of room is left for a repeated attempt to fail to reproduce their results even on a basic level.
 

sezwho

Member
SoSH Member
Jul 20, 2005
2,028
Isle of Plum
Thanks to those who took the time to reply.

Ultimately, I suppose some amount of GIGO (garbage in, garbage out) is going to apply, just faster with AI.

@pokey_reese I’ll do a little googling and try to bootstrap my own knowledge a bit : ) to better understand the implications of your observations on how they sampled.
 

pokey_reese

Member
SoSH Member
Jun 25, 2008
16,321
Boston, MA
So, to expand a little on my thoughts in no particular order, since you at-mentioned me (and thus asked for more words, no matter how much you may regret it):

- Generalizability is key to most of these LLMs, and more importantly it is a specific function of the training process regardless of the flavor or topic. What that means is that, with few exceptions, adding more observations of a minority class (non-dominant data points in terms of whatever you are trying to predict) will not cause the model to suddenly start predicting those rare things more often (if you are doing it right, at least), unless they were underrepresented in your original training set.

- As far as the bootstrapping thing I mentioned, this is related to the fact that most modeling is based on samples of data, and what gets pulled into your sample can bias the outcome substantially. A note here: this concept of sampling is important and easy to misunderstand in this context, because they are dealing with text-image pairs, which have more variance between observations, and a wider range of possible outcomes, than most models see. All of which is to say, if you just pick random data points you have a better chance of ending up with a sample that isn't representative of the observations you didn't select. A common solution is to instead take many repeated samples and aggregate the estimates, also known as bootstrapping, in order to reduce sample bias and generalize better to out-of-sample data points (there's a toy sketch at the end of this post).

- This is important because they are actually evaluating multiple things here, one of which is the prevalence of 'hateful/targeted/aggressive' text strings in their sample, which is going to be, by nature, subject to a great deal of random variation (without even digging into any possible issues with the method that they are using to label those). They do acknowledge the potential issue of sampling per file, and address it in section 8, but simply establish that the per-file metrics match those of the overall distribution, not that their sampling method from those files avoids bias.

- Moreover, they don't actually seem to be saying that scaling itself inherently leads to this particular phenomenon, but simply stating an observation about these two particular data sets (LAION-400M and LAION-2B-en), which are specific samples from the CommonCrawl data set. It isn't clear whether the effect they are studying is caused by scaling or by some factor unique to this one public data set, which they also note early on isn't being used to train the most high-profile AI models (which tend to have private, curated training sets).

There is a lot more to dig into here; this really only covers up through the 'data set audit' without getting into the 'model audit.' It still gives us a lot to think about, but for me personally it gives off vibes of having come up with a hypothesis and then worked to support it, rather than a neutral observation and inquiry. I'm not a blanket AI apologist, and I think issues of bias and racism in machine learning are real and serious, but this particular paper just doesn't quite get there for me, at least relative to the title/abstract.
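For the bootstrapping point above, here's the toy sketch I promised (invented labels and prevalence, not the paper's actual pipeline or thresholds): resample one labeled chunk with replacement many times and see how much the 'hateful' prevalence estimate moves around.

import random

random.seed(2)

# Pretend these are labels for one sampled chunk: 1 = flagged as hateful.
sample = [1] * 30 + [0] * 9_970   # 0.3% prevalence in this particular chunk

def bootstrap_prevalence(data, n_resamples=1_000):
    estimates = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))  # sample WITH replacement
        estimates.append(sum(resample) / len(resample))
    return sorted(estimates)

est = bootstrap_prevalence(sample)
lo, hi = est[25], est[975]        # rough 95% interval from 1,000 resamples
print(f"point estimate:          {sum(sample) / len(sample):.4f}")
print(f"~95% bootstrap interval: [{lo:.4f}, {hi:.4f}]")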
 

sezwho

Member
SoSH Member
Jul 20, 2005
2,028
Isle of Plum
Thank you for taking the deeper dive: I regret nothing! I've read it a couple of times and it's starting to make sense.


Training of AI is a bigger thing than actual AI. Right now the AI is trained by humans, which mostly takes the I out of AI.

AI right now is basically google on steroids. It’ll become way better sometime in the future.
I like this analogy, which is interesting because Google search is itself basically AI on steroids (or at least deployed at massive scale), no?

Relative to the bolded part, I'm also curious about the need for handoff between these LLMs, which are really good at detecting general context, and a specific agent trained to the task, then back to the LLM for your next task, and so on. Like, an LLM is good at knowing you need to file a car insurance claim for an accident and can gather some basics, but a dedicated/trained bot needs to be engaged to process the claim most efficiently. Not every step can/will be an LLM for some time, I think.
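Roughly the shape I have in mind, as a hand-wavy sketch (every name here is made up; call_llm stands in for whatever general-purpose model you'd actually call, and the claims bot is a plain, narrowly scoped function):

def call_llm(prompt: str) -> str:
    # Placeholder: imagine this hits a hosted LLM and returns its text output.
    if "accident" in prompt.lower() or "claim" in prompt.lower():
        return "auto_claim"
    return "general_question"

def claims_bot(message: str) -> str:
    # Dedicated, narrowly scoped agent: collects structured fields, nothing else.
    return "Claims intake started. Please provide policy number, date, and location."

def route(message: str) -> str:
    intent = call_llm(f"Classify the user's intent: {message}")
    if intent == "auto_claim":
        return claims_bot(message)   # hand off to the specialist bot
    # Otherwise stay with the general-purpose LLM for this turn.
    return call_llm(f"Answer the user helpfully: {message}")

print(route("I was in an accident and need to file a claim"))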

The VUX podcast from Andrei Papancea, CEO and Co-Founder of NLX, does a better job walking through this piece…
 

AlNipper49

Huge Member
Dope
SoSH Member
Apr 3, 2001
44,931
Mtigawi
I think that what we see now is pretty typical of the technology lifecycle curve. We have a basic model, let's call it an Ask Jeeves model. The only folks using it are those who absolutely need to because of business need (sinking ship), or startup types needing to differentiate a product and willing to take on early adopter risk. As real companies and entities *really* start to use it, I think you'll see a ton more go into training LLMs and the actual LLMs themselves. Right now they're really just a neural network on steroids. As you say, you'll have APIs/comms layers between LLMs, and for larger implementations you'll have specialized LLMs working on specific things. Eventually you'll get the Googles coming out and it'll evolve quickly beyond that, mostly because its development will be aided by AI itself.

I'm not really shitting on it; it's pretty disruptive for certain things already. If you're a 60-year-old programmer cutting and pasting COBOL from the same part of the brain every day into the same legacy system, then your job will pretty much be gone by the time you're ready to retire. If you're a copywriter who is good with words but doesn't do much beyond that, you're fucked. The prime downside is that it is a technology that lends itself more to monopolization than most things. There is very, very, very little near/mid-term risk of a SkyNet thing happening in any fashion.
 

pokey_reese

Member
SoSH Member
Jun 25, 2008
16,321
Boston, MA
I was skeptical at first, but count me among those who have started to be convinced that it is even a bit more disruptive than that, at least in the dev space. Anecdotal obviously, but I have basically 'outsourced' a number of projects to ChatGPT this year that would have previously been handled by a paid intern or entry level programmer, and gotten better results faster than I think I would have with a new human hire.

But that's admittedly with an experienced person driving the bus. What has really convinced me, significantly more than my own experience, is watching several people at my company who have lived and died by Excel, but never had any coding experience, use ChatGPT to help write automation scripts to do portions of their work more efficiently. The good thing is, they aren't replacing themselves; they're freeing up hours per week for other work. But realistically, without this technology we probably would have hired one or two more people this year than we did, as a company. Not saying it is good or bad, but it's definitely happening.
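For flavor, the scripts in question look something like this rough sketch (hypothetical folder, file, and column names): merge a folder of weekly spreadsheets into one summary instead of copy-pasting tabs by hand.

from pathlib import Path

import pandas as pd

frames = []
for path in Path("weekly_reports").glob("*.xlsx"):
    df = pd.read_excel(path)
    df["source_file"] = path.name   # keep track of where each row came from
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
summary = combined.groupby("region", as_index=False)["sales"].sum()
summary.to_excel("sales_summary.xlsx", index=False)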

As you note, I work at a startup, but it's at about 50 people, not 5, so this isn't just a bootstrapping thing.
 

sezwho

Member
SoSH Member
Jul 20, 2005
2,028
Isle of Plum
I’m echoing this as well. I have highly skilled peers working at much larger companies that also have chatgpt write drivers and implement smaller applications on occasion. It’s much faster than my friends, Needed some tuning (maybe even resulting from incomplete design instructions?) but compiled first time. Code writing code is not a place I thought we’d be right now, and I worked on several DARPA projects in years past where code generation was a goal (or significant component of project).
 

axx

Member
SoSH Member
Jul 16, 2005
8,141
All I can think about is the Practice rant when anyone mentions AI.

(You know who could use some Practice... the Patriots)
 

TomRicardo

rusty cohlebone
Lifetime Member
SoSH Member
Feb 6, 2006
20,721
Row 14
Training of AI is a bigger thing than actual AI. Right now the AI is trained by humans, which mostly takes the I out of AI.

AI right now is basically google on steroids. It’ll become way better sometime in the future.
I would say it is like a digital intern. I don't think it will get much better any time soon since people have never treated data hygiene as important.