Yo! You're not logged in. Why am I seeing this ad?
Building a database for a website
#1
Posted 20 December 2012 - 04:00 PM
How complex is the database that underpins Baseball-Reference (or the like)?
Is it difficult to build a similar type of database that allows users to search across items such as players, teams, seasons etc and offers pre-canned reports (e.g. "OPS+ leaders")?
What if you wanted to offer permissioned users access to add data to this database either via a browser or a mobile device application?
How involved a project is this?
What are my considerations for hosting data?
Finally, were I to outsource this project to someone with expertise, what should I be aware of in terms of background etc?
Note that I simply used BRef because everyone here is familiar with it and I would be looking to present users with data in a similar format. I am trying to recreate their "wheel".
Any thoughts/help are greatly appreciated.
Thank You,
DBMH
#2
Posted 20 December 2012 - 04:57 PM
Unless you had some odd/complex features you aren't mentioning, this is easy and pretty close to what most web developers do every day. Hosting might be difficult depending on number of users and some other factors, but that's mostly a "throw money at it" problem.
#3
Posted 20 December 2012 - 06:08 PM
The database isn't particularly complex; it's mostly precomputed data. (A database closer to their game log would be trickier, both in terms of size and the analytics you'd need to do to get meaningful stuff.) It isn't a small database, though. An application would similarly not be a huge thing to do; it'd just be a (fairly) thin interface to a bunch of read-only SQL queries.
Unless you had some odd/complex features you aren't mentioning, this is easy and pretty close to what most web developers do every day. Hosting might be difficult depending on number of users and some other factors, but that's mostly a "throw money at it" problem.
Very helpful, thank you.
#4
Posted 20 December 2012 - 06:18 PM
#5
Posted 21 December 2012 - 01:57 PM
One more question. How much would something like this, plus a mobile app to access it, typically cost to both build and then maintain? A rough estimate is greatly appreciated.
Thanks again.
#6
Posted 21 December 2012 - 02:45 PM
Baseball Reference also gets a data feed from MLBAM that is typically only available to teams. They've also bought data from BIS in the past, I believe.
Also: you are totally insane and I would not do this. But, more power to you! =)
#7
Posted 21 December 2012 - 03:01 PM
Presuming you actually want to recreate Baseball Reference, and not just a website like Baseball Reference in some other domain...
Baseball Reference also gets a data feed from MLBAM that is typically only available to teams. They've also bought data from BIS in the past, I believe.
Also: you are totally insane and I would not do this. But, more power to you! =)
To be clear, I do not want to recreate BRef at all. I have an idea that I am exploring that entails collecting data that is similar in nature to the data BRef (and the like) collects and displays. I don't see any value proposition whatsoever in competing with what they already do exceptionally well.
#8
Posted 21 December 2012 - 03:49 PM
This is not to say that B-Ref didn't add a whole bunch of things and aggregate and collect a lot of additional data, but just to say that a bunch of the work had already been done for them, which simplified the equation quite a bit.
#9
Posted 21 December 2012 - 09:39 PM
Thank you Otto.
One more question. How much would something like this, plus a mobile app to access it, typically cost to both build and then maintain? A rough estimate is greatly appreciated.
Thanks again.
A little if you don't care about it much, a lot if you care about it and want it to work.
#10
Posted 21 December 2012 - 10:34 PM
I do this stuff professionally. If you want an estimate, feel free to PM.
#11
Posted 22 December 2012 - 05:44 PM
Look at SOLR or possibly ElasticSearch, they are free but not quite so simple.
If you're very enterprising ElasticSearch acts as data storage and a search engine so you could skip the database entirely.
#12
Posted 22 December 2012 - 06:50 PM
Databases are simple, free, and do not scale with increasing traffic. What you need is a search engine.
Look at SOLR or possibly ElasticSearch, they are free but not quite so simple.
1. Databases range from simple (structured flat files or csv stores or simple AVP stores like BSD-DB) to complex (Enterprise SQL RDBMSs like Oracle, NoSQL databases like MongoDB, or niche target systems like Lexis Nexis' HPCC
2. Database systems range from free to very expensive.
3. Many databases are fully distributed and can scale to meet large traffic demands.
4. Most search engines, and all efficient search engines, are backed by a database; you really can't have one without the other. (e.g. SOLR and ElasticSearch both have fairly complex backing databases with Lucene-generated inverted indexes, which are very similar to the B-trees that Postgres or MySql indexing uses, proximity data stores, term-value attribute-value stores, etc.). Full-text search engines are essentially very specialized front ends to their backing databases
5. Most search engines, including the two you mention, are much simpler than even relatively unsophisticated SQL RDBMSs like mysql--there's a whole lot that goes into making, say, an ACID-compliant data store that you don't need for a search-only system. And the SQL standard itself is relatively cumbersome.
This project will almost certainly need some sort of search engine (that's almost a given for any sizable web site these days), but the data store needs are obviously more than just that.
#13
Posted 26 December 2012 - 09:31 AM
#14
Posted 27 December 2012 - 06:25 AM
I assumed the poster was talking about a standard Relational Database which most people associated with a database. Nobody is going to run a website using CSV files, so why even bring it up? MySQL is free and relatively simple to set up. Mongo, Redis, Couch, are also free NoSQL DBs. There is no reason to pay for an expensive DB system, there are plenty of reliable free alternatives.1. Databases range from simple (structured flat files or csv stores or simple AVP stores like BSD-DB) to complex (Enterprise SQL RDBMSs like Oracle, NoSQL databases like MongoDB, or niche target systems like Lexis Nexis' HPCC
2. Database systems range from free to very expensive.
Not for something the scale of Baseball Reference which probably includes a lot of full text searches. These DO NOT cost effectively scale for these types of searches. One SOLR node could handle the traffic of a cluster of slaves handling similar queries with greater speed and accuracy.3. Many databases are fully distributed and can scale to meet large traffic demands.
Not true, ElasticSearch does NOT require a back end database. Graylog2 replaced MongoDB with ElasticSearch for it's storage. I use Graylog2 and I haven't had any problems with data loss. Granted they are only log files and if they got lost it wouldn't be a big deal. SOLR 4 also doesn't require an external data source. SOLR Cloud maintains it's own transaction log and can rebuild the indexes from that. Is it trustworthy? Pay someone to find out, but it's possible.4. Most search engines, and all efficient search engines, are backed by a database; you really can't have one without the other. (e.g. SOLR and ElasticSearch both have fairly complex backing databases with Lucene-generated inverted indexes, which are very similar to the B-trees that Postgres or MySql indexing uses, proximity data stores, term-value attribute-value stores, etc.). Full-text search engines are essentially very specialized front ends to their backing databases
I'm glad you agree. In my opinion, a site like the poster was inquiring about should be heavily driven by search. Data storage is fungible, that is why there are so many different types for different needs. I could do an analysis and choose the best type, but I'd want to be paid for it.5. Most search engines, including the two you mention, are much simpler than even relatively unsophisticated SQL RDBMSs like mysql--there's a whole lot that goes into making, say, an ACID-compliant data store that you don't need for a search-only system. And the SQL standard itself is relatively cumbersome.
This project will almost certainly need some sort of search engine (that's almost a given for any sizable web site these days), but the data store needs are obviously more than just that.
#15
Posted 27 December 2012 - 06:48 AM
Quick, cost-effective and designed well. Pick two.
#16
Posted 27 December 2012 - 12:21 PM
Again, to be clear, I am not considering recreating B-Ref in any way, shape or form, however that is a very close comp to what I'd like to build. Furthermore, my plan would be to cover more than one dataset or vertical like they do (e.g. baseball, basketball, pro football, basketball etc) and it would include similar types of data. To do it right, I would almost certainly need to have it set up to scale to the amount of data they have captured (if not larger).
I appreciate the color here. I am still in the conceptual stage with this project and want to make sure that if I move ahead, I am doing it the right way and that I account for the cost of doing so beforehand.
Edited by DeJesus Built My Hotrod, 27 December 2012 - 12:22 PM.
#17
Posted 29 December 2012 - 06:13 PM
I assumed the poster was talking about a standard Relational Database which most people associated with a database.
"Standard Relational Database" is a bad assumption--plenty of web sites use other database systems for very good reasons, and getting someone thinking that narrowly right at the outset isn't a good idea.
Nobody is going to run a website using CSV files, so why even bring it up? MySQL is free and relatively simple to set up. Mongo, Redis, Couch, are also free NoSQL DBs. There is no reason to pay for an expensive DB system, there are plenty of reliable free alternatives.
I even mentioned Mongo and Mysql in my post.
Your post was full of nonsensical and false assertions, which I was attempting to address with specific examples--when someone comes asking for advice about a topic that they're not sure about, the worst thing you can do is start spewing out nonsense along the lines of "databases are simple", "databases don't scale", "databases are free", or "SOLR doesn't use a database". Yet you not only did so, but came back and repeated a lot of the same nonsense when you were corrected on it.
It's important to get this right and to be precise in language, especially when talking to someone who's looking for advice because he's not an expert in the field. On a site of this type, data is almost everything--if you screw up the code, that's usually relatively easy to fix in small chunks and steps. If you screw up the way that data is stored and accessed, you're often looking at a ground-up rewrite (or close to it) to fix it.
Not for something the scale of Baseball Reference which probably includes a lot of full text searches. These DO NOT cost effectively scale for these types of searches. One SOLR node could handle the traffic of a cluster of slaves handling similar queries with greater speed and accuracy.
This is dizzying--I have no idea what the point is here. You say in one breath that databases can't scale and in the next mention a great example of one that does.
Not true, ElasticSearch does NOT require a back end database.
This is flat-out incorrect, as I pointed out in my initial post; like SOLR, it uses Lucene to maintain a fairly extensive database including term data, reverse indices, proximity information, frequency data, etc.
With the exception of full-text scanning ("Let's run grep on the entire data set!"), which is mind-boggling slow on large data sets, all text searching engines use a database. Nearly all of them using at least a term index and usually a stem database, on top of whatever other bells and whistles they supply.
I'm glad you agree. In my opinion, a site like the poster was inquiring about should be heavily driven by search. Data storage is fungible, that is why there are so many different types for different needs. I could do an analysis and choose the best type, but I'd want to be paid for it.
Data storage in the sense we're discussing is crucial to everything; it's as far from fungible as it gets. Getting it right is probably the single most important technological decision he'll make. As Linus Torvalds says, "Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
Which is nothing new, Fred Brooks said essentially the same thing 40 years ago when he wrote "Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious."
When data gets bigger and pushes things out of memory and into larger on-disk formats, the way you store data only gets more important--not less.
The best thing he can do is to sit down and sketch out some quick blueprints of the various pages on the site, with each different sort of functionality highlighted. Start enumerating all the kinds of things you can do: "Search by player name", "Sort tables by column", "Aggregate columns by clicking", "Do canned reports that sum x, y, or z over multiple years/months/whatever", "find averages of a,b, and c over multiple months", "find splits by opposing pitcher/ballpark/handedness/whatever", "allow arbitrary nested splits--e.g. as a righty in night games before the all-star break--on individual player pages", "allow searching/sorting of arbitrary splits over the whole dataset--e.g. who's the best batter of all time from the right side of the plate in Fenway in July-August".
That gets you halfway to putting together a functional specification. From there you start looking at how efficient various operations are if you store the data in different ways: "Oh, if I want to be able to calculate OPS in night games vs. lefties then I need to store data in this more granular manner to cook it at query time--which is okay for one player but prohibitively slow for generating a 'top batters by OPS at night vs lefties in MLB History' report. Hmm, how to address that?"
From there you fight to find a data store that's reasonably efficient at all your functional cases--possibly limiting functional cases some if you need to. Once that's done, you transform the data into the required format(s) and then you can start thinking about the UI side of things (which hopefully was addressed generally by the blueprints).
Get a well-designed UI and a well-engineered data store and the rest of it should fall out pretty organically, technologically.
#18
Posted 30 December 2012 - 09:31 AM
The best thing he can do is to sit down and sketch out some quick blueprints of the various pages on the site, with each different sort of functionality highlighted. Start enumerating all the kinds of things you can do: "Search by player name", "Sort tables by column", "Aggregate columns by clicking", "Do canned reports that sum x, y, or z over multiple years/months/whatever", "find averages of a,b, and c over multiple months", "find splits by opposing pitcher/ballpark/handedness/whatever", "allow arbitrary nested splits--e.g. as a righty in night games before the all-star break--on individual player pages", "allow searching/sorting of arbitrary splits over the whole dataset--e.g. who's the best batter of all time from the right side of the plate in Fenway in July-August".
That gets you halfway to putting together a functional specification. From there you start looking at how efficient various operations are if you store the data in different ways: "Oh, if I want to be able to calculate OPS in night games vs. lefties then I need to store data in this more granular manner to cook it at query time--which is okay for one player but prohibitively slow for generating a 'top batters by OPS at night vs lefties in MLB History' report. Hmm, how to address that?"
From there you fight to find a data store that's reasonably efficient at all your functional cases--possibly limiting functional cases some if you need to. Once that's done, you transform the data into the required format(s) and then you can start thinking about the UI side of things (which hopefully was addressed generally by the blueprints).
Get a well-designed UI and a well-engineered data store and the rest of it should fall out pretty organically, technologically.
This is solid advice, please let me second. The technology choices should be easier to rationalize in the context of some wire frames and story boards.
#19
Posted 30 December 2012 - 01:15 PM
This is solid advice, please let me second. The technology choices should be easier to rationalize in the context of some wire frames and story boards.
This is great. Thank you guys.
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users












