If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

Main content

Storing data sets

The digital world is filled with data. Social media apps collect data about our posts and our likes. Online stores collect data about the products we view. Advertising agencies collect data about what we click.
How is all this data stored? It may start off in a computer program, where variables store the data in memory, but it eventually needs to be stored in a persistent storage format. The data needs to be accessible after the program stops running and ideally, easy to analyze.
Let's look at a few options for storing data, from simple text files to sophisticated databases.

Text files

Ever played a game on a computer? Many computer games keep track of your high scores to help you measure your personal progress.
A game can store your high scores in a text file on the computer.
The file could be as simple as this "highscores.txt":
73
82
83
99
What if the game also wants to store the date of the high score and the level achieved? Each line in the file would need to store multiple pieces of related information. We call this "tabular" data, since each line is like a row of a table and each row has multiple columns.
A common format for storing tabular data in text files is comma-separated values (CSV).
This is what might be stored in "highscores.csv":
date,level,score
01/11/2019,9,73
02/13/2019,10,82
02/14/2019,10,83
03/11/2019,11,99
The first line in the file declares the columns "date", "level", and "score". The subsequent lines contain the actual rows of data, with the date first, then the level, then the score, all separated by commas.
CSV files are a common format, so there are many tools that can read and write CSV files. Some of those tools are user-facing, like spreadsheets applications. For developers, there are libraries in programming languages to read and write CSV files from their applications.
There are definite drawbacks to CSV files, however:
  • It's a single file, so its size is limited by the computer's hard drive space and by the amount of time it takes to open a large file.
  • It doesn't come with any built-in tools for querying the data, like sorting by a column or calculating the top value of a column. To analyze the data, we need to either open the CSV in a spreadsheets application or write programs that iterate through the data.

Spreadsheets

A spreadsheet application is a tool for storing, organizing, and analyzing data. Spreadsheets applications can typically open a variety of popular text file formats (like CSV, TSV, and XLS) and save the data back into those formats.
Here's "highscores.csv" imported into Google spreadsheets:
Screenshot of Google spreadsheets interface, showing this data:
datelevelscore
01/11/2019973
02/13/20191082
02/14/20191083
03/11/20191199
A wide range of people use spreadsheets, since they can analyze data without needing any programming skills.
However, applications do not use spreadsheets as their data storage mechanism, since a spreadsheet application is a tool for people, not for programs.

Databases

Most applications store data in a database, a system that stores data on a computer in a way that can be easily accessed, updated, queried, and deleted.
Behind the scenes, a database also stores the data in files. However, the database management system takes care of all of the details for us, like splitting the data into appropriately sized files and remembering what data is stored in each file.
To interact with the database, as programmers or data analysts, we often use a query language. The most popular query language is SQL (Structured Query Language).
We'll introduce some SQL here to give you a feel for how to store and analyze data in a database. If you'd like to be able to write SQL yourself, you can learn how in our Intro to SQL course here on Khan Academy.
The following SQL example creates a database table to store high scores. The left side contains the SQL, while the right side displays the database schema and all the rows in the table:
As you can see, a query language is similar to a programming language. We make things happen by calling commands, like CREATE, INSERT, and SELECT, and we customize those commands with additional details about what we want to create, insert, and select.
A database typically contains multiple tables, to store different types of collected data.
For example, an online game needs to store a table for its user profiles and a table for the high scores across all the users.
This SQL example creates two tables, one for users and one for high scores, and displays the rows in each of them:
Now that the database is storing collected data in tables, we can analyze the data to come up with useful insights. That's what we'll dive into next.

🙋🏽🙋🏻‍♀️🙋🏿‍♂️Do you have any questions about this topic? We'd love to answer—just ask in the questions area below!

Want to join the conversation?