Main content
AP®︎/College Computer Science Principles
Storing data sets
The digital world is filled with data. Social media apps collect data about our posts and our likes. Online stores collect data about the products we view. Advertising agencies collect data about what we click.
How is all this data stored? It may start off in a computer program, where variables store the data in memory, but it eventually needs to be stored in a persistent storage format. The data needs to be accessible after the program stops running and ideally, easy to analyze.
Let's look at a few options for storing data, from simple text files to sophisticated databases.
Text files
Ever played a game on a computer? Many computer games keep track of your high scores to help you measure your personal progress.
A game can store your high scores in a text file on the computer.
The file could be as simple as this "highscores.txt":
73
82
83
99
What if the game also wants to store the date of the high score and the level achieved? Each line in the file would need to store multiple pieces of related information. We call this "tabular" data, since each line is like a row of a table and each row has multiple columns.
A common format for storing tabular data in text files is comma-separated values (CSV).
This is what might be stored in "highscores.csv":
date,level,score
01/11/2019,9,73
02/13/2019,10,82
02/14/2019,10,83
03/11/2019,11,99
The first line in the file declares the columns "date", "level", and "score". The subsequent lines contain the actual rows of data, with the date first, then the level, then the score, all separated by commas.
CSV files are a common format, so there are many tools that can read and write CSV files. Some of those tools are user-facing, like spreadsheets applications. For developers, there are libraries in programming languages to read and write CSV files from their applications.
There are definite drawbacks to CSV files, however:
- It's a single file, so its size is limited by the computer's hard drive space and by the amount of time it takes to open a large file.
- It doesn't come with any built-in tools for querying the data, like sorting by a column or calculating the top value of a column. To analyze the data, we need to either open the CSV in a spreadsheets application or write programs that iterate through the data.
Spreadsheets
A spreadsheet application is a tool for storing, organizing, and analyzing data. Spreadsheets applications can typically open a variety of popular text file formats (like CSV, TSV, and XLS) and save the data back into those formats.
Here's "highscores.csv" imported into Google spreadsheets:
A wide range of people use spreadsheets, since they can analyze data without needing any programming skills.
However, applications do not use spreadsheets as their data storage mechanism, since a spreadsheet application is a tool for people, not for programs.
Databases
Most applications store data in a database, a system that stores data on a computer in a way that can be easily accessed, updated, queried, and deleted.
Behind the scenes, a database also stores the data in files. However, the database management system takes care of all of the details for us, like splitting the data into appropriately sized files and remembering what data is stored in each file.
To interact with the database, as programmers or data analysts, we often use a query language. The most popular query language is SQL (Structured Query Language).
We'll introduce some SQL here to give you a feel for how to store and analyze data in a database. If you'd like to be able to write SQL yourself, you can learn how in our Intro to SQL course here on Khan Academy.
The following SQL example creates a database table to store high scores. The left side contains the SQL, while the right side displays the database schema and all the rows in the table:
As you can see, a query language is similar to a programming language. We make things happen by calling commands, like
CREATE
, INSERT
, and SELECT
, and we customize those commands with additional details about what we want to create, insert, and select.A database typically contains multiple tables, to store different types of collected data.
For example, an online game needs to store a table for its user profiles and a table for the high scores across all the users.
This SQL example creates two tables, one for users and one for high scores, and displays the rows in each of them:
Now that the database is storing collected data in tables, we can analyze the data to come up with useful insights. That's what we'll dive into next.
🙋🏽🙋🏻♀️🙋🏿♂️Do you have any questions about this topic? We'd love to answer—just ask in the questions area below!
Want to join the conversation?
- Are the text files stored on personal computers or the corporation's computers (or their servers?)?(2 votes)
- It depends on the situation. A game you have downloaded on your computer that runs entirely offline will keep the files (including text files) directly on your computer. However, a game that you have an account for and play online will probably have files stored on your computer as well as on corporate servers.(6 votes)
- Dawg help i've been hacked(3 votes)
- why type Select in the end of the schedule(2 votes)
- The
SELECT * from [table]
will get [table] and output all rows and columns to the screen for you to see(3 votes)
- For the CSP exam will I need to have learnt SQL in more details?(0 votes)
- From the author:No, the CSP exam does not include SQL questions. However, there is no CSP multiple-choice exam this year anyways, due to COVID-19. I will be releasing an updated version of this course in June which students can use in the 2019-2020 school year.(6 votes)
- What kind of data structure is used to store tables in a database program? Is it simply a 2d array, or are they doing something more sophisticated?(1 vote)
- Jesus loves u all <3(1 vote)
- You said database also stores data in text files, how these text files look like? Is the format like csv files?(0 votes)
- csv is a format used to store information in textfiles, so exactly like that. But it can also look differently. it mostly depends on how the programmers implemented data storage. The csv format has the advantage that you can expect other programs to use it as well.(3 votes)
- So if my data set has 445 'rows' and 5 columns and one 'column' has 5 different values, how can I separate those values to calculate their values against the other vars?
I am using SAS Studio. I have cleaned my data to where I only have the Branch, AVGLoanAmt, AVGPrice, MEDCreditScore, so I can take each Branch and calc these values (AVGs, COUNT, MED) per branch. I think I am overthinking this by a long shot.(0 votes) - Why are some "texts" and why are some "integer"?
When do you use text and integer?(0 votes)- Basically speaking, an integer is a number.
Going more in depth, an integer is a variable that is stored as a whole number, and can be used in mathematical calculations by the computer. The level and score of the player are integers because they are stored simply as numbers.
The reason that the date is "text" instead of an "integer" is because the variable that represents the date (e.g. 1/27/2022) isn't stored as a simple number. The computer can't do anything to the date using math directly. It's text because the only way the date could be used directly by the computer is in messages (e.g. "The date is 1/27/2022.")(1 vote)