Getting Started With SQL For Data Science

Structured Query Language (SQL) is an invaluable skill in data science. But, unfortunately, most new data scientists view it as a second set of skills and don’t realize it should be part of their current skill-set. If you learned Python, Java, or R, learning SQL is a piece of cake!

There are specific quirks to this language. For example, SQL queries must be structured correctly to ensure they are performant and fit your data needs like a glove.

Over time, you will master the do’s and don’t of using SQL efficiently in your work. But first, you must understand the basics.

Understanding SQL As A Data Scientist

Without basic knowledge of SQL and how it connects data, many data analysis tasks will not produce the expected outcome. In job markets today, knowledge of SQL is one of the most prevalent skills requested by programming and data science head hunters. So I recommend getting a head start on learning and mastering SQL.

Check out this survey from StackOverflow which ranked SQL as the 4th most popular scripting language after JavaScript, HTML/CSS and Python.

https://insights.stackoverflow.com/survey/2021

Short SQL syntaxes are easy to understand and remember. But, unfortunately, the long, complicated ones are not! Thankfully, as a data scientist, you don’t need to learn every complex SQL query; you can research and use them whenever you need to.

The good news is that SQL language is structured like English. So most people who can speak and read basic level English can write SQL queries quickly. Below are some of the reasons to learn this query language:

  • It’s is easy to learn, even for data science beginners. SQL is straightforward, and you can be writing queries on your very first day of learning. Check out this SQL and MongoDB tutorial from W3Schools (https://www.w3schools.com/python/python_classes.asp)
  • SQL is the standard language for interacting with relational databases. It is hybrid in nature and is used across multiple database structures—for example, MySQL and PostgreSQL. 
  • It’s a valuable addition to any programming language; Many skilled developers prefer using SQL queries directly in their code to pull data more efficiently than a block of code.

7 Top SQL Queries Every Beginner Should Know!

I’ll do a quick dive into some essential queries to use in your data science expedition. But first, let’s get you hooked up with some real-world sample data to load into your database:

Data.gov – Free collection of over 200,000 data sets from the US Government.

FiveThirtyEight – An ABC News’ site with articles, ratings, and essays.

Kaggle – My personal favorite is Kaggle of course. Check out my Kaggle profile here!

Some Useful SQL Queries You Should Know

SELECT

SELECT is a very powerful SQL statement and the most used! The SELECT statement pulls data from a database, and the result is stored in a table known as the result-set.

SELECT FullName, Country FROM Users;

The above returns the FullName and Country from a Users table.

UPDATE

The UPDATE statement modifies existing records in a table and updates them with the new values.

UPDATE Users
SET FullName = 'Cassy Eguakun', City= 'Wolverhampton'
WHERE UserID = 3

The WHERE condition in this query chooses a single unique data row from the UserID column. UPDATE queries are typically explicit, so the UserID is a requirement in the query.

SET is used here to modify the data. The rows to be updated are assigned new values with the SET command.

DELETE

The DELETE query deletes existing data in a table. It uses the WHERE clause to choose the exact record for deletion. If WHERE is omitted, the command will delete all records in the table!

DELETE FROM Users 
WHERE FullName='Cassy Eguakun’; 

INSERT

The INSERT INTO command adds new records in a table.

INSERT INTO Users (FullName, PhoneNumber, Country, Sex)
VALUES ('Richard Aines', '07911111111', 'United Kingdom', 'Male');

You might be wondering what the difference is between UPDATE and INSERT. Like I mentioned earlier, SQL queries are like the English language, and the meanings are no different from spoken English.

UPDATE updates a record with some new data. In contrast, INSERT is used to add new records to the table. With an INSERT query, you’ll increase the size of your table while UPDATE changes values in the table.

CREATE

The CREATE TABLE statement creates new tables in a database. For example, let’s say I want to add a table for blog posts:

CREATE TABLE Blogs(
    PostID int,
    Title varchar(255),
    Category varchar(255),
    ShortDescription varchar(255),
    LongDescription varchar(255) 
 );

ALTER

The ALTER TABLE is a useful SQL query that adds, deletes, and modifies columns in a table. You can also use it to add or remove constraints from an existing table.

ALTER TABLE Users
ADD Email varchar(255);

SQL constraints assign rules to a table and limit the type of data added to a table. Constraints are known as SQL enforcers! For example, you can change a table from Null to Not Null and so on.

DROP

SQL DROP command is used to drop a table from the database. Be careful with this query, as it can completely remove the table structure and data.

DROP DATABASE Blogs;

JOIN

A JOIN command merges records from two tables. It connects corresponding values from two tables and links them up! The beauty of JOIN statements is that you can grab data from two different tables and still leave your database normalized.

SELECT Users.UserID, Blog.PostID, Users.FullName
FROM Users
INNER JOIN Users ON Blog.AuthorID=Users.UserID;

The query will pull all user’s posts where the UserID corresponds with the AuthorID in the Blog table.

Conclusion – SQL is not dead!

SQL is not dead! On the contrary, it is one of the essential skills needed in data science, machine learning, and artificial intelligence. After all, you will be dealing with data every day. So your ability to retrieve and manipulate the data with flair is what sets you apart from your peers.

To become proficient in SQL, you must learn the basics and then move on to mastering what could go wrong! You must understand the issues, or you’ll fall flat on your face or, even worse, lose valuable data due to inexperience.

Below are some of the issues I’ll dive into in subsequent posts to give you a brief synopsis.

  • NULLs and the NOT IN predicate.
  • Incorrect subquery column.
  • Data type mismatch in predicates.
  • Predicate evaluation order.
  • Outer joins and placement of predicates.
  • Functions on indexed columns in predicates.

Leave a Reply