Shannoah G.
asked 02/14/20Data analyst sample question
The table has 4 fields: date, event_name, video_id and user_id. Write a query that will return the number of users who viewed at least one video in a given day and the number of those users who returned the next day to view at least one video. The video_play event signifies that a video was played by a user. Imagine the actual data set is much larger than shown in the sample data.
Sample Data:
| date | event_name | video_id | user_id |
| 2018-01-01 | video_play | 51651561651 | 989189198 |
| 2018-01-01 | video_play | 98121651656 | 561884864 |
| 2018-01-01 | video_play | 78918918918 | 561884864 |
| 2018-01-01 | create_video | 32156541355 | 153215651 |
| 2018-01-01 | create_video | 87351531311 | 232135135 |
| 2018-01-02 | create_video | 51651561651 | 489846581 |
| 2018-01-02 | video_play | 12315315352 | 561884864 |
| 2018-01-02 | create_video | 32156541355 | 321351351 |
| 2018-01-02 | create_video | 87351531311 | 231513515 |
| 2018-01-02 | video_play | 98191891894 | 615616516 |
Desired output:
The output should have one row per day. Each row should have 3 columns, date, users_view_videos and users_view_next_day.
- users_view_videos is the number of users who viewed at least one video in the given date
- users_view_next_day is the subset number of users from user_view_videos who also view a video on the next day
From the sample data provided above this should be the output:
| date | users_view_videos | users_view_next_day |
| 2018-01-01 | 2 | 1 |
| 2018-01-02 | 2 | 0 |
1 Expert Answer
Jonathan K. answered 03/25/20
Harvard grad, philosophy MA, Wall St bank IT guy -- loves to teach!
This example involves two concepts: self joins and derived tables.
Start by noticing that each row of the data set has exactly one date on it. However, our output has to refer to counts of videos watched by a customer on two dates on each row: a day and the next day. That means we need two copies of the table -- that is, we join the table to itself, where the copy on the right side of the join has the date shifted one day forward.
That self-join will give us the basic data we need, but to aggregate it as required, we then need to treat that self-join as a derived table, and do an appropriate count on it. Concretely, the query will look like this. (I called the data table "sample".)
select
date
, count(user_id) as user_view_videos
, count(next_user) as user_view_next_day
from
(select
s.date
, s.user_id
, t.date as next_date
, t.user_id as next_user
from
sample s
left join sample t
on s.user_id = t.user_id
and s.event_name = t.event_name
and t.date = s.date + 1
where s.event_name = 'video_play'
group by s.date, s.user_id, t.date, t.user_id) as base
group by date
Another point to notice: we use a left join, because we want to cover the case where a user watches videos on one day, but not the next. A regular inner join would remove those. Note also that, depending on the flavor of SQL you are using, the date addition which is just a +1 here might require a DATEADD().
If you want to check this yourself in an appropriate IDE, here are the table scripts:
create table sample (
date date,
event_name varchar(32),
video_id bigint,
user_id bigint );
insert into sample values ('2018-01-01','video_play',51651561651,989189198);
insert into sample values ('2018-01-01','video_play',98121651656,561884864);
insert into sample values ('2018-01-01','video_play',78918918918,561884864);
insert into sample values ('2018-01-01','create_video',32156541355,153215651);
insert into sample values ('2018-01-01','create_video',87351531311,232135135);
insert into sample values ('2018-01-02','create_video',51651561651,489846581);
insert into sample values ('2018-01-02','video_play',12315315352,561884864);
insert into sample values ('2018-01-02','create_video',32156541355,321351351);
insert into sample values ('2018-01-02','create_video',87351531311,231513515);
insert into sample values ('2018-01-02','video_play',98191891894,615616516)
Still looking for help? Get the right answer, fast.
Get a free answer to a quick problem.
Most questions answered within 4 hours.
OR
Choose an expert and meet online. No packages or subscriptions, pay only for the time you need.
Arvind P.
Here is a hint use having clause which ranges to the input date +1 (after grouping by the event name video)03/22/20