Database SQL Interview Question: Finding Each User's Longest Consecutive Login Streak in GBase 8a, Where a Gap of 3 Days or Less Counts as Consecutive

This question comes from a Tencent interview question shared by a netizen. The original statement, excerpted: find each user's longest run of consecutive login days, where a gap of 3 days or less between two login dates counts as consecutive. For example, if a user logs in on 01-01 and the next login is on 01-04, the gap is exactly 3 days, so all the days between those two dates count as active days, 4 days in total.

Analysis

First, we need to clarify whether logins are counted by calendar day only, ignoring the time of day. Under that interpretation, crossing midnight counts as 1 day apart even if the actual gap is only 1 second, e.g. 2020-01-01 23:59:59 vs. 2020-01-02 00:00:00.

gbase> select datediff('2020-01-01 23:59:59','2020-01-02 00:00:00');
+-------------------------------------------------------+
| datediff('2020-01-01 23:59:59','2020-01-02 00:00:00') |
+-------------------------------------------------------+
|                                                    -1 |
+-------------------------------------------------------+
1 row in set (Elapsed: 00:00:00.00)

This determines the calculation method. For calendar days, datediff is enough (note that datediff(earlier, later) returns a negative value, as shown above). For 24-hour periods, the gap has to be computed at a finer granularity with timestampdiff.

gbase> select timestampdiff(hour,'2020-01-01 23:59:59','2020-01-02 00:00:00')diff;
+------+
| diff |
+------+
|    0 |
+------+
1 row in set (Elapsed: 00:00:00.00)

gbase> select timestampdiff(minute,'2020-01-01 23:59:59','2020-01-02 00:00:00')diff;
+------+
| diff |
+------+
|    0 |
+------+
1 row in set (Elapsed: 00:00:00.00)

gbase> select timestampdiff(second,'2020-01-01 23:59:59','2020-01-02 00:00:00')diff;
+------+
| diff |
+------+
|    1 |
+------+
1 row in set (Elapsed: 00:00:00.00)

So it depends on the granularity: for calendar days, datediff is sufficient; for 24-hour periods, use timestampdiff with the hour unit. Either way, this must be clarified up front.

Since the problem statement does not say, the rest of this article uses calendar days as the basic granularity.
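
If the requirement were instead interpreted as 72 hours rather than 3 calendar days, only the continuity test used later would change. The sketch below contrasts the two tests on a made-up pair of timestamps (they are not rows from the table that follows): the calendar-day rule treats the pair as consecutive, while a strict 72-hour rule treats it as a break.

-- A hedged sketch with illustrative literals: datediff gives 3 (<=3, so flag 0),
-- while timestampdiff(hour, ...) gives 95 (>72, so flag 1).
select if(datediff('2020-01-04 23:59:00','2020-01-01 00:01:00')<=3,0,1) flag_day_rule,
       if(timestampdiff(hour,'2020-01-01 00:01:00','2020-01-04 23:59:00')<=72,0,1) flag_hour_rule;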

Original login data

This data set covers cases such as multiple logins within one day and logins at different times of day. Note that it deliberately includes a gap of about one month between logins.

create table test_login(user_id int,login_date timestamp);
 
insert into test_login values (1,'2020-01-01 00:01:00');
insert into test_login values (1,'2020-01-01 00:02:00');
insert into test_login values (1,'2020-01-01 00:03:00');
insert into test_login values (1,'2020-01-02 00:03:00');
insert into test_login values (1,'2020-01-05 00:03:00');
insert into test_login values (1,'2020-01-07 00:03:00');
insert into test_login values (1,'2020-01-11 00:03:00');
insert into test_login values (1,'2020-01-12 00:03:00');
insert into test_login values (1,'2020-01-13 00:03:00');
insert into test_login values (1,'2020-01-14 00:03:00');
insert into test_login values (1,'2020-01-17 00:03:00');
insert into test_login values (1,'2020-01-18 00:03:00');
insert into test_login values (1,'2020-01-19 00:03:00');
insert into test_login values (1,'2020-01-22 00:03:00');
insert into test_login values (1,'2020-02-22 00:03:00');
insert into test_login values (1,'2020-03-22 00:03:00');
insert into test_login values (1,'2020-03-23 00:03:00');
 
insert into test_login values (2,'2020-01-01 00:04:00');
insert into test_login values (2,'2020-01-01 00:05:00');
insert into test_login values (2,'2020-01-01 00:06:00');
insert into test_login values (2,'2020-01-02 00:06:00');
insert into test_login values (2,'2020-01-07 00:06:00');
insert into test_login values (2,'2020-01-11 00:06:00');
insert into test_login values (2,'2020-01-12 00:06:00');
insert into test_login values (2,'2020-01-14 00:06:00');
insert into test_login values (2,'2020-01-18 00:06:00');
insert into test_login values (2,'2020-01-19 00:06:00');
insert into test_login values (2,'2020-01-22 00:06:00');

View the original data

gbase> select * from test_login;
+---------+---------------------+
| user_id | login_date          |
+---------+---------------------+
|       1 | 2020-01-01 00:01:00 |
|       1 | 2020-01-01 00:02:00 |
|       1 | 2020-01-01 00:03:00 |
|       1 | 2020-01-02 00:03:00 |
|       1 | 2020-01-05 00:03:00 |
|       1 | 2020-01-07 00:03:00 |
|       1 | 2020-01-11 00:03:00 |
|       1 | 2020-01-12 00:03:00 |
|       1 | 2020-01-13 00:03:00 |
|       1 | 2020-01-14 00:03:00 |
|       1 | 2020-01-17 00:03:00 |
|       1 | 2020-01-18 00:03:00 |
|       1 | 2020-01-19 00:03:00 |
|       1 | 2020-01-22 00:03:00 |
|       1 | 2020-02-22 00:03:00 |
|       1 | 2020-03-22 00:03:00 |
|       1 | 2020-03-23 00:03:00 |
|       2 | 2020-01-01 00:04:00 |
|       2 | 2020-01-01 00:05:00 |
|       2 | 2020-01-01 00:06:00 |
|       2 | 2020-01-02 00:06:00 |
|       2 | 2020-01-07 00:06:00 |
|       2 | 2020-01-11 00:06:00 |
|       2 | 2020-01-12 00:06:00 |
|       2 | 2020-01-14 00:06:00 |
|       2 | 2020-01-18 00:06:00 |
|       2 | 2020-01-19 00:06:00 |
|       2 | 2020-01-22 00:06:00 |
+---------+---------------------+
28 rows in set (Elapsed: 00:00:00.00)

Below, we work through the full solution step by step.

Get the previous login time for each login record

The lag window function returns a value from the previous row. We must partition by user_id, since there are multiple users, and order by login_date ascending so that each row sees the login immediately before it. The expression is

lag(login_date)over(partition by user_id order by login_date) login_date_last

which yields the previous login time, login_date_last. For each user's first login, login_date_last is NULL.

gbase> create table test_login_last as select user_id,login_date,lag(login_date)over(partition by user_id order by login_date) login_date_last from test_login order by user_id,login_date;
Query OK, 28 rows affected (Elapsed: 00:00:01.07)

gbase> select * from test_login_last;
+---------+---------------------+---------------------+
| user_id | login_date          | login_date_last     |
+---------+---------------------+---------------------+
|       1 | 2020-01-01 00:01:00 |                NULL |
|       1 | 2020-01-01 00:02:00 | 2020-01-01 00:01:00 |
|       1 | 2020-01-01 00:03:00 | 2020-01-01 00:02:00 |
|       1 | 2020-01-02 00:03:00 | 2020-01-01 00:03:00 |
|       1 | 2020-01-05 00:03:00 | 2020-01-02 00:03:00 |
|       1 | 2020-01-07 00:03:00 | 2020-01-05 00:03:00 |
|       1 | 2020-01-11 00:03:00 | 2020-01-07 00:03:00 |
|       1 | 2020-01-12 00:03:00 | 2020-01-11 00:03:00 |
|       1 | 2020-01-13 00:03:00 | 2020-01-12 00:03:00 |
|       1 | 2020-01-14 00:03:00 | 2020-01-13 00:03:00 |
|       1 | 2020-01-17 00:03:00 | 2020-01-14 00:03:00 |
|       1 | 2020-01-18 00:03:00 | 2020-01-17 00:03:00 |
|       1 | 2020-01-19 00:03:00 | 2020-01-18 00:03:00 |
|       1 | 2020-01-22 00:03:00 | 2020-01-19 00:03:00 |
|       1 | 2020-02-22 00:03:00 | 2020-01-22 00:03:00 |
|       1 | 2020-03-22 00:03:00 | 2020-02-22 00:03:00 |
|       1 | 2020-03-23 00:03:00 | 2020-03-22 00:03:00 |
|       2 | 2020-01-01 00:04:00 |                NULL |
|       2 | 2020-01-01 00:05:00 | 2020-01-01 00:04:00 |
|       2 | 2020-01-01 00:06:00 | 2020-01-01 00:05:00 |
|       2 | 2020-01-02 00:06:00 | 2020-01-01 00:06:00 |
|       2 | 2020-01-07 00:06:00 | 2020-01-02 00:06:00 |
|       2 | 2020-01-11 00:06:00 | 2020-01-07 00:06:00 |
|       2 | 2020-01-12 00:06:00 | 2020-01-11 00:06:00 |
|       2 | 2020-01-14 00:06:00 | 2020-01-12 00:06:00 |
|       2 | 2020-01-18 00:06:00 | 2020-01-14 00:06:00 |
|       2 | 2020-01-19 00:06:00 | 2020-01-18 00:06:00 |
|       2 | 2020-01-22 00:06:00 | 2020-01-19 00:06:00 |
+---------+---------------------+---------------------+
28 rows in set (Elapsed: 00:00:00.01)

Compute the gap between logins

Compute the difference between the current login and the previous one, here with datediff. For other granularities, refer back to the first section.

This produces a diff column holding the number of calendar days between the two logins. Note that in this data set, multiple logins within the same day yield a diff of 0 (an optional deduplication variant is sketched after the output below).

gbase> create table test_login_diff as select user_id,login_date,login_date_last,datediff(login_date,login_date_last)diff from test_login_last;
Query OK, 28 rows affected (Elapsed: 00:00:00.86)

gbase> select * from test_login_diff;
+---------+---------------------+---------------------+------+
| user_id | login_date          | login_date_last     | diff |
+---------+---------------------+---------------------+------+
|       1 | 2020-01-01 00:01:00 |                NULL | NULL |
|       1 | 2020-01-01 00:02:00 | 2020-01-01 00:01:00 |    0 |
|       1 | 2020-01-01 00:03:00 | 2020-01-01 00:02:00 |    0 |
|       1 | 2020-01-02 00:03:00 | 2020-01-01 00:03:00 |    1 |
|       1 | 2020-01-05 00:03:00 | 2020-01-02 00:03:00 |    3 |
|       1 | 2020-01-07 00:03:00 | 2020-01-05 00:03:00 |    2 |
|       1 | 2020-01-11 00:03:00 | 2020-01-07 00:03:00 |    4 |
|       1 | 2020-01-12 00:03:00 | 2020-01-11 00:03:00 |    1 |
|       1 | 2020-01-13 00:03:00 | 2020-01-12 00:03:00 |    1 |
|       1 | 2020-01-14 00:03:00 | 2020-01-13 00:03:00 |    1 |
|       1 | 2020-01-17 00:03:00 | 2020-01-14 00:03:00 |    3 |
|       1 | 2020-01-18 00:03:00 | 2020-01-17 00:03:00 |    1 |
|       1 | 2020-01-19 00:03:00 | 2020-01-18 00:03:00 |    1 |
|       1 | 2020-01-22 00:03:00 | 2020-01-19 00:03:00 |    3 |
|       1 | 2020-02-22 00:03:00 | 2020-01-22 00:03:00 |   31 |
|       1 | 2020-03-22 00:03:00 | 2020-02-22 00:03:00 |   29 |
|       1 | 2020-03-23 00:03:00 | 2020-03-22 00:03:00 |    1 |
|       2 | 2020-01-01 00:04:00 |                NULL | NULL |
|       2 | 2020-01-01 00:05:00 | 2020-01-01 00:04:00 |    0 |
|       2 | 2020-01-01 00:06:00 | 2020-01-01 00:05:00 |    0 |
|       2 | 2020-01-02 00:06:00 | 2020-01-01 00:06:00 |    1 |
|       2 | 2020-01-07 00:06:00 | 2020-01-02 00:06:00 |    5 |
|       2 | 2020-01-11 00:06:00 | 2020-01-07 00:06:00 |    4 |
|       2 | 2020-01-12 00:06:00 | 2020-01-11 00:06:00 |    1 |
|       2 | 2020-01-14 00:06:00 | 2020-01-12 00:06:00 |    2 |
|       2 | 2020-01-18 00:06:00 | 2020-01-14 00:06:00 |    4 |
|       2 | 2020-01-19 00:06:00 | 2020-01-18 00:06:00 |    1 |
|       2 | 2020-01-22 00:06:00 | 2020-01-19 00:06:00 |    3 |
+---------+---------------------+---------------------+------+
28 rows in set (Elapsed: 00:00:00.01)
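
As a side note: since same-day logins only contribute diff = 0 rows, the log could optionally be collapsed to one row per user per calendar day before applying the window logic. The sketch below assumes date() and select distinct behave as in MySQL; it is not required for the solution that follows.

-- Optional variant: one row per user per calendar day; the later lag()/sum()
-- steps would then run against test_login_day instead of test_login.
create table test_login_day as
select distinct user_id, date(login_date) login_day
from test_login;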

Flag breaks between consecutive logins

Per the problem statement, logins are consecutive when the gap is 3 days or less. A user can have several breaks, and each break restarts the count of consecutive days, so we first need to mark where the breaks occur.

Here, rows whose gap exceeds 3 days get flag1 = 1. Note that each user's first login, whose diff is NULL, also gets flag1 = 1, because the if condition is not satisfied for NULL.

gbase> create  table test_login_flag1 as select *,if(diff<=3,0,1) flag1 from test_login_diff order by user_id,login_date;
Query OK, 28 rows affected (Elapsed: 00:00:00.98)

gbase> select * from test_login_flag1;
+---------+---------------------+---------------------+------+-------+
| user_id | login_date          | login_date_last     | diff | flag1 |
+---------+---------------------+---------------------+------+-------+
|       1 | 2020-01-01 00:01:00 |                NULL | NULL |     1 |
|       1 | 2020-01-01 00:02:00 | 2020-01-01 00:01:00 |    0 |     0 |
|       1 | 2020-01-01 00:03:00 | 2020-01-01 00:02:00 |    0 |     0 |
|       1 | 2020-01-02 00:03:00 | 2020-01-01 00:03:00 |    1 |     0 |
|       1 | 2020-01-05 00:03:00 | 2020-01-02 00:03:00 |    3 |     0 |
|       1 | 2020-01-07 00:03:00 | 2020-01-05 00:03:00 |    2 |     0 |
|       1 | 2020-01-11 00:03:00 | 2020-01-07 00:03:00 |    4 |     1 |
|       1 | 2020-01-12 00:03:00 | 2020-01-11 00:03:00 |    1 |     0 |
|       1 | 2020-01-13 00:03:00 | 2020-01-12 00:03:00 |    1 |     0 |
|       1 | 2020-01-14 00:03:00 | 2020-01-13 00:03:00 |    1 |     0 |
|       1 | 2020-01-17 00:03:00 | 2020-01-14 00:03:00 |    3 |     0 |
|       1 | 2020-01-18 00:03:00 | 2020-01-17 00:03:00 |    1 |     0 |
|       1 | 2020-01-19 00:03:00 | 2020-01-18 00:03:00 |    1 |     0 |
|       1 | 2020-01-22 00:03:00 | 2020-01-19 00:03:00 |    3 |     0 |
|       1 | 2020-02-22 00:03:00 | 2020-01-22 00:03:00 |   31 |     1 |
|       1 | 2020-03-22 00:03:00 | 2020-02-22 00:03:00 |   29 |     1 |
|       1 | 2020-03-23 00:03:00 | 2020-03-22 00:03:00 |    1 |     0 |
|       2 | 2020-01-01 00:04:00 |                NULL | NULL |     1 |
|       2 | 2020-01-01 00:05:00 | 2020-01-01 00:04:00 |    0 |     0 |
|       2 | 2020-01-01 00:06:00 | 2020-01-01 00:05:00 |    0 |     0 |
|       2 | 2020-01-02 00:06:00 | 2020-01-01 00:06:00 |    1 |     0 |
|       2 | 2020-01-07 00:06:00 | 2020-01-02 00:06:00 |    5 |     1 |
|       2 | 2020-01-11 00:06:00 | 2020-01-07 00:06:00 |    4 |     1 |
|       2 | 2020-01-12 00:06:00 | 2020-01-11 00:06:00 |    1 |     0 |
|       2 | 2020-01-14 00:06:00 | 2020-01-12 00:06:00 |    2 |     0 |
|       2 | 2020-01-18 00:06:00 | 2020-01-14 00:06:00 |    4 |     1 |
|       2 | 2020-01-19 00:06:00 | 2020-01-18 00:06:00 |    1 |     0 |
|       2 | 2020-01-22 00:06:00 | 2020-01-19 00:06:00 |    3 |     0 |
+---------+---------------------+---------------------+------+-------+
28 rows in set (Elapsed: 00:00:00.02)

Compute the cumulative number of breaks

That is, the running sum of flag1 up to and including the current record. This value identifies which segment of consecutive logins each row belongs to.

sum(flag1)over(partition by user_id order by login_date) flag2

gbase> create table test_login_flag2 as  select *,sum(flag1)over(partition by user_id order by login_date) flag2 from test_login_flag1;
Query OK, 28 rows affected (Elapsed: 00:00:01.26)

gbase> select * from test_login_flag2;
+---------+---------------------+---------------------+------+-------+-------+
| user_id | login_date          | login_date_last     | diff | flag1 | flag2 |
+---------+---------------------+---------------------+------+-------+-------+
|       1 | 2020-01-01 00:01:00 |                NULL | NULL |     1 |     1 |
|       1 | 2020-01-01 00:02:00 | 2020-01-01 00:01:00 |    0 |     0 |     1 |
|       1 | 2020-01-01 00:03:00 | 2020-01-01 00:02:00 |    0 |     0 |     1 |
|       1 | 2020-01-02 00:03:00 | 2020-01-01 00:03:00 |    1 |     0 |     1 |
|       1 | 2020-01-05 00:03:00 | 2020-01-02 00:03:00 |    3 |     0 |     1 |
|       1 | 2020-01-07 00:03:00 | 2020-01-05 00:03:00 |    2 |     0 |     1 |
|       1 | 2020-01-11 00:03:00 | 2020-01-07 00:03:00 |    4 |     1 |     2 |
|       1 | 2020-01-12 00:03:00 | 2020-01-11 00:03:00 |    1 |     0 |     2 |
|       1 | 2020-01-13 00:03:00 | 2020-01-12 00:03:00 |    1 |     0 |     2 |
|       1 | 2020-01-14 00:03:00 | 2020-01-13 00:03:00 |    1 |     0 |     2 |
|       1 | 2020-01-17 00:03:00 | 2020-01-14 00:03:00 |    3 |     0 |     2 |
|       1 | 2020-01-18 00:03:00 | 2020-01-17 00:03:00 |    1 |     0 |     2 |
|       1 | 2020-01-19 00:03:00 | 2020-01-18 00:03:00 |    1 |     0 |     2 |
|       1 | 2020-01-22 00:03:00 | 2020-01-19 00:03:00 |    3 |     0 |     2 |
|       1 | 2020-02-22 00:03:00 | 2020-01-22 00:03:00 |   31 |     1 |     3 |
|       1 | 2020-03-22 00:03:00 | 2020-02-22 00:03:00 |   29 |     1 |     4 |
|       1 | 2020-03-23 00:03:00 | 2020-03-22 00:03:00 |    1 |     0 |     4 |
|       2 | 2020-01-01 00:04:00 |                NULL | NULL |     1 |     1 |
|       2 | 2020-01-01 00:05:00 | 2020-01-01 00:04:00 |    0 |     0 |     1 |
|       2 | 2020-01-01 00:06:00 | 2020-01-01 00:05:00 |    0 |     0 |     1 |
|       2 | 2020-01-02 00:06:00 | 2020-01-01 00:06:00 |    1 |     0 |     1 |
|       2 | 2020-01-07 00:06:00 | 2020-01-02 00:06:00 |    5 |     1 |     2 |
|       2 | 2020-01-11 00:06:00 | 2020-01-07 00:06:00 |    4 |     1 |     3 |
|       2 | 2020-01-12 00:06:00 | 2020-01-11 00:06:00 |    1 |     0 |     3 |
|       2 | 2020-01-14 00:06:00 | 2020-01-12 00:06:00 |    2 |     0 |     3 |
|       2 | 2020-01-18 00:06:00 | 2020-01-14 00:06:00 |    4 |     1 |     4 |
|       2 | 2020-01-19 00:06:00 | 2020-01-18 00:06:00 |    1 |     0 |     4 |
|       2 | 2020-01-22 00:06:00 | 2020-01-19 00:06:00 |    3 |     0 |     4 |
+---------+---------------------+---------------------+------+-------+-------+
28 rows in set (Elapsed: 00:00:00.00)

Compute the span of each consecutive-login segment

Grouping by user_id and flag2, compute the difference in days between the latest and earliest login in each segment.

gbase> create table test_login_continue as select user_id,flag2,datediff(max(login_date),min(login_date)) cont from test_login_flag2 group by user_id,flag2 order by user_id,flag2;
Query OK, 8 rows affected (Elapsed: 00:00:01.11)

gbase> select * from test_login_continue;
+---------+-------+------+
| user_id | flag2 | cont |
+---------+-------+------+
|       1 |     1 |    6 |
|       1 |     2 |   11 |
|       1 |     3 |    0 |
|       1 |     4 |    1 |
|       2 |     1 |    1 |
|       2 |     2 |    0 |
|       2 |     3 |    3 |
|       2 |     4 |    4 |
+---------+-------+------+
8 rows in set (Elapsed: 00:00:00.01)

Get the longest consecutive login streak

Per the problem statement, we need to add 1: a login on the 1st followed by one on the 4th is a gap of 3 days, but it counts as 4 consecutive login days.

gbase> select user_id,max(cont)+1 cont from test_login_continue group by user_id order by user_id;
+---------+------+
| user_id | cont |
+---------+------+
|       1 |   12 |
|       2 |    5 |
+---------+------+
2 rows in set (Elapsed: 00:00:00.15)
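
For completeness, the whole derivation can also be written as a single statement with nested subqueries instead of intermediate tables. This is only a sketch built from the lag()/sum() over(), if() and datediff() features already used above; it should return the same result (12 for user 1, 5 for user 2), but verify it on your own GBase 8a version.

select user_id, max(cont)+1 cont
from (
    -- span in days of each consecutive-login segment
    select user_id, flag2, datediff(max(login_date),min(login_date)) cont
    from (
        select user_id, login_date,
               -- running count of breaks (gap > 3 days, or first login) = segment id
               sum(if(datediff(login_date,login_date_last)<=3,0,1))
                   over(partition by user_id order by login_date) flag2
        from (
            -- previous login time per user
            select user_id, login_date,
                   lag(login_date)over(partition by user_id order by login_date) login_date_last
            from test_login
        ) t1
    ) t2
    group by user_id, flag2
) t3
group by user_id
order by user_id;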