Subrat's Technical Blog: SQL–Only Counting Records Sufficiently Spaced apart using Analytics with Windowing Clause and Anti Join

A nice SQL challenge was presented to me by a colleague. The challenges basically consisted of this table. A table contains records that describe login events. Each record has a login timestamp and the identifier of the person logging in. The challenge is to count “unique” login events. These have been defined as unique, non-overlapping two hour periods in which a person has logged in. Such as period starts with a login that is at least two hours after the start of a previous period for that same person.

Visually this can be shown like this:

You can follow along with this challenge in the live database environment offered by LiveSQL at this link.

All the records marked with a star are counted. All other records are within 2 hours from a starred record and are therefore excluded because they are considered duplicate logins

I followed the following logic while constructing a solution for this challenge:

For each login record, find the first login record of the same person in the period of two hours prior to the record; that can be the record itself, but could also be an earlier login record
All records that are the first in a two-hour period are “block starters”, records that start a two-hour period in which we can ignore all subsequent logins for that same person (these block starters are definitely to be included in the final result)
Find all records (we call them duplicate_logins) that fall into a two hour login block for a person (a two-hour period from the login time of a block starter); these are logins that we should not count
Select all records that are not a duplicate_login – these are records that may have a preceding login in the two hours before their login timestamp but that login is within a two hour block and they are not. Note: since the block_starters are also counted as duplicate_login, they must be added separately to the result

Each bullet is implemented in the query with an inline view:

earliest_logins: For each login record, find the first login record of the same person in the period of two hours prior to the record; that can be the record itself, but could also be an earlier login record

block_starters: All records that are the first in a two-hour period are “block starters”, records that start a two-hour period in which we can ignore all subsequent logins for that same person (these block starters are definitely to be included in the final result)

duplicate_logins: Find all records (we call them duplicate_logins) that fall into a two hour login block for a person (a two-hour period from the login time of a block starter); these are logins that we should not count

and finally:

Select all records that are not a duplicate_login – these are records that may have a preceding login in the two hours before their login timestamp but that login is within a two hour block and they are not. Note: since the block_starters are also counted as duplicate_login, they must be added separately to the result

The last step that should be added is of course the count operation (select l.personid, count(login_time) from final group by personid).

Overall query and result

This result is for this set of records:

Resources

LiveSQL – demo at this link.

Anti Join in SQL – my earlier anti search pattern exploration https://technology.amis.nl/it/anti-search-patterns-sql-to-look-for-what-is-not-there-part-one/

As by my understanding of the original challenge description, I think that this solution does have a problem.

Specifically, for your sample data set, the rows to be counted should be as follows:

PERSONID LOGIN_TIME COUNT_YN
——————————————-
1 2021-01-01 00:00 Y
1 2021-01-01 01:00
1 2021-01-01 01:59
1 2021-01-01 02:00
1 2021-01-01 02:39 Y
1 2021-01-01 03:00
1 2021-01-01 04:59 Y

2 2021-01-01 01:01 Y
2 2021-01-01 01:30
2 2021-01-01 02:00
2 2021-01-01 05:00 Y
2 2021-01-01 06:00

12 rows selected.

In other words, for personid = 1 the solution should return 3 rows only, and not 4.

Depending on whether the 2 hours interval limit is inclusive or not, the solution should include either the row with time 02:00
( if the 2 hour limit is exclusive ) or the row with time 02:39 (if the 2 hour limit is inclusive),
but, anyway, it cannot contain the row with time 03:00, which is less than 2 hours after the previously included row.

The problem is that, for personid=1, the “block_starters” subquery only returns the first row (time 00:00),
while in fact there are 3 rows

It looks to me that the problem cannot be solved without using one form or another of a “procedural” approach.

Here are 3 solutions, using 3 different approaches ( considering the 2 hours interval as being inclusive ):

——————————————
a. Using a recursive WITH query
——————————————

with recurs (personid, lvl, login_time, last_time, rn) as (
select personid,
1 lvl,
min(login_time),
min(login_time) + interval ‘2’ hour,
1
from logins
group by personid
union all
select r.personid,
r.lvl + 1,
l.login_time,
l.login_time + interval ‘2’ hour,
row_number() over(partition by r.personid
order by l.login_time) rn
from recurs r,
logins l
where l.personid = r.personid
and l.login_time > r.last_time
and r.rn = 1
)
—
select personid, login_time
from recurs
where rn = 1
order by personid, login_time
/

PERSONID LOGIN_TIME
——————————–
1 2021-01-01 00:00
1 2021-01-01 02:39
1 2021-01-01 04:59
2 2021-01-01 01:01
2 2021-01-01 05:00

5 rows selected.

——————————————
b. Using MATCH_RECOGNIZE
——————————————

select personid, login_time
from logins
match_recognize (
partition by personid
order by login_time
measures
match_number() mn,
strt.login_time as login_time
one row per match
pattern (strt same*)
define
same as (login_time <= strt.login_time + interval '2' hour)
)
/

PERSONID LOGIN_TIME
——————————–
1 2021-01-01 00:00
1 2021-01-01 02:39
1 2021-01-01 04:59
2 2021-01-01 01:01
2 2021-01-01 05:00

5 rows selected.

——————————————
c. Using the MODEL clause
——————————————

select personid, login_time
from (
select *
from logins
model
partition by (personid)
dimension by (row_number() over (partition by personid
order by login_time) as rn)
measures (
login_time,
'N' count_yn
)
rules (
count_yn[any] order by rn =
case
when cv(rn) = 1 then 'Y'
when max(case when count_yn = 'Y'
then login_time
end) [rn < cv()] + interval '2' hour < login_time[cv()] then 'Y'
else 'N'
end
)
)
where count_yn = 'Y'
/

PERSONID LOGIN_TIME
——————————–
1 2021-01-01 00:00
1 2021-01-01 02:39
1 2021-01-01 04:59
2 2021-01-01 01:01
2 2021-01-01 05:00

5 rows selected.

Thanks a lot for your attention & Best Regards,

Iudith Mentzel

Subrat's Technical Blog

Wednesday, March 3, 2021

SQL–Only Counting Records Sufficiently Spaced apart using Analytics with Windowing Clause and Anti Join

Resources

No comments:

Must Watch YouTube Videos for Databricks Platform Administrators

Report Abuse