# Spark-SQL Date/Time Functions in Practice: From Data Cleaning to Business Insights
## 1. Cleaning E-Commerce Order Data

When I first started doing e-commerce analytics, nothing gave me more headaches than the zoo of timestamp formats: some records carried milliseconds, some mixed in timezone markers, and some were plain Unix epoch numbers. The first time I received three months of order data from a platform, standardizing the order-time field alone took me two full days. Spark-SQL's date functions are a Swiss Army knife for this kind of work; here are a few typical problems and fixes from real projects.

Suppose we have a raw table named `orders` with a `create_time` column storing ISO-format strings like `2023-07-15T14:30:22.123Z`:

```sql
-- Step 1: standardize the time format
SELECT order_id,
       to_timestamp(create_time, "yyyy-MM-dd'T'HH:mm:ss.SSSXXX") AS standard_time
FROM orders
```

`to_timestamp` is remarkably practical: its pattern argument supports the full Spark datetime pattern syntax, and since Spark 3.x it parses ISO-8601 strings even without an explicit pattern. One pitfall I hit: a cross-border platform's data mixed UTC and local times, and I only got it right after adding a timezone conversion:

```sql
-- Handle data carrying a timezone offset
SELECT from_utc_timestamp(
         to_timestamp(regexp_replace(create_time, '\\+08:00', '')),
         'Asia/Shanghai'
       ) AS local_time
FROM orders
```

Then there are the everyday field extractions, for example when analyzing user activity by week:

```sql
-- Extract year, month, day, and weekday
SELECT order_id,
       year(standard_time)  AS order_year,
       month(standard_time) AS order_month,
       day(standard_time)   AS order_day,
       date_format(standard_time, 'EEEE') AS weekday_name
FROM standardized_orders
```

In real projects I've found `date_format`'s pattern argument especially flexible. Once the business side asked for quarter labels like `Q3-2023`, and this one-liner did it:

```sql
concat('Q', quarter(order_date), '-', year(order_date)) AS fiscal_quarter
```

## 2. User Lifecycle Analysis

For retention analysis, the `datediff`/`date_add` pair has been my lifesaver. Last year, while doing the retrospective for the 618 (June 18) sale, I needed the interval between each user's first order and their first repurchase:

```sql
WITH first_orders AS (
  SELECT user_id, min(order_date) AS first_purchase_date
  FROM orders
  GROUP BY user_id
)
SELECT user_id,
       datediff(second_purchase_date, first_purchase_date) AS days_to_repurchase
FROM (
  SELECT a.user_id,
         a.first_purchase_date,
         min(b.order_date) AS second_purchase_date
  FROM first_orders a
  JOIN orders b
    ON  a.user_id = b.user_id
    AND b.order_date > a.first_purchase_date
  GROUP BY a.user_id, a.first_purchase_date
)
```

One easy trap: calling `datediff` directly on timestamps can give odd results — always convert to a date type first. I once got an important report wrong this way, and have since made casting a habit:

```sql
datediff(
  to_date(timestamp_column),
  to_date(another_timestamp)
)
```

For 30-day retention, `date_add`-style interval logic combined with conditional counting is very efficient:

```sql
-- 30-day retention
SELECT first_purchase_date,
       count(DISTINCT user_id) AS new_users,
       count(DISTINCT CASE WHEN datediff(login_date, first_purchase_date) <= 30
                           THEN user_id END) AS retained_users
FROM user_activity
GROUP BY first_purchase_date
```
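Before shipping a query like the repurchase-interval join above, I find it helps to sanity-check the logic against a handful of rows in plain Python. The sample data and function name below are purely illustrative, not part of any real pipeline:

```python
from datetime import date
from collections import defaultdict

# Hypothetical sample rows standing in for the orders table: (user_id, order_date)
orders = [
    (1, date(2023, 7, 1)), (1, date(2023, 7, 20)), (1, date(2023, 8, 5)),
    (2, date(2023, 7, 3)),                          # one order only: no repurchase
    (3, date(2023, 7, 10)), (3, date(2023, 7, 12)),
]

def days_to_repurchase(rows):
    """Mirror the SQL: find each user's first order, then the gap to the second."""
    by_user = defaultdict(list)
    for user_id, d in rows:
        by_user[user_id].append(d)
    result = {}
    for user_id, dates in by_user.items():
        dates.sort()
        if len(dates) >= 2:                          # users with a single order drop out,
            result[user_id] = (dates[1] - dates[0]).days  # just like the inner JOIN
    return result

print(days_to_repurchase(orders))  # {1: 19, 3: 2}
```

Note that user 2 disappears from the output, exactly as the `JOIN ... AND b.order_date > a.first_purchase_date` drops users with no second order.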
## 3. Monthly Business Metrics

`last_day` saves me a lot of effort when building end-of-month reports. I used to compute monthly GMV by hand-handling the varying number of days per month; now it's simple:

```sql
SELECT last_day(order_date) AS month_end,
       sum(amount) AS monthly_gmv,
       count(DISTINCT user_id) AS active_users
FROM orders
GROUP BY last_day(order_date)
```

A trick I discovered recently: use `trunc` for natural-week statistics:

```sql
SELECT trunc(order_date, 'WEEK') AS week_start_date,
       sum(amount) AS weekly_sales
FROM orders
GROUP BY trunc(order_date, 'WEEK')
ORDER BY week_start_date
```

For fiscal-year data, `add_months` is far more reliable than manual arithmetic. Our company's fiscal year starts on April 1, so a year-over-year comparison looks like this (shifting last year's dates forward 12 months so their `yyyy-MM` labels line up with the current fiscal year):

```sql
-- Current fiscal year vs. the same period last year
SELECT cur.year_month,
       cur.sales,
       prev.sales AS last_year_sales
FROM (
  SELECT date_format(order_date, 'yyyy-MM') AS year_month,
         sum(amount) AS sales
  FROM orders
  WHERE order_date BETWEEN '2023-04-01' AND '2024-03-31'
  GROUP BY date_format(order_date, 'yyyy-MM')
) cur
LEFT JOIN (
  SELECT date_format(add_months(order_date, 12), 'yyyy-MM') AS year_month,
         sum(amount) AS sales
  FROM orders
  WHERE order_date BETWEEN '2022-04-01' AND '2023-03-31'
  GROUP BY date_format(add_months(order_date, 12), 'yyyy-MM')
) prev
  ON cur.year_month = prev.year_month
```
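The whole year-over-year join hinges on `add_months` mapping last year's dates onto this year's month labels, so it's worth making that shift concrete. Below is a small plain-Python analogue of the month shift — my own helper, not a Spark API — including the day-of-month clamping that `add_months` performs when the target month is shorter:

```python
from datetime import date

def add_months(d, n):
    """Rough analogue of Spark SQL's add_months: shift by n months,
    clamping the day to the target month's last day."""
    y, m = divmod(d.month - 1 + n, 12)
    y += d.year
    m += 1
    leap = y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)
    last = [31, 29 if leap else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31][m - 1]
    return date(y, m, min(d.day, last))

# Shifting FY2022 dates forward 12 months aligns their 'yyyy-MM' labels
# with the current fiscal year's labels.
print(add_months(date(2022, 4, 15), 12).strftime('%Y-%m'))  # 2023-04
print(add_months(date(2023, 1, 31), 1))                     # clamped to 2023-02-28
```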
## 4. Advanced Time-Series Analysis

When evaluating promotion effectiveness, I need to compare the data swings in the seven days before and after the event. `date_sub` together with the `lag` window function is a perfect fit:

```sql
WITH daily_metrics AS (
  SELECT order_date,
         sum(amount) AS daily_gmv,
         lag(sum(amount), 7) OVER (ORDER BY order_date) AS prev_week_gmv
  FROM orders
  WHERE order_date BETWEEN date_sub('2023-11-11', 14) AND date_add('2023-11-11', 7)
  GROUP BY order_date
)
SELECT order_date,
       daily_gmv,
       prev_week_gmv,
       (daily_gmv - prev_week_gmv) / prev_week_gmv AS week_over_week_growth
FROM daily_metrics
ORDER BY order_date
```

When working with user behavior sequences, I often use `months_between` to segment users by lifecycle stage, for example splitting new vs. established customers:

```sql
SELECT user_id,
       CASE
         WHEN months_between(current_date(), first_order_date) <= 3  THEN 'new'
         WHEN months_between(current_date(), first_order_date) <= 12 THEN 'growing'
         ELSE 'established'
       END AS user_segment
FROM user_profiles
```

Recently, while helping the marketing team review a campaign, I used `next_day` to automatically pull each Wednesday's promotion results:

```sql
-- Pull every Wednesday's order data
SELECT next_day(order_date - 7, 'WED') AS wednesday_date,
       sum(amount) AS promotion_sales
FROM orders
WHERE order_date = next_day(order_date - 7, 'WED')
GROUP BY next_day(order_date - 7, 'WED')
```

Combining time functions solves many business problems. For example, to count working days I often use `datediff` with conditional corrections (Spark's `dayofweek` returns 1 for Sunday and 7 for Saturday); note this only excludes weekends, not public holidays:

```sql
-- Working days between two dates (weekends excluded; holidays not considered)
SELECT start_date,
       end_date,
       datediff(end_date, start_date)
         - floor(datediff(end_date, start_date) / 7) * 2
         - CASE WHEN dayofweek(start_date) = 1 THEN 1 ELSE 0 END
         - CASE WHEN dayofweek(end_date)   = 7 THEN 1 ELSE 0 END AS working_days
FROM project_plans
```
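Weekend arithmetic like this is easy to get subtly wrong, so I like to cross-check it against a brute-force count. The Python sketch below (function names and sample dates are mine) transcribes the SQL formula and compares it with a direct Monday-to-Friday count; spans that start on a Sunday or end on a Saturday are worth testing against your own inclusive/exclusive convention, since the endpoint corrections can be off by a day there:

```python
from datetime import date, timedelta
import math

def working_days_sql(start, end):
    """Transcription of the SQL arithmetic above."""
    total = (end - start).days                    # datediff(end_date, start_date)
    wd = total - math.floor(total / 7) * 2        # drop one weekend per full week
    wd -= 1 if start.isoweekday() == 7 else 0     # dayofweek(start_date) = 1 (Sunday)
    wd -= 1 if end.isoweekday() == 6 else 0       # dayofweek(end_date) = 7 (Saturday)
    return wd

def working_days_exact(start, end):
    """Brute force: count Mon-Fri days in the half-open interval (start, end]."""
    return sum((start + timedelta(days=i)).isoweekday() <= 5
               for i in range(1, (end - start).days + 1))

s, e = date(2023, 11, 6), date(2023, 11, 17)      # a Monday .. a Friday
print(working_days_sql(s, e), working_days_exact(s, e))  # 9 9
```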