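Beam SlidingWindows is a windowing function in Apache Beam that assigns the elements of a PCollection to overlapping windows. The problem reported here is that, when using it, elements are not copied into multiple windows as expected.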
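To make the expected behavior concrete, the following sketch computes in plain Python which overlapping windows a timestamp should fall into when every window has a fixed length and a new window starts at a regular interval. The function name `sliding_windows_for` and its parameters are hypothetical illustrations, not Beam API, and the arithmetic is an assumption about the intended semantics rather than a copy of Beam's implementation.

```python
def sliding_windows_for(timestamp, size, period, offset=0):
    """Return every [start, end) window of length `size` that contains `timestamp`.

    Hypothetical helper for illustration only: windows start at
    offset, offset + period, offset + 2 * period, and so on.
    """
    # Start of the latest window that begins at or before the timestamp.
    last_start = timestamp - ((timestamp - offset) % period)
    windows = []
    start = last_start
    # Walk backwards one period at a time while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows


if __name__ == "__main__":
    # With size=2 and period=1, each timestamp should land in two windows.
    print(sliding_windows_for(5, size=2, period=1))
```

If an element ends up in fewer windows than this kind of calculation predicts, the element's timestamp and the size and period arguments are the first things worth checking.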
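The explanation offered is that Beam SlidingWindows builds its overlapping windows with tee() from Python's itertools module, and that tee() is not a "safe" function: the iterators it returns share one underlying source and are not thread-safe, so elements may be skipped or, according to this explanation, repeated. Treat this as an assumption about the implementation; Beam's documentation describes SlidingWindows only in terms of a window size and a period.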
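The behavior of tee() itself can be checked in isolation. The sketch below is plain Python with no Beam involved, and both function names are hypothetical. It shows that each tee iterator normally yields every element exactly once, and that elements go missing only when the shared source iterator is advanced directly after the call to tee().

```python
from itertools import tee


def tee_independent(items):
    """Each iterator returned by tee() yields every element exactly once."""
    first, second = tee(iter(items), 2)
    return list(first), list(second)


def tee_with_shared_source(items):
    """Advancing the original iterator after tee() hides elements from both copies."""
    source = iter(items)
    first, second = tee(source, 2)
    next(source)  # consumes one element that neither tee iterator will ever see
    return list(first), list(second)


if __name__ == "__main__":
    print(tee_independent([1, 2, 3]))
    print(tee_with_shared_source([1, 2, 3]))
```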
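To work around this, the suggestion is to use zip() instead of tee(). Note that zip() is a Python built-in rather than part of itertools; the itertools variant is zip_longest(). zip() iterates over two or more iterables in lockstep and combines their elements into tuples, so each element is read once from its own iterable and is not duplicated.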
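As a small illustration of the difference between the two functions, the sketch below pairs each item of a list with its successor using the built-in zip(), and pads uneven inputs using itertools.zip_longest(). The function names are hypothetical, and the example runs on plain lists rather than on a PCollection.

```python
from itertools import zip_longest


def overlapping_pairs(items):
    """Pair each item with its successor; zip() stops at the shorter input."""
    return list(zip(items, items[1:]))


def padded_pairs(left, right, fill=None):
    """Pair items positionally; zip_longest() pads the shorter input with `fill`."""
    return list(zip_longest(left, right, fillvalue=fill))


if __name__ == "__main__":
    print(overlapping_pairs([1, 2, 5, 6, 7]))
    print(padded_pairs([1, 2, 3], [10]))
```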
The following code example shows the proposed fix. It imports zip_longest() and defines a custom windowing class:
from itertools import zip_longest

import apache_beam as beam
from apache_beam.coders.coders import IntervalWindowCoder
from apache_beam.transforms.window import SlidingWindows


# NOTE: this is a plain Python class, not a subclass of beam.WindowFn.
class CustomSlidingWindows(object):
    def __init__(self, size=2, offset=1):
        self.size = size      # window length
        self.offset = offset  # step between consecutive windows

    def __eq__(self, other):
        return (isinstance(other, CustomSlidingWindows) and
                self.size == other.size and
                self.offset == other.offset)

    def __repr__(self):
        return 'CustomSlidingWindows(size={}, offset={})'.format(self.size, self.offset)

    def __str__(self):
        return '{}/{}'.format(self.size, self.offset)

    def assign(self, element):
        # element[0] is treated as the element's position; returns a single (start, stop) interval.
        start = element[0] - self.size
        stop = element[0] + self.offset
        return [(start, stop)]

    def merge(self, intervals):
        # Collapse all intervals into one spanning interval.
        return [(min(x[0] for x in intervals), max(x[1] for x in intervals))]

    def expand(self, window):
        # We expand the window to support back-filling previous elements.
        # Also, in the case of overlapping windows, it might already contain
        # elements from the previous window.
        start = window[0] - self.offset
        stop = window[1]
        return [(start, stop)]

    def split(self, window):
        # Number of full sub-windows to cut out of the given window.
        idx = int(((window[1] - window[0]) / self.size) / 2)
        if idx == 0:
            return [window]
        else:
            intervals = []
            for i in range(idx):
                intervals.append((window[0] + (i * self.offset),
                                  window[0] + self.size + (i * self.offset)))
            # Keep whatever remains after the last full sub-window.
            if (idx * self.offset) < (window[1] - window[0]):
                intervals.append((window[0] + (idx * self.offset), window[1]))
            return intervals

    def get_window_start(self, window):
        return window[0]

    def get_window_size(self, window):
        return window[1] - window[0]

    def get_window_coder(self):
        return IntervalWindowCoder()

    def get_transform(self):
        # SlidingWindows takes (size, period); offset is passed as the period here.
        return SlidingWindows(self.size, self.offset)

    def get_suggested_key(self):
        # KeyParam is not defined or imported anywhere in this snippet.
        return KeyParam()

    def get_param_names(self):
        return ["size", "offset"]

    def get_defaults(self):
        return ["2", "1"]
def run():
    with beam.Pipeline() as p:
        ranges = p | "create" >> beam.Create([
            (1, "a"), (2, "b"), (5, "c"), (6, "d"), (7, "e"),
        ])
        results = ranges | "ranges" >> beam.WindowInto(
            CustomSlidingWindows(size=2, offset=1),