errgroup Package Overview in Go @Golang NYC

November 9, 2021


According to https://pkg.go.dev/golang.org/x/sync/errgroup, package errgroup provides synchronization, error propagation, and Context cancelation for groups of goroutines working on subtasks of a common task. In this lightning talk at the Golang NYC User Group Meetup, Dima Gershovich walks us through errgroup use cases so you can get the most out of it in your everyday programming.


Speakers

Dima Gershovich

Software Engineer - Xray Team

Video Transcript

all right hello again everyone uh my
name is dima
um
i’m
working in jfrog
for the last five years uh all this time
i was working as a go developer
and
actually only a couple of
weeks ago i
encountered this
package first time in my professional
career
and
it was a discovery so i wanted to share
it with everyone because i find it
useful
and i hope that you will find it useful
as well
and just some random fact about me i
like playing piano
so
that’s about me now let’s go into the
problem description
so i think you all are familiar with the
WaitGroup
WaitGroup comes with go out of
the box and it is a means of
synchronizing
spawning some workers and then waiting
for all those workers to complete
so this is a very common pattern
we just define a WaitGroup
we spawn some
workers for each worker that we spawn we
need to
add
to increment the WaitGroup
then we spawn the goroutine
and after the goroutine finishes we
mark it as complete by the call
Done
and then we
issue a call in the end that is called
Wait
and this Wait ensures that all the
goroutines are finished when this call
returns when the Wait call returns we
know that all our goroutines are
finished this is a super common pattern
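The WaitGroup pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the slides; the function name `runWorkers` and the counter it returns are assumptions added for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers (hypothetical name) spawns n goroutines and blocks until
// all of them have finished. It returns how many workers completed.
func runWorkers(n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	completed := 0

	for i := 0; i < n; i++ {
		wg.Add(1) // increment the counter before spawning the goroutine
		go func() {
			defer wg.Done() // mark this worker as complete
			mu.Lock()
			completed++
			mu.Unlock()
		}()
	}

	wg.Wait() // returns only after every worker has called Done
	return completed
}

func main() {
	fmt.Println("workers completed:", runWorkers(5))
}
```

Note that nothing in this pattern carries an error back from a worker, which is the gap the rest of the talk addresses.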
but
what is missing here
and is super
common need
what happens if
we have a function that returns an error
and we want to know and maybe fail the
whole operation if error is returned
from one of those workers
so now we need regularly
we would need to introduce some kind of
channel pass it to the worker wait on
this channel
and
we have actually something that can wrap
this for us
in a pretty nice way
it doesn’t come out of the box with
go
you need to install it as a module
but it is official from google called
errgroup so this code is kind of
symmetrical to the code that we saw
earlier
instead of the
sync WaitGroup
we define an errgroup
and then
when we spawn uh the goroutine instead
of directly calling the go
the go statement with go func
we can call instead
uh on the group the method Go and pass
it a function that can return an error
so
and then in the end we
call the Wait method and the Wait method
waits until all the goroutines are
finished
and it will return an error
if some of the goroutines return an error
it is enough that one of
those workers returns an error for
our errgroup to return an error
actually it will return the first one
the first error that we received this
will be the one that is returned and all
the others will just be
ignored
so this is how it behaves
um
but then
we want probably
to
add something else we want to say all
right but
i maybe
if i had error in one of the workers i
want to process to stop all the
processing altogether
right there is i already know that the
operation has failed it is enough that
one of the workers failed to fail
the whole operation
and i want to signal to all of them
to stop uh to stop
the processing because there is no need
to continue processing anymore the
operation has failed so how do we
approach that
so for that
we have uh
another functionality of errgroup
with an example that i’m going to present
now
uh so errgroup is actually
very very simple and small interface it
has two methods that we already are
familiar with which is the Go method and
the Wait method and one that i didn’t
introduce yet and i’m gonna introduce
now that is initialization with
context
okay
so what we did previously we just
initialized the pointer to the group but
another way would be
to call the WithContext function that
would return the group and a context
and then
um
we could
use this context to
see if it was cancelled or not
i’m going to demonstrate shortly so
what is let’s take a very very common
example
a pipeline
okay
a pipeline consists of some sender task
generator one routine that generates
tasks
and sends them to workers we have group
of workers that are processing the tasks
that the generator sends to them
task generator
generates tasks and in the end after it
completed generating all available tasks
it should signal that it finished
generating the tasks so that the workers
know that no need to wait they can exit
in case some error happened
either in the generator or in any of the
workers we would like to everything to
stop
the generator to stop all the workers to
stop and to exit
and the main process should wait for all
the routines to complete
and if some error happened it will
return an error if no error returned we
will
have a success all the processing
is completed
so this can be implemented
in regular go but
errgroup
provides a very elegant way to do it
with minimal code
so how would we approach that how
would we implement that
uh errgroup with cancellation
so we initialize errgroup like that
errgroup WithContext
we give it some context if no
context from before then we just take
the background
and then
um
it’s quite similar to the example that i
presented before
just we will have another go routine
which will be the task generator so we
call g.Go
this will be our task generator or some
method that’s called generate tasks
and i initialized here a channel on
which we the sender would be sending
all the tasks
to the tasks processor workers
and so generate task will receive two
parameters
the context
and the tasks channel
and the workers
they
all they also would have
their method task processor worker and
it will receive
as well context tasks channel and i just
send also the
number of the worker
and we wait for everything to complete
with g.Wait if any of those goroutines
either the generator or the workers
return an error
then uh
g.Wait will return with an
error
so now the question is how do we
know to stop
when we had some error how do we know to
stop
so
inside this is why we pass the context
all the go routines that are working are
being passed the context
and then let’s see both in the sender
and in the receiver what do we do in the
sender
in the sender we have some loop
for example we were looping over updates
and we generate task for updates it’s
just a pseudo code
and during the task generation we can
have an error
so if we have an error we just
stop the process and return the error
um
okay
and anyway
we um
we make sure to close the task channel
we make sure this uh closing the channel
it sends signal
to the receiver that
they can stop
processing tasks okay because it
finished task processing has finished
from the sender side the sender will not
send any tasks anymore on the channel
this is just common go practice okay to
send
sender sends on the channel and after he
completes to send on the channel he
closes this channel from his side
it should be closed only from the sender
side
so but sender also listens for
errors that can come from
the workers so how does it know
um
there is this select
and the select has two clauses
either
to send the task to the channel
or receive from context Done context Done
will be signaled
this context Done
this is actually a channel
a done channel that is closed
when any of the workers returns an
error
and
go works such that if a channel is
closed then we can receive immediately
from it so this select statement
will return when one of the case
statements is ready arbitrarily
we’re not promised any order so either
some uh
worker receives from the tasks channel
we try to send and someone received from
it
or
the done channel was closed
so if
here
some worker will return an error
the errgroup will close the done channel
and we will
exit
on the context Done and we just break
from the loop and return we break in
case of error we break from the loop
return then because of the defer the
channel will be closed the task channel
will be closed now how it looks from the
receiver side
in receiver side we have a loop
that
runs on on the tasks channel this loop
will be ended automatically when the
channel is closed
so we keep trying to receive tasks
and then how do we watch for
cancellation the same we have a select
and this select has two clauses
either done
that means that some error happens
and then we exit
or
if not if no error happened we’re just
processing the tasks that we’ve received
and
that’s uh that’s how we can
satisfy all
all the requirements of the pipeline
example
so
that was my example
very very common scenario actually
and one that is kind of tricky
to implement
and errgroup makes it quite
elegant in my opinion
if you notice any problems or mistakes
feel free to
to reach out now if you have any
questions great hey dima thank you so
much that was that was really really
great any questions uh
any questions anybody wants can take
yourself off mute and ask or you can
drop it in the chat and i’ll read it um
either one of the two would be great
no going once
going twice
actually i do have a question here this
is mihai
um can you talk a bit about the
performance of this like in some cases i
see that people are
not too happy with channels and
i’m not sure if you’ve seen any
performance issues with it or
like do you see cases where you have to
like just drop to like using atomics
synthetics
yeah
well uh it it really depends uh
like what scale of magnitude we’re
talking about
right so generally speaking
um
when we’re using uh channels uh
the
functions themselves like the
those methods that are generating and
processing tasks they just
many many orders of magnitude more the
processing takes many orders of
magnitude more than
the synchronization
that happens with the channels
so it doesn’t really matter like if
processing task takes like a hundred or
five milliseconds more time than the
synchronization mechanism i wouldn’t try
to optimize on that but if it is if the
processing is something that is super
super fast
then and it is comparable in order of
magnitude with the synchronization then
we and it is very very important like
real time i don’t know then we can start
thinking about i actually never worked
on a problem
where the processing is on the same
scale as the synchronization
does it make sense the answer yeah thank
you so much that that’s that’s what i
was kind of thinking myself thanks a lot
uh
there was a question um how do you
capture error for pipeline design
pattern using errgroup
how do i capture error for pipeline
design
what do you mean
can you uh can you clarify that some
more renee
um how is the error captured
the Wait returns the first the Wait
returns the first error returned
that’s right we can look also in the
internal implementation it’s really
really simple the internal
implementation i encourage i encourage
you all just to look at it
um we can just go
i’m sorry for jumping here
where is our errgroup
here
so actually what happens behind the
scenes is that there is this errOnce
mechanism and we save the first error
that we received
so this method
when some error is returned
we just save it internally like that
and it will happen only once for the
first error
that’s actually very short and sweet
implementation
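The "save only the first error" idea can be illustrated with `sync.Once` from the standard library. The sketch below is a simplified stand-in written for this transcript, not a copy of the errgroup source; the type `firstError`, its `record` method, and the two error values are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical errors used only in this example.
var (
	errFirst  = errors.New("first failure")
	errSecond = errors.New("second failure")
)

// firstError keeps only the first non-nil error reported to it.
type firstError struct {
	once sync.Once
	err  error
}

// record stores err if it is the first non-nil error seen.
// Later errors are silently dropped.
func (f *firstError) record(err error) {
	if err == nil {
		return
	}
	f.once.Do(func() { f.err = err })
}

func main() {
	var f firstError
	f.record(nil)       // ignored: not an error
	f.record(errFirst)  // stored
	f.record(errSecond) // dropped: an error was already stored
	fmt.Println("stored error:", f.err)
}
```

In a concurrent setting the stored error should be read only after all goroutines have finished, which is what waiting on the group guarantees.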
yeah we’ll also also exit immediately so
all the other go routines uh this is
actually you don’t have to
do the context cancellation pattern in
order for this to occur
uh even without invoking the errgroup
WithContext the very first error
will close the group
now um
the methods are running they should
know when to stop
the all the goroutines are not finished
they should finish by themselves so they
should actively monitor this context
the context Done to know when to stop
because otherwise nothing terminates
them externally they they should know by
themselves that they need to stop
working of course we could just let them
work until and be finished
at least
if the channel would be closed we could
just uh do it in this example only from
the sender when the sender knows that he
is complete
um
then he would close the channel but what
could happen is that he already
completed sent all the messages for
example if we would have channel that he
has some buffer we could have a
situation when the sender is already
completed but the receivers are still
in processing and maybe those are things
that are taking time
and there is something in the buffer and
we want them all to and simultaneously
so i think like the most robust solution
that makes no assumption is both in the
sender and in the receiver to uh watch
for this uh channel
before it is done
great thank you very much we may have
time for one more question does anyone
have one more question
okay
dima thank you again so much
really
really really uh pleased that you uh
stayed up very late for us and it was a
really enjoyable and educational talk
for all of us so thank you again and uh
look forward to uh
looking forward to seeing you again soon
welcome thank you
um without any uh i’m gonna go ahead and
drop all those links that i mentioned to
you first i’ll drop the jprog link
followed by all the birthday activities
for the week it’s a big big long text
so feel free to
check it out explore etc
um and i at this point i’m gonna also
welcome um amit uh who’s gonna be
sharing with us
um and he’ll he’ll give you some more
details about his background but very
excited about the topic this evening and
uh amit mishra director of
engineering from fox is gonna share with
us uh about uh
uh some very very exciting um a very
exciting talk that he actually gave over
at go west uh earlier um not too long
ago actually that was only a couple
weeks ago wasn’t it amit
yeah i think uh two weeks before yeah
yeah
so this is a good so this is a good
fresh talk i’m assuming this hasn’t been
done at too many meetups since then so
uh we’re really really pleased to have
you on but thank you for coming
thank you thank you so much for the
opportunity and uh wish you all um you
know happy birthday to golang the whole
golang community and nice meeting you
guys uh let me share my screen real
quick um
are you able to see my screen
cool
yeah um
as already mentioned that yeah i’m amit
mishra um today i’m going to deliver the
talk on building a scalable api platform
using um golang
um
well what is the motivation behind uh
this talk basically you know like as a
director of engineering at fox me and my
team is responsible for
uh building and
kind of taking care of
an api platform
which helps to serve live and on demand
contents to millions of concurrent users
um and as as you understand um serving
live content uh has its own challenges
because of dynamic nature of the users
uh their locations and the content
itself
um our api platform helps to serve the
contents for the events like super bowl
uh nfl
wwe pvc or um you know um
lots of fox entertainment related uh
shows um as part of this particular talk
i’m not going to cover the whole api
platform i’m just going to cover a very
interesting portion of our api platform
which is the playback uh portion of it
um
so yeah let’s dive into real quick uh
the agenda for this talk um first of all
i’m going to go over um uh and explain
what is playback api platform for us
and then we will and we will understand
the uh challenges and we’ll see why did
we think that we had to rebuild it and
why did we choose go you know uh to rebuild it
um also how exactly we uh rebuilt it um
and then we will go over uh some couple
of results
right um
let’s look into what is api uh playback
api platform so for for us at fox
playback api platform is basically um a
group of back-end services which uh of
course helps to serve the live and vod
contents um uh to various fox products
um but at the same time uh these
back-end services helps to generate um
actual playback url for the users which
basically um serve the actual video for
any user so that that’s a very uh
critical portion of these services uh
apart from that um other critical
activities which the service these
services performs are like collecting
analytics
for the from the clients and based on
that you know taking um
actions uh
um apart from that um there are like um
other things which uh these api server
like um couple of metadata information
normal ui decoration like title
description and all those kind of kind
of things
um
once you play any video as part of the
video um
the online video experience the most
critical portion is the serving the ad
experience also um and that becomes very
uh important for us because it uh it is
directly related to our revenue so that
is one of the thing um all critical
activity these services performs
um let’s go over a very high level
architecture for this particular uh uh
existing legacy system so legacy system
um used to be a kind of
monolithic uh
monolithic services um
written based on node.js it had
couple of internal modules and each
internal module uh had very complicated
business logic because you can imagine
uh in uh when you are serving any live
content to a
particular user it depends on lots of
various factors the factors like okay if
user is subscribed or user is
unsubscribed versus user is
which location user is based on um and
similarly you know the content itself um
is available for that particular region
or not right so these are like a couple
of examples where these things becomes
more complicated and because of that um
these um
decisions becomes um more difficult to
make
especially at the back-end side
and apart from that um as i said like it
didn’t have just one
internal module it had like 20 to 30
internal modules and these internal modules
were like talking to one data store like
something like elasticsearch or dynamodb
um but at the same time we also had
dependency on the external services um
where we are basically depending on
streaming url kind of systems or ad
kind of systems you know um
so um what were the challenges with this
uh particular uh
uh
you know um architecture so as you could
see this whole thing was very tightly
coupled and monolithic right and
because of this nature um
it was very hard for us to scale uh this
whole system um especially um after uh
like in since last one year or so um we
we had a very different traffic pattern
as the digital streaming viewership is
increasing and the traffic pattern in
some of the cases are very unpredictable
uh and in order to um basically uh
fulfill those uh traffic pattern we need
to have uh a head we need we need to
have a way to auto scale these things uh
in such a manner so that we can you know
serve all these users
other challenges are like external
dependent services without any fail
forward strategy so as you could see in
this architecture um all these external
services were directly connected right
so if something fails from the external
services point of view
we
had to fail the whole user experience
which was not a pretty good experience
from the user point of view so we were
uh totally depending on the mercy
of the external services in this case
like you know
and we did not have any um
control that how do we control uh if
something is happening
from
their failure point of view um the other
challenges were like as i said like we
had dependency on the single uh
back-end data store
and the data stores like elasticsearch
and all are very difficult to you know
scale on on demand uh you require lots
of uh preparation before we could do
that so um that is something
we wanted to you know um
basically solve as a problem for us
um there were other challenges where
like uh we did not have any caching in
most of those uh scenarios and initially
this was done because of the dynamic
nature of the live content itself
because um
contents are live users are you know
also
coming from the different different
locations so caching those kind of
contents are very hard in advance so
that is something we wanted to explore
that and how exactly we can you know um
handle these kind of problems
other things was like you know hard to
release any feature since this whole
back-end services service platform was
being used across different different
fox products um
uh releasing anything um came up with
very high amount of the risk because um
even a single change might have any side
effect and it might break
some
different experience onto some different
product so we wanted to have some some
way um to start releasing our features
faster
other other the biggest challenge we had
was the that
once even if we wanted to
you know build a new api platform how do
we build this api platform and how do we
migrate to this new api platform without
any downtime because as i said like all
these are live events we have every week
we have couple of events scheduled like
it starts from thursday um friday
saturdays and then right now there’s nfl
events going on um as you guys are aware
and uh so because of that nature um we
still needed to
have the ability to migrate to the new
api platform without disturbing our
current operational mode so that was a
pretty interesting um
problem to handle um when we were
building this api platform
other other kind of goal we wanted to
have was
like we wanted to have our api platform
by default uh supporting to you know at
least um
millions of concurrent users uh with uh
without any liveops or you know um any
on-call kind of support because um as i
said like after in last couple of years
our digital viewership is increasing
significantly so um because of the
increased amount of the traffic on the
normal day we cannot expect all the
folks sitting and you know supporting
these kind of things so that that is
what we wanted to have our default
platform um should have the capability
to support at least millions of
concurrent users similarly for the super
bowl kind of event um we wanted to have
the ability to have at least 10 millions
of concurrent users um so that
you know
we can fulfill the requirement from the
user point of view
other goals were
which were pretty important for us for
like
handling the scenarios like thundering
herd um and uh a thundering herd kind of
scenario can happen in some of the
normal cases like ad breaks or some of
the you know problems where app crash
happens in case of app crash um imagine
you have millions of concurrent users uh
watching the
stream uh or watching a particular event
and suddenly if your app crashes then as
an user you are gonna retry a couple of
times to get into the streams right and
when that happens
all these users will send the unexpected
traffic to your backend system and uh
and and if this happens this in some
some of the cases this kind of traffic
becomes very unmanageable
and uh so we wanted to have some
innovative ways to handle this
this kind of traffic pattern and the
same applies for the ad breaks uh when
when there is a event going on and we
are in mid of the event and suddenly if
there is a
ad marker
and there is a ad break happening then
all the users and all the platforms are
going to hit our backend system which
serves the ad at the same time right so
in that case um and and these ads are so
dynamic in nature because of the um
users uh locations and user subscription
profiles um
caching those kind of details is very
hard in advance so those are
um very dynamic right so we have to
serve those and that also kind of
creates a unique problem of thundering herd
um
and other goals we wanted to handle was
like you know um how do we actually
support the game extensions for any kind
of live event because whenever there is
an event going on there is there are
higher chances that that event might be
you know um extended so the way um uh in
digital uh streaming world we basically
sell these events like we have we
allocate the slots in advance okay this
we just predict like okay this
particular event is going to happen for
these many hours okay and after that we
have this particular program is
scheduled so all these things are
decided in advance and uh you know are
kind of like served to the users but uh
if the live event uh is extended then
that whole scheduling kind of like you
know um uh disturbed and we have to
reconfigure those things and if that
happens then the whole backend system is
kind of
reset itself and
so that um it can serve um or continue
serving those live event but at the same
time because of that it might also
create thundering herd kind of problem
because your whole system is going to
reset even if it was cached right
other future goal was like you know as i
as i mentioned that we wanted to have
some ability to have at least a graceful
degradation as i said like if any
external services are failing or
internal services are failing um then we
wanted to have some ability to monitor
those and take some corrective actions
and then at least implement some kind of
fail forward scenario so that um your
user experience is not interrupted um
and and um
in line to that one we wanted to have
the ability to support uh robust
monitoring and end-to-end uh you know um
logging or observability system so that
we can uh see what’s going on with our
system um and of course um
we wanted to have our platform you know
resilient for
common failures like infrastructure
outage or third party system issues and
all all those kind of things
but how exactly um we handle those
problems one of the thing um i want to
mention that while um
while looking into those challenges um
we uh
we basically made a conscious choice of
choosing go
because of its
simple nature um you know most of our
developers were coming from the java or
the node.js background and
none of us had you know kind of like
experience it was like seven or seven
years before um but when we started
doing small poc based on the golang we
got lots of confidence and we wrote like
our first first service like for account
system login service and we deployed it
and
once we
deployed that in production we never had
to look back um and our performance
increased um for that system
significantly and that was the
experience we were carrying while
thinking about rebuilding even playback
platform
and
yeah so let’s look into how exactly we
tackled all these challenges so um
as i said like um
in the beginning uh
our legacy system was very tightly
coupled it was kind of a monolithic in
its nature so one of the thing we
decided okay let’s start thinking about
breaking this whole system uh as true
micro services right um and uh
and the criteria which we decided was
well like very obvious like okay first
thing okay what are how many external
services we were depending on right
secondly i think um take how many
internal services we had um which we
cannot migrate uh to golang but can we
do something around those services um so
that we can still uh
implement fail forward kind of
scenarios right and the third and the
most important aspect was like how do we
um uh you know handle the dependency on
the data stores like elasticsearch and
dynamodb so we basically
broke those kind of pieces into its own
microservices and
so that they are just having dependency
on the elasticsearch and dynamodb
instead of like having external and
internal legacy services
so one of the um
one of the thing um as part of the true
microservices when we thought about was
like okay um uh that how exactly we
migrate even if we
even if even if we wanted to
um
build a new system how exactly we
migrate so that that was the strategy um
i wanted to present here so here is the
like one of the examples state here like
initially like this is how the very
simplistic view looks like from the
client point of view we have some api
gateway endpoint and there will be some
legacy back-end uh playback uh service um
not to initially start with
and uh what we did what we started doing
in our system first was like we started
introducing uh golang based proxy on top
of all these um uh you know legacy’s
services so that this particular pattern
uh
is a pretty strong pattern uh you know
like it provided us the lots of
flexibility it opened uh the doors for
lots of opportunity in background um in
this case we did not make any changes
the kind of job proxy was doing it was
just taking the same request it was
taking the same response uh to the
client it was giving back the same
response to the clients um
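A pass-through proxy of the kind described here, one that forwards the same request and returns the same response, can be sketched with the standard library's `httputil.NewSingleHostReverseProxy`. This is only an illustration under assumptions: the talk does not show the actual proxy code, and `newPassThroughProxy`, `fetchViaProxy`, and the stand-in legacy handler are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newPassThroughProxy returns a handler that forwards every request to
// the legacy backend and relays its response unchanged.
func newPassThroughProxy(legacyURL string) (http.Handler, error) {
	target, err := url.Parse(legacyURL)
	if err != nil {
		return nil, err
	}
	return httputil.NewSingleHostReverseProxy(target), nil
}

// fetchViaProxy starts a stand-in legacy service and a proxy in front of
// it on local test servers, then requests path through the proxy.
func fetchViaProxy(path string) (string, error) {
	legacy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "legacy:%s", r.URL.Path) // stand-in for the legacy playback service
	}))
	defer legacy.Close()

	proxy, err := newPassThroughProxy(legacy.URL)
	if err != nil {
		return "", err
	}
	front := httptest.NewServer(proxy)
	defer front.Close()

	resp, err := http.Get(front.URL + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchViaProxy("/playback")
	fmt.Println(body, err)
}
```

Once such a proxy sits in front of the legacy service, routing decisions, feature flags, and failure handling can be added in the proxy without changing the clients.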
but at the same time it uh gave us lots
of flexibility which i’m going to
explain you in
a couple of next slide and once we had
the proxy um
what we started doing we started making
our services in new services in parallel
to legacy services and we started
swapping those components one by one
into its own microservices and
once when we were doing that we were
able to do the comparative testing also
because we had um the whole um you know
uh uh production data uh and the live
data coming in as part of our unlocked
so that we could basically grab those
things and then we can you know uh
simulate the test and see what was the
behavior with the new um playback
services and once we had that we kind of
like you know um implemented uh a
feature flagging kind of you know
concept there where uh
we deployed our services to production
it is live but we still kind of like
saved us like we still put together put
put this to behind a feature flag so
that if there is any issue going on or
something happened you know we can
turn on and turn off uh based on what
kind of behavior we are seeing and once
we had that and once we had that con
confidence okay everything is working
fine then we were able to remove on the
proxy we were able to able to remove
feature flag and we had our new playback
service in on production and without any
downtime um so that was like one of the
innovative ways we came up with um in
order to
deploy our services and this is very
established pattern at fox um we we are
still in process of um migrating our
whole platform to new golang based
platform
and this is the pattern we are basically
following right now
um another pattern i would like to talk
about is like circuit breaker pattern um
as i as i was explaining that um
as part of the first proxy proxy thing
uh
um it gave us lots of flexibility and
the proxy gave us lots of flex
flexibility because uh because of the
circuit breaker pattern so whenever we
implemented a golang based proxy we
also implemented a circuit breaker
pattern with those uh with those proxies
and as soon as um we implemented circuit
breaker pattern uh basically it gave us
the flexibility to fail forward
we could basically um
we could see we could track all the
responses from the um
external services or the you know older
services and track the uh error codes
and
fail based on that and
probably provide you know um on the
fail forward uh response
so that is what we did this is one of
the
circuit breaker config um uh
we have uh so currently like uh as part
of the implementation we use uh hystrix
go package um um and uh this is the
config it looks like like as part of the
circuit breaker pattern um
we basically had uh you know couple of
things con in a configurable mode like
if we want to enable the tracing tracing
basically gives the gives us the full
you know dump of the logs in case we
want to debug something um we could
also we like um configure like error
threshold where we want to see okay
circuit should start a circuit should
open uh after certain percentage of
the errors so by default we kind of like
do like 20 to 25 percent so if uh uh like if 25
percent of requests are failing our circuit will
be open
and if that happens then we
automatically go to the fail forward
kind of scenario um as i said like we
also needed to have the capability to
understand if it is external services or
you know internal services so we had
that flag and configurable um similarly
at the circuit breaker pattern level
itself we kind of like uh implemented
max concurrent request count so that um
we are not sending you know um
unexpected amount of the traffic we were
able to control the traffic at our uh
you know caller services level itself
um and the same like request volume
threshold also like you know um
complement to the same kind of feature
uh and then there are other features
like timeout like in
in case external services are very slow
to respond then we needed
some way to you know um time out those
services and then based on the timeout
capture those errors and then basically
take the corrective action um for these
services
um as i said um
because of the simple nature
of golang we were able to
implement most of our circuit
breaker code in such a manner that as a
developer once we had that implemented
we did not have to do a lot of work when
we were calling those things
from other services as part of the
caller services as you can see like
whenever we wanted to integrate any
integrate two services this is what we
had to do we literally had to like you
know implement our um
circuit breaker
struct and then
pass a couple of parameters
and
you know capture the response and send
it back that’s it so as a dev uh it
became very easy whenever we wanted to
integrate with any services and as part
of uh circuit breaker configuration as i
said like we also had the options to you
know um uh basically configure uh like
things like retry like in case there is
a failure happen do we want to retry
that service or not like you know
like for example in timeout let’s retry
something like that right so um it gave
us lots of flexibility from the um
implementation and the development point
of view um but at the same time it also
opened the doors for us to monitor our
system um
in a very flexible manner this is one of
the circuit breaker based monitoring now
you can see like um we had bunch of um
like as part of one services uh service
we had a couple of other services which
which are being called dependent
services and uh this is how we were
tracking okay these are the durations
you know this much time it took for us
to basically call those services
we were also able to track
hystrix errors or hystrix
timeouts and the status of the
circuit itself whether it is open or
closed
and based on that our ops team can take
the corrective action
once we had that implemented we
started thinking about
caching
and
as i said like implementing and caching
uh in live environment is kind of very
tricky right but we still um made a
conscious choice of start caching a
couple of things uh just for few seconds
um not for like longer period but that
few second of caching itself gave us uh
so much scalability for
our services and
a couple of features we implemented
using golang based caching were a leaky
bucket kind of caching and a stale cache
leaky bucket basically helped our
cache to keep warm all the time you know
without any manual effort or without
running any background job
for
warming up the cache um stale cache
basically helped us to serve the content
in case of any external services failure
so um
based on the previous response we could
cache those things from the external
services um of course we cannot cache
everything but we were very selective
about okay what can be cached and what
cannot be cached based on that
we could cache some of the content
which is cached for a longer period of time
and we could serve those things in case
of um you know any major failures
um this is one of the examples we
implemented so as part of the
caching it’s like a normal pattern
people try to implement
we get a bunch of video ids
from the request we try to
go and get all the things in batches
from the cache
and then if something is missing then
we go
and
get the individual ids and data from the
origin and again cache those back
in the background basically asynchronously
um
so this is uh just the same thing um
happening here um
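The flow just described (batch read from the cache, fetch whatever is missing one id at a time, then write the results back in the background) can be sketched as follows. The slide code is not in the transcript, so `Cache`, `GetMany`, and `LoadVideos` are hypothetical names, and the sketch assumes the missing ids are fetched from the origin rather than from the cache.

```go
package main

import "sync"

// Cache is a tiny in-memory stand-in for the real cache.
type Cache struct {
	mu   sync.RWMutex
	data map[string]string
}

func NewCache() *Cache { return &Cache{data: make(map[string]string)} }

// GetMany returns the ids found in the cache and the ids that are missing.
func (c *Cache) GetMany(ids []string) (map[string]string, []string) {
	hits := make(map[string]string)
	var misses []string
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, id := range ids {
		if v, ok := c.data[id]; ok {
			hits[id] = v
		} else {
			misses = append(misses, id)
		}
	}
	return hits, misses
}

func (c *Cache) Set(id, v string) {
	c.mu.Lock()
	c.data[id] = v
	c.mu.Unlock()
}

// LoadVideos reads ids in one batch, fetches the misses individually, and
// writes them back asynchronously. The returned channel is closed once the
// background write-back has finished.
func LoadVideos(c *Cache, ids []string, fetch func(id string) (string, error)) (map[string]string, <-chan struct{}) {
	found, missing := c.GetMany(ids)
	fetched := make(map[string]string)
	for _, id := range missing {
		v, err := fetch(id)
		if err != nil {
			continue // skip ids the origin cannot serve
		}
		found[id] = v
		fetched[id] = v
	}
	done := make(chan struct{})
	go func() {
		defer close(done)
		for id, v := range fetched {
			c.Set(id, v)
		}
	}()
	return found, done
}
```

The response does not wait for the write-back, so request latency depends only on the cache read and the origin fetches for the missing ids.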
one of the very interesting
packages we found as part of the golang
libraries was singleflight it basically
helped
us to take care of the
thundering herd kind of problem um
as as you can see like um or understand
like a single flight is basically
just a you know a pattern uh which
basically helps you to suppress all the
duplicate requests so in case of like i
i was talking about the ad break kind of
example in case of all the clients who
are sending
similar kind of ad requests we were able
to identify all those duplicate requests
and instead of sending all those
duplicate requests to um
origin
through the single flight implementation
we sent only one request to origin uh
got the data we cached the data and then
served the rest of the requests from
the cache itself so that kind of
like you know um
provided us so much flexibility
when it came to having a dependency
like elasticsearch or a single data store
itself because now we were able to
serve most of the content from the
cache itself instead of sending all
the requests to the
data store itself
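To make the duplicate suppression concrete, here is a hand-rolled sketch of the idea using only the standard library. It is a simplified stand-in for `golang.org/x/sync/singleflight`, not that package's code: the real `Group.Do` works with `interface{}` values and returns `(v, err, shared)`, while this version is string-typed and orders its results differently.

```go
package main

import "sync"

// call tracks one in-flight request for a key.
type call struct {
	wg   sync.WaitGroup
	val  string
	err  error
	dups int // number of callers that joined this in-flight call
}

// Group suppresses duplicate work: concurrent callers with the same key share
// a single execution of fn.
type Group struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn once per key at a time. Callers that arrive while a call for the
// same key is in flight wait for it and receive its result.
func (g *Group) Do(key string, fn func() (string, error)) (string, bool, error) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		c.dups++
		g.mu.Unlock()
		c.wg.Wait() // wait for the leader to finish
		return c.val, true, c.err
	}
	c := new(call)
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val, c.err = fn() // only the first caller reaches the origin
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	shared := c.dups > 0
	g.mu.Unlock()
	return c.val, shared, c.err
}
```

In the ad request example from the talk, the key would identify the ad break, so many clients asking for the same break would trigger one origin request; the result could then be cached for later requests.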
um other pattern um which were very
useful was like api gateway um
throttling so as i explained
i was explaining the app crash kind of
scenario those kind of scenario and the
kind of traffic you start getting is
very unpredictable right so for those
kind of things um we made a conscious
choice of uh you know um
disabling the auto retry first of all
and then let the uh
let the client initiate um uh or let the
client retry uh for the playback and if
uh we were getting lots of requests we
basically kind of like um throttle those
requests because um
our api has
infinite
scaling capabilities right now
but at the same time and the truth is
like all all other external services
wouldn’t have those kind of capabilities
so in order to do so we had to protect
ourselves in some of the cases this does
not happen in all the time but for very
high um traffic event we had to have
this kind of
things implemented
as i said um we kind of like you know
disable the auto retry for some of the
major
events and dependent on the
client uh related retry
and
not only that as i was explaining
we have a couple of modes like we have
events like the super bowl and we have
events like you know nfl wwe which
are the normal like every weekend
kind of scenarios right so for those
kind of things we wanted to have some
capability okay
how do we prepare ourselves for partial
versus major failures right so
this is where we came up with like a
defcon mode kind of
system where defcon
5 was like normal behavior um
you know where things are running
absolutely fine um but if in case
anything is failing then we had
the feature flags implemented um where
we can just turn it on and there will be
some degradation of the features but
your user experience from the streaming
point of view will still be fine and
as a user you can still watch the
event without any problem um and and in
those cases whenever we turn on the
splits and whenever we implemented this
kind of defcon mode these did not uh
require any kind of restart from the
backend services point of view it was
very seamless from the client point of
view um you wouldn’t even know that you
are running on a degraded uh you know
feature system
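A level-based feature switch that can be changed at runtime without a restart could be sketched like this. The level numbers follow the talk (5 is normal), but the `Defcon` type and the feature thresholds are hypothetical; in the talk the flags are driven by a feature flag system rather than an in-process setter.

```go
package main

import "sync/atomic"

// Hypothetical features and the minimum defcon level at which each stays on.
const (
	FeaturePlayback        int32 = 1 // core streaming stays on at every level
	FeaturePersonalization int32 = 3
	FeatureRecommendations int32 = 4
)

// Defcon holds the current degradation level. Level 5 is normal operation;
// lower numbers switch off more optional features.
type Defcon struct {
	level int32
}

func NewDefcon() *Defcon { return &Defcon{level: 5} }

// Set changes the level atomically, so request handlers see the new value
// on their next check without any restart.
func (d *Defcon) Set(level int32) { atomic.StoreInt32(&d.level, level) }

// Level returns the current level.
func (d *Defcon) Level() int32 { return atomic.LoadInt32(&d.level) }

// Enabled reports whether a feature that needs at least minLevel is on.
func (d *Defcon) Enabled(minLevel int32) bool { return d.Level() >= minLevel }
```

A handler would check `Enabled` before doing optional work and skip it when the level is too low, so playback keeps working while secondary features are shed.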
so that was pretty cool um from you know
from the failures point of view when it
came to handle those
um
what were the results after we
implemented all of these
like as i said we have an infinitely
scalable playback platform today
and we have very very fast services you
know like as you can see this is just
one example but there are like tons of
examples where
the
node.js services performance was kind of
like not that great um but
as soon as we
went to golang
we got like you know um
lots of improvement
and because of this what happened is
our number of ecs containers
for the services was reduced to 50
percent the way golang manages
resources helps us save a lot of
cost
and our overall spending on aws
itself was reduced by
like 30 to 40 percent we are
already seeing those results and
this is one of the reasons we are
now going towards golang more and
more
as i said like
implementing and releasing
features
faster was very critical to us
and after moving to the golang based
microservices approach we are able to
release our features you know 100
times faster every day
we have at least you know um live
deployment going on without breaking
anything um
and it does not require any kind of
interruption or any kind of like setup
of war room people can just go to slack
they can um
hit the deployment when as soon as their
feature is uh basically tested um and
verified and approved um they can go
ahead and deploy it um and it does not
affect anything uh in that case so
that’s pretty that’s a pretty big
flexibility we wanted to have and now we
have that one um similarly um as i said
like we have better tooling to monitor
apis end to end and currently like
super bowl 2020 we served in 4k and lots
of even current nfl events we are serving
in 4k and
this whole refactoring onto golang
based services provided us those
capabilities to add these kind of
features you know very fast and
deploy and test those out
yeah and at the end as i said like we
have a very highly successful team
this is a small moment from super
bowl 2020
yeah and uh yeah if if you guys or
anyone is interested on joining our team
please uh feel free to reach out to me
we are actively hiring for various roles
so um
please yeah reach out to me also um
if you guys would love to read
more cool stuff about what
we are doing at fox we have a medium
publication set up fox tech you will
find lots of good technical
implementations using golang in there
so feel free to you know go over
those and if you have any question
please feel free to reach out to me
thank you
thank you so much ahmet that was
awesome you can unmute i
accidentally muted you and i tried to
unmute you
um but uh i know uh we’d love to have
some questions i know one person asked a
question in the chat uh do you write
did you write an api gateway or did you
use an open source or third party
component for that
we use uh aws api gateway
oh okay great yeah
uh
let’s see bill has a question as well um
any idea how much of the performance
improvement was due to the node
to go change versus other
improvements like caching
yeah so node to golang
we saw a
big performance boost as i said like
just
we had a normal node.js service running
um and as soon as we added just the
golang-based proxy services that itself
gave us so much performance boost
without doing anything else um we were
also puzzled about how it happened but
then we figured out the way golang was
managing the resources internally cpu
utilization and memory management was
pretty awesome and that is where we
saved lots of things so definitely
compared to node we started seeing those
um
performance boost
um and on top of that we when we started
adding other other patterns it basically
gave us more flexibility and
boost great
any other questions anyone want to take
themselves off mute and ask ahmed any
questions i thought it was a really
interesting uh you know uh definitely
def con zero in my house if uh the super
bowl is on i’ll tell you that and
there’s a failure that would be uh that
would be horrendous and so it’s really
interesting to see how things are going
on behind the scenes uh especially with
things that are very uh you know very
much in uh
very much in our everyday lives
especially with covid and all the
streaming that we did uh it’s great to
see the technological advances and the
architecture that you and your team put
together at obviously a major uh live
streaming uh
uh outfit for sure
why did you choose ecs over the
other container orchestration options
yeah ecs um this is fargate which
itself is managed by aws that was one
of the things um and
basically our whole
platform is on aws we are not a
multi-cloud
system everything is on aws so that
was kind of like the obvious choice for
us you know um and so we used docker
to basically build the
images and push those through ecr
and fargate
so
um it basically fits very well with the
aws ecosystem that was one of the reason
you know we went with that one
we don’t have to manage lots of
things on our own in a manual manner
when it comes to ecs
okay great
any other questions um anybody want to
come off mute and ask something for
ahmed
yeah
thank you for the talk i mean hi here
i’m really curious since you’re saying
that you’re so closely tied into aws i
assume all your stack is running there
um what does it look like for you to
test things like how do you do
integration tests you spin up
infrastructure in aws and wait for that
to run the tests or how do you do that i
would imagine that’s kind of slow that’s
why i’m asking
yeah so we um
we have our own load test kind of
environment set up
based on aws but we use k6
to write our load scripts and
all so we have the whole system and in
order to run uh run our load test we
basically plan those out in advance and
test it based on that one um because uh
based on the nature of the event we know
we kind of like predict what kind of
traffic we might be expecting so we
basically add more padding on like 10x
or 20x on top of it and then that’s how
we kind of like spin up and test it
well thank you and and can you can you
talk a bit about how much time it takes
for you to spin up that environment like
is that not kind of annoying for people
to to wait for all that infrastructure
to spin up in aws
um
honestly it’s not a big headache because
um for the load
load thing we have a pretty much
established process we kind of like you
know schedule those in advance you know
um
so um it takes like you know maybe half
an hour to one hour but those things are
already set up in advance by our ops
teams you know so that all the engineers
don’t have to sit down and wait
for this environment to come up also these
environments are pretty temporary in
nature as soon as your load tests are
done those are destroyed so that we
can save some costs we are not just
running by default in those kind of modes
okay thanks for that like one hour
sounds doable i’ve been in the past in
organization where it took like a couple
of hours and it was like a nightmare
uh no we i don’t think we have we i
think when we were setting up initially
like you know these kind of things we
had those challenges but currently the
process is pretty smooth it doesn’t take
like you know more than i would say
couple of hours like you know one or two
hours but i i’ve never noticed more than
that
if it will be then yeah that’s a problem
i agree with you
and do you deploy with terraform or
cloudformation we use terraform so as part of
our whole micro services skeleton which
we
wrote in golang
we have the terraform modules uh
implemented so as a dev when you’re
writing your services um you can
basically uh write your own
infrastructure let’s say you want to spin up
your own dynamo or you know you want to
configure sns or sqs you have
full control as a dev to build those
things as part of your services itself
cool that sounds great thank you
welcome thank you we have we have
another question here uh from bill
curious if you know how much your
productivity improved moving from node
to go
also um by allowing much easier use of
better pattern um by circuit breaker etc
does that make sense
yeah totally makes sense so yeah that’s
one of the
strongest points golang provided us
like the
the speed of development like uh
initially um writing one microservices
uh used to take us at least two sprint
or two weeks today like if it is a
simple micro services we can uh spin up
a production ready microservices within
couple of hours and i would give an
example of super bowl 2020 whenever we
were writing those micro services we
were literally like finding out all the
bottlenecks during our testing on the fly
and we were writing those proxy services
within two hours of time and we were
deploying those in production that was
the kind of speed we we were able to
achieve so normal you know
normal service or product
development time has decreased at least
like 80 to 90 percent after moving to
golang based services
and what was the other uh question on
that one sorry maybe i
oh yeah the other patterns um
i would say um
we are exploring uh in this one apart
from the caching is like more innovative
ways of caching it’s uh caching the
things like russian doll kind of caching
or those kind of pattern we are
exploring right now um
also um we are also exploring the
options of you know um
implementing on the multi-region kind of
capability for ourselves so that we
stop depending on a single region kind of
thing you know um so that’s one of the
things we are exploring uh for sure also
um as i said um
we are heavily depending on
elasticsearch and a couple of months
ago aws launched
features for replicating elasticsearch
data so now we are going to basically
implement those kind of features also
where we can start replicating our data
live you know so that it kind
of gives us more
scaling power
okay great there’s a question regarding
the monitoring tools uh can you please
describe what monitoring tools and
products you use did you embed them in
your code
yeah so the monitoring tools we use are
grafana and statsd so statsd
is part of our
golang microservice skeleton
itself
we have implemented the statsd
package so whenever we want to
send any metrics we just basically
you know as a dev you just have to
call those dependencies pass
what kind of metrics you
want to track and send it across and the
same applies for the logging also um so
from the stack point of view we use
grafana for the visual point of view but
internally it’s basically an influxdb
based system with
statsd and telegraf
great
um
what is the best way to learn more about
the roles you’re hiring for
yeah definitely let me share the link
here as part of the chat uh you know
so that um
that’s great
yeah here um personally me and my
team are hiring for software
engineer senior software engineer and
software manager and senior software
manager roles and we are actively
hiring
we need
so many people so please feel free to
reach out to us you know i would love to
work with you guys
great
awesome well we really appreciate the
talk tonight ahmet and uh for dima who i
know signed off a little earlier because
of the time in israel but uh a couple
things i did drop the jfrog raffle
link one more time for the switch
lite for anyone that is interested
i’d like to see about having a meet up
in december i hate to miss a month it
just seems like it’s uh i don’t know
it’s we’ve been going so strong but at
the same time you know it is it is the
holidays and uh
you know having a meet-up just for the
sake of having a meet-up is not
necessarily great too so we’ll see if we
can get some good content for the month
of december to uh and i would love for
it to be somebody that’s here tonight so
please let me know if that’s something
that you would like to do
and uh our as far as uh we had two
people ask for uh i’m not gonna spin the
wheel because two people asked for a uh
jetbrains license tonight and they are
our winners that’s bill bartlett and
also
we had been asked by
vinaya um i will
uh i will go ahead and send those out to
you
um hopefully this evening um you’ll get
those go ahead and activate those as
soon as possible because based on your
activations jetbrains continues to send
them to us so i love being able to give
those away
and trying to think if there’s anything
else this evening i think we’re good
lots of thank yous in the chat and uh
does anyone else
have any announcements or anything to
share before we head out
hey addy have you ever had a talk on
benthos the
stream processor i don’t think so is
that something you’d like to share on i
could i mean i’m gonna talk about it at
the golang north northeastern meetup but
if you’re okay to have a duplicate of
that then i could try to do that
i don’t think to my knowledge there’s
not a lot of crossover in that uh so
yeah i would love to if you want to go
ahead and forward me an abstract and
everything uh are you on the uh slack
channel for go
oh yeah yeah yeah me too i’m always
you’ll you’ll always see me on there so
go ahead and slack that over and let’s
let’s let’s talk about that for next
month that would be awesome yeah
probably yeah around december sometime
yeah december 25th what do you think is
that a good day
as long as we have some audience
absolutely now we’ll figure we’ll figure
out a good time but that’s great thank
you so much for that yeah you don’t have
to go through the call for papers
process that’s out there in case uh just
in case but uh you can always reach out
to me directly and uh
you can chat about that i mean that
might be of interest to ahmet here i’m
not sure if he looked at it but uh if
you’re curious about data streaming and
like using a stream processor uh
benthos.dev is something that might
interest you
very
cool yeah definitely would love to hear
about it
i’ll put the link in the chat
awesome that’s great well i really
appreciate it uh anybody else have
anything else tonight
awesome well thanks again ahmet and uh
everyone have a wonderful thanksgiving
holiday and we’ll see you next month
have a good day good night