Re: Questions re: LF Edge Shared Infrastructure Proposal

Andrew Grimberg
 

Greetings folks,

I know it's been a month since this was given to me. I've been trying to
wrangle my thoughts on the best way to present information as well as
dealing with the vagaries of managing the Release Engineering team. That
being said, here's what I've got (see inlined).

NOTE: I was not subscribed to the list until just before I sent this, so
I've dug up the original message as well as the follow-on message(s);
hopefully I manage to hit all the questions.

---------- Forwarded message ---------
From: Gregg, James R <james.r.gregg@...>
Date: Mon, Aug 5, 2019, 4:57 PM
Subject: [Edgex-tsc-devops] Questions re: LF Edge Shared Infrastructure

- What work will the LF own within the proposed scope of
work to transition to a common shared infrastructure?
LFRE / LF Operations will take on the migration of all jobs from the
current Jenkins instances into the shared instance. We would also handle
any work needed to centralize current artifact storage (along with any
needed URL rewrites) and to retool the current global-jjb jobs to
operate against multiple source SCM systems.

- What help is needed from the DevOps Open Source community?
Any custom jobs maintained by the community that are not safe in a
multi-SCM configuration would need retooling. There are potentially
other issues with migrating those custom jobs into a shared
configuration that we do not specifically know about at this time and
won't be able to identify until we start trying to migrate them.

Are there any other LF projects sharing a common build infrastructure?
Technically they all do ;) We're just taking something we've already
done at the "Top Level" project (henceforth TLP) level, within Akraino
and within EdgeX Foundry individually, up to the umbrella level, and
making all repos that fall under that umbrella use a shared CI
infrastructure.

What will the new GitHub repo structure look like?
It would be similar to what we currently do in Gerrit / GitHub, just at
the LFEdge organization level itself.

Ideally each top-level project would end up in a directory under 'jjb/',
with each of its repos under that, as we currently do. Essentially we
would take what EdgeX and Akraino currently do in the 'jjb/' directory
and push all that configuration down one more level. JJB itself will
have no issue with this unless there is a name collision somewhere.
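
To make that concrete, here is a rough sketch of what the consolidated
layout might look like (the repo and directory names are illustrative
examples only, not a final structure):

    jjb/
      edgexfoundry/
        edgex-go/
          edgex-go.yaml
        device-sdk-go/
          device-sdk-go.yaml
      akraino/
        ...
      eve/
        ...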

What is the LF Release Engineer's role going to be in terms of ensuring
that name collisions do not occur?
We're going to have to be more proactive in voicing concerns over
project names at project proposal time. We will also have to evaluate
whether any name collisions already exist. If there are, we would need
to either retool our current global-jjb jobs to allow more
differentiation in some way or find a different solution to the problem.
Ultimately we probably should do this anyway, as it gives more
flexibility to the jobs we already have.
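
Purely as an illustration of one possible approach (none of these
template or variable names exist in global-jjb today; they are
hypothetical), a TLP prefix baked into the job names would be one way to
add that differentiation:

    - job-template:
        # Hypothetical template: 'tlp' would be set per top-level project
        # (e.g. 'edgex' or 'akraino') so two repos with the same short
        # name cannot produce colliding Jenkins job names.
        name: '{tlp}-{project-name}-verify'
        builders:
          - shell: 'echo "verify {project-name} under TLP {tlp}"'

    - project:
        name: edgex-go-example      # hypothetical project stanza
        tlp: edgex
        project-name: edgex-go
        jobs:
          - '{tlp}-{project-name}-verify'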

If it’s a shared infrastructure, do we have an opportunity to leverage
common base build images for builds which are building Docker images?
Yes. In fact this is possible now, as long as all the various projects
are in the same availability zone for their CI build cloud; it's just
harder to manage. We already share the common build base image among all
of the projects in a given availability zone.

It seems like one of the reasons for this proposal was to address a
desire to reduce the size of duplicated infrastructure and thereby
reduce technical debt when it comes to maintenance and ongoing support
of the infrastructure.  Is there a plan to share a common DevOps WG
under the LF Edge project so that communications are centralized?
I would strongly encourage the formation of such a group. Not doing so
would mean that we have to bring major changes to multiple groups before
they can be agreed upon.

Does the Linux Foundation have a plan to release roadmaps for when
technical debt will be addressed?
We've started releasing some information. I don't believe the LF Edge
TAC (TOC?) has been given any sort of presentation at this time, as we
started with LFN.

Our current plans have us rolling out self-service release artifact
management for Java this quarter along with our initial work on
self-service container releases.

We have self-service committer promotion coming during Q4.

Self-service package cloud releases are currently planned for Q1 2020,
along with our initial roll-out of self-service repo creation, which
would give governing bodies more ability to quickly grant a project
resources (git repo, Nexus artifact storage configuration, Jenkins
settings configuration) that currently require a request to the service
desk.

During this time we will also be working with some communities on proof
of concept conversions from Jenkins to Gitlab CI and/or Azure Pipelines
where it makes sense. We will also be developing our initial Jenkins
Pipelines library to bring some of our current macro / template
functionality to the Jenkins Pipeline world and thus make it easier for
us to transition off of Freestyle jobs to the more modern Jenkins
Pipeline system.

What other alternatives has the Linux Foundation considered to address
the problem?
We have started to investigate how we can easily and properly support
using third-party CI services that already exist. In many cases,
platforms that exist now, but did not when we first started doing CI
environment management, will meet the needs of our current customers. In
some cases, the needs could be met with a combination of third-party
services and the Jenkins infrastructure that we've historically provided.

The biggest issue we generally find, at this time, is that communities
need access to certain types of build infrastructure that third-party CI
systems do not provide, hence our continued reliance on the resources we
have already developed. We have, however, started to build out tooling
and libraries to use the newer Jenkins Pipelines system while still
providing much of the curated job design we already have in the Jenkins
Job Builder Freestyle jobs developed in global-jjb. This, however, is in
very early phases.

Are there any plans to look at leveraging Kubernetes for hosting build
automation and leveraging more of a Container as a Service build
automation model?
We have not done this at this time. It's been requested a few times, but
we haven't had the time or resources to do the investigation. I do know
that our current cloud provider has a managed Kubernetes offering that
they've mentioned we can integrate into Jenkins, and we're going to be
looking into it fairly soon.

Has the Linux Foundation looked at Rancher Labs Rancher 2.0 for hosting
a more modern CaaS platform?
We have not. I didn't even know about it.

What changes will be necessary to support all of the packer build
images?
We designed the common-packer system to be extremely flexible. I don't
foresee any extra work being needed there aside from making sure the
current build dependencies exist for the projects we're proposing be
merged into a shared environment.

What optimizations of the shared common infrastructure will improve the
overall build automation performance?
In truth, I don't expect we're going to see a major performance
improvement here for either EdgeX or Akraino. What we'll be doing is
making it easier for sibling LF Edge projects to leverage resources that
both EdgeX and Akraino already have, since budget was allocated for
those projects, but not for the newer ones, to have LF-hosted CI.


- We have noticed service degradation when pulling images
from upstream repos (docker hub or other repos)
This sounds like an issue that should be raised as a support desk
ticket. In particular, if you are pulling directly from Docker Hub
instead of proxying via the Nexus 3 instance in the environment, there
isn't much we can do if there are network issues between the upstream
and the CI environment. We provide the Nexus 2 and 3 systems both as
proxies for supported artifact types and as hosting for supported build
artifacts themselves.

- We have seen what appears to be network degradation at
times when pulling build dependencies
Again, if you're pulling directly from upstream repositories, the
variables that could be causing such degradation are vast. Not using the
provided proxies makes it harder to troubleshoot whether it's a local
issue or something outside of our control.

- What other technical debt would be addressed within the
proposed scope of work?
Most of the technical debt here is related to running multiple
infrastructures where consolidation is possible. Fewer Jenkins and Nexus
instances mean less administrative overhead in keeping systems up to
date. Less administrative overhead gives us back precious cycles for
investigating new ways of doing things and improving our systems
automation, as well as new platforms that we can start to support.

- How will ARM builds be optimized in the new proposed
shared / common infrastructure?
We already do some amount of queue separation in other Jenkins systems
based on job names. We have the ability to do that now; we just need
insight from the community as to which job names (we match with a regex)
should get limitation / queue jumping. A rough illustration follows.
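
Purely as a hypothetical illustration (the actual patterns would come
from the community, and the real configuration lives in the Jenkins
queue/priority settings, not in a file like this), the kind of input we
need looks like:

    ^.*-arm64-.*$          match ARM verify/merge jobs for separate handling
    ^.*-(jjb|packer)-.*$   match CI-management jobs for highest priority
    (everything else)      left at the default priority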

Note: ARM builds slower than Non-ARM
This is, unfortunately, the drawback to ARM. It does not have the
performance of x86_64 platforms at this time. Our cloud vendor has been
in regular communication with ARM; in fact, they worked very closely
together to make ARM in a public OpenStack cloud a thing.

If there's more ARM builds happening (due to the shared
infrastructure), how would the builds not all take longer to
complete?
This is all a matter of queue management. The proposal I initially made
was that we would still separate the build clouds between each project.
I know that Brett stated that it is all one budget, but LFIT has
operated each CI environment as if each project in the environment had
some amount of dedicated budget for its use. As such, we can continue
that practice to keep consumption of resources in check by having each
TLP keep its own dedicated build instance cloud, with jobs for projects
inside a TLP having to use resources out of that cloud. This is one
method we could use. Alternatively, we can do more along the lines of
what I mentioned just above, doing some queue separation along with an
overall increase in the number of build instances we allow Jenkins to
connect at a time.

What's the timing for the proposal?
This is dependent on two primary factors.

1) Is EdgeX willing to go along with this plan?
2) Is Akraino willing to go along with this plan? (Note: we have not
presented to Akraino yet.)

If both of those factors are cleared, then it's a matter of working with
the projects on timing. The shorter the transition, the less excess
budget would be needed for running multiple environments in parallel. We
would also work out whether to transform an existing environment or
build new for a clean-slate migration.

- Need proposed start - end dates that do not conflict
with release dates and/or current development
See answer above. We'll work with the impacted communities to prevent
(as much as possible) issues related to changes.

Are the resources committed to do the actual work within the timeframes?
Again, we will work with the communities on timing and timeframes. If we
lay out a proposed timeline we will do everything we can to meet that plan.

How will the work be coordinated so as to not disrupt current
development?
The Release Engineers currently working on LF Edge will be tasked with
the primary work. Some of the coordination would end up moving to the
Technical PM folks we've got working with LFIT. Essentially, we would
treat it as if we're on-boarding a brand new major project and set
expectations similarly.

What's the data that supports the proposal for a shared common
infrastructure for all LF Edge projects?
This is an outgrowth of a request for CI resources for the LF Edge Eve
project, which is presently operating out of GitLab CI but needs ARM
build resources, which we can at present only offer via our managed
Jenkins infrastructure. Given that the month-to-month cost of operating
just one of those environments is more than LF Edge is willing to spend
for Eve, we decided to take a look at potential cost savings by
combining existing LF Edge environments. This brings us to EdgeX and
Akraino, which both currently have Jenkins environments managed by LF.

Has the cost analysis been completed?
We have a minimal back-of-napkin cost savings analysis based solely on
the reduction of currently managed systems into a single environment.
This really is the primary cost savings that comes out of a shared
environment, as we have no plans or intent to curtail current usage,
which is the largest and most variable budget item associated with the
environments.

- please share the cost analysis of the proposed savings
As I mentioned, this is a back-of-napkin estimate, as we have not done a
full dive into the related financials. Our current estimate is that
after the transition we would start seeing a $603/mo savings, based on
the fact that we would go from 2X of all infrastructure to just 1X for
the environment.

- number of builds for all projects (considered small if under 500
total builds) - What happens if the next project to join LF Edge
bumps the number of builds above 1000?
The two heaviest projects we currently manage are OpenDaylight and ONAP.
ODL is by far our "largest" in terms of number of builds. We execute up
to ~75 concurrent builds at a time in each environment and have had that
as high as ~150 concurrent builds. They both consume a lot of compute
time in those builds, though ODL currently consumes more, as they have a
lot of multi-node (3 - 10 node) jobs that execute for 6+ hours.

Yesterday alone ODL executed over 5000 jobs and ONAP did over 750.
Needless to say, we're confident that we can make this work for LF Edge
for a long time with a shared infrastructure.

If you want some insight into our current Jenkins setups you can take a
look at our Datadog dashboards for Jenkins. They're listed on our
inventory [0] page in the far right column. For instance, here is the
EdgeX [1] and Akraino [2] dashboards. For completeness here are ODL [3]
and ONAP [4].

- shared resource model = shared support resources between LF Edge
umbrella projects
Does the shared common build infrastructure also mean that the LF
head count to support that build infrastructure is shared across all
of the same LF Edge projects?
In point of fact, we already share LF head count to support the build
infrastructure. You may have noticed this with the shift to the new
Service Desk system: more of the Release Engineering team are helping
out EdgeX. The same is true for all of our projects. Does this mean that
the head count will go down? At this time, no. As we are able to
cross-train our Release Engineering staff across all of the projects,
bring our projects closer to alignment in how they operate, and automate
more of the current manual processes, we will look at changes to how the
head count works out. I do know that there are internal discussions
about transitioning away from the 1 project, 1 RE model we've been
operating on, as it does not scale particularly well. We're working out
how we can do a more pool-based method for billing that will most likely
result in cost savings for projects.

What's the actual savings?
At this time, the only savings I can actually point to are the hard
numbers around the reduction in systems managed. It doesn't look like a
lot, but moving towards a shared model makes it easier for LF Edge to
offer new projects resources that would otherwise be out of reach due to
the high cost of standing up and operating another CI environment that
may not be heavily utilized.

Thank you Andy for coming into the DevOps WG meeting last week.
Hopefully we can get answers and clarify further so we can make some
decisions and perhaps plan accordingly.
I appreciated the opportunity to discuss this. As I said during the
meeting, this is an early proposal and this set of questions actually
helped me articulate more of how I see this playing out along with
concerns that the communities may have with such a suggestion.

===
Questions from Trevor Conn:

1) How will LF mitigate build worker starvation -- which we already
see just in our own project, nevermind a shared environment?
We have two primary controls around the build queues:

1) We can increase the number of build instances that Jenkins is
allowed to bring online. This is a per-build-cloud configuration. EdgeX
is currently configured to allow a max of ~40 instances to be online at
a time. Increasing this allows us to quickly drain a queue, at the risk
of higher costs if left at a larger number. This is directly
configurable via the ci-management repo at this location:

ci-management/jenkins-config/clouds/openstack/Primary/cloud.cfg

Our initial proposal is that Akraino would have a separate build cloud
connected to Jenkins (like they currently do), and we would thus be
partly managing the queue based upon node labels, which is how Jenkins
determines what cloud an instance starts in (see the sketch after this
list).

2) We can do some queue-hopping management to increase priority for
particular types of builds so that they move through the queue faster.
This is regex-based, keyed on the job names, and we do it on our larger
projects. For instance, we set the jobs connected to managing Jenkins
itself (all the JJB and packer jobs) as highest priority, as those tend
to be things that affect all of Jenkins. Verification jobs are often set
as the next highest priority, and most other jobs are left at the
default priority. Priority ranking is 1 - 5 with 1 being highest; the
default is 3. We even set some jobs as lowest priority, as they're
generally timer-based test or report jobs.

Between the two of these configurations we can generally keep the queue
moving fairly well.
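
To tie those two controls together, here is a minimal, purely
illustrative JJB fragment (the template name, node label, and shell
contents are assumptions for illustration, not something that exists in
the EdgeX or Akraino ci-management repos). The node label is what routes
a build to a given cloud's instances, while the per-cloud instance cap
in cloud.cfg bounds how many of those instances can be online at once:

    - job-template:
        name: '{project-name}-verify-example'
        # The node label decides which build cloud / instance flavor the
        # job runs on; 'centos7-builder-4c-2g' is an illustrative label.
        node: 'centos7-builder-4c-2g'
        builders:
          - shell: 'echo "running verify for {project-name}"'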

2.) Will we maintain the same level of customizability for our build
environment, for example the new auto-tagging Jenkins pipeline jobs?
The short answer: Yes.

The long answer: job-level customization is easily supported. We only
request that custom job templates and macros be namespaced appropriately
so as to not collide with those of any other shared tenants. This is
part of the reason the global-jjb macros are all 'lf-*', and IIRC our
newer job templates are all properly namespaced as well, but we should
double check that ;)
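
As a quick, made-up illustration of that namespacing convention (this
macro does not exist in any ci-management repo; the name and contents
are hypothetical):

    - builder:
        # Project-prefixed macro name avoids clashing with another
        # tenant's macros or with the shared 'lf-*' macros from global-jjb.
        name: edgex-docker-login-example
        builders:
          - shell: 'echo "log in to the project registry here"'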

3.) It would be nice to have the ability to cancel jobs. For example,
a PR is created which kicks off verify jobs. A couple minutes later
the dev resubmits a change to the same PR (like a missing rebase or
something). We have to wait for those initial jobs to finish before
the second jobs run.

** It sounded from the discussion like this was more related to the
GitHub plugin, but it's still on my want list **
This is a failing in the GitHub Pull Request Builder (GHPRB) plugin that
we use for Jenkins. It does not presently have an auto-abort feature on
detection of updated PRs. We would love to see this feature as well, as
it's part of what allows our Gerrit projects to move quickly: folks
regularly push a change for review, realize they needed an extra fix,
and push up a follow-on change, causing the potentially already-running
job to abort or the queue to re-manage itself around the new change.

We have not yet explored whether the GitHub Branch Source plugin (which
is now being recommended on the GHPRB plugin page) will rectify this.
From the looks of things, however, it is a pipeline-only plugin, and we
do not yet have official support for pipelines in our environment. That
support is coming by EOY, as we have to build out a library that takes
care of a lot of what our current Freestyle macros do.

4.) What sort of dashboard will we have? Will we be able to see the
overall shared stats in order to identify delays or quantify usage of
the infra by project?
As I mentioned up top, we now have Datadog dashboards available (this
was something that just rolled out in the last couple of weeks). They
are all linked off of our inventory [0] page; again, the EdgeX dashboard
is at [1] and the Akraino dashboard is at [2].

I know that there are additional stats that are supposed to be coming
out as part of the data analytics component of CommunityBridge but I do
not currently know how that is going to shake out, especially in light
of an instance shared between multiple TLPs.

Finally, as I mentioned at the top of this, I've now joined this list,
so if there are further questions I should get them a lot sooner and
have a quicker turnaround on answers, especially now that I've got more
of the intent written down in answer to this first set of questions!

-Andy-

[0] https://docs.releng.linuxfoundation.org/en/latest/infra/inventory.html
[1] https://p.datadoghq.com/sb/57e4b2d73-edaf7ba14e20bc461fc369a19b9bfa3f
[2] https://p.datadoghq.com/sb/be5bb4dc7-4a4339214a96eaf4bd75e8515953c4ab
[3] https://p.datadoghq.com/sb/68be64401-3b1e66c2845bacfbb8b965b9d853a882
[4] https://p.datadoghq.com/sb/09907bd64-75f6f514781dd3914ee963a30e5b4155

--
Andrew J Grimberg
Manager Release Engineering
The Linux Foundation
