New York improves water data quality with new machine learning technology

New York City pilot tests machine learning to improve data collection from water flow sites.

Nov. 19, 2024

The water supply system for New York City provides drinking water to almost half the state’s population, which includes over 8.5 million people in the city and 1 million people in upstate counties. New York’s Catskill/Delaware System is one of the world’s most extensive unfiltered surface water supplies. The city’s water is supplied from a network of 19 reservoirs and three controlled lakes that contain a total storage capacity of approximately 570 billion gallons.

The reservoir levels are primarily determined by the balance between streamflow into the reservoirs, diversions (withdrawals) for water supply, and releases to maintain appropriate flows in the rivers below the dams.
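The balance described above can be written as a simple daily bookkeeping exercise. The sketch below is illustrative only: the function name and all figures are hypothetical, every quantity is assumed to share one volume unit (for example, billion gallons), and smaller terms such as direct precipitation and evaporation are left out.

```python
def project_storage(storage, inflow, diversions, releases):
    """Return reservoir storage after each day of a simple water balance.

    storage is the starting volume; the three lists hold daily volumes.
    This is a hypothetical helper, not part of any NYCDEP model.
    """
    levels = []
    for q_in, q_div, q_rel in zip(inflow, diversions, releases):
        # Storage rises with streamflow in, falls with diversions and releases.
        storage += q_in - q_div - q_rel
        levels.append(storage)
    return levels

# Made-up example: three days of inflow against steady diversions and releases.
levels = project_storage(500.0, [1.5, 2.0, 0.5], [1.0, 1.0, 1.0], [0.25, 0.25, 0.25])
```

Storage climbs on days when inflow exceeds the combined withdrawals and falls when it does not.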

Measuring water data

The New York City Department of Environmental Protection (NYCDEP) pulls sensor data from 445 locations every five minutes to monitor the safe flow of water. Some of these locations have multiple sensors measuring hydrological data such as water level, dissolved oxygen, temperature, pH, and turbidity. All this data, as well as U.S. Geological Survey and National Weather Service information, is managed in Aquarius, an analytics software platform that water monitoring agencies around the world use to acquire, process, model, and publish water information in real time. The data is used for operational modeling and daily awareness of what is happening in the system, so having reliable information is essential.

Data quality can be affected by various factors, such as faulty sensors, measurement errors, missing values, outdated information, or anomalies caused by maintenance, resulting in an unusual spike or a drop in data. Poor quality data in water flow management can lead to inaccurate predictions, inefficient operations, and water quality issues. To ensure reliable and effective water flow management, quality data that reflects the current and future conditions of the water system is essential.

Incorporating machine learning with water data in NYC

Aquatic Informatics, the company behind Aquarius, is working to incorporate a new machine learning QA/QC program that recognizes patterns of irregularity and suggests or automates corrections. The Aquarius team approached NYCDEP to see if they would be interested in doing a pilot project on a new machine learning QA/QC tool.

“We are working towards a paradigm shift from humans doing data correction to humans monitoring machines doing data correction,” Dirk Edwards, Aquatic Informatics account manager for the project, said. “With extensive knowledge of New York’s water supply and years of experience managing its data, NYCDEP was an ideal fit for piloting HydroCorrect™, enabling humans and machines to collaborate optimally.”

Having an automated data correction tool to help track and fix data issues would give staff more time to look beyond the critical points and known problem areas. Sometimes, staff know if a sensor is faulty or new equipment or processes are required in the field, but all upgrades need to be prioritized, and resources need to be in place to make the fix. This means, at times, they need to continue doing a job in less-than-ideal circumstances. Having a program that can autocorrect data based on this knowledge or anomaly saves time and provides a more accurate reflection of the situation.

Piloting HydroCorrect

The pilot program started with the onboarding of 11 time series, including two problematic sites: Shandaken Tunnel Portal (STP), a diversion that had many spikes, and Rondout Effluent Chamber (REC), which had constant flatlines. Both problems were caused by faulty sensors that needed replacement.

Before HydroCorrect, STP had spikes every 2-3 hours. Staff worked with the Aquatic Informatics team to set rules for STP and only generate alerts outside of these parameters. Now, the site only has 2-3 spikes per week that generate alerts. Initially, the program would suggest corrections based on the set rules. Within two months, NYCDEP was confident that the program had mastered anomaly detection and changed the rules to auto-correct the data.
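The article does not describe how HydroCorrect detects spikes internally, so the following is only a minimal sketch of the general idea: a rule flags readings that jump too far from their neighbors, first suggesting corrections and later applying them automatically. The function names, the median-of-neighbors test, and the `window` and `max_jump` values are all assumptions made for illustration.

```python
from statistics import median

def suggest_spike_corrections(values, window=5, max_jump=10.0):
    """Map each suspected spike index to a suggested replacement value.

    A reading counts as a spike when it differs from the median of its
    neighbors (up to window // 2 on each side) by more than max_jump.
    Both parameters are illustrative assumptions, not HydroCorrect settings.
    """
    half = window // 2
    suggestions = {}
    for i, value in enumerate(values):
        neighbors = values[max(0, i - half):i] + values[i + 1:i + half + 1]
        if not neighbors:
            continue
        local_median = median(neighbors)
        if abs(value - local_median) > max_jump:
            suggestions[i] = local_median
    return suggestions

def apply_rule(values, auto_correct=False, **rule):
    """Return (series, suggestions); the series changes only if auto_correct is True."""
    suggestions = suggest_spike_corrections(values, **rule)
    series = list(values)
    if auto_correct:
        for i, replacement in suggestions.items():
            series[i] = replacement
    return series, suggestions

# Made-up flow readings with one obvious spike at index 3.
flow = [100.0, 101.0, 100.5, 350.0, 101.5, 100.0, 99.5]
reviewed, pending = apply_rule(flow)                      # suggest-only phase
corrected, applied = apply_rule(flow, auto_correct=True)  # auto-correct phase
```

In the suggest-only phase the data is left untouched and staff review the proposed values. Switching `auto_correct` on mirrors the step NYCDEP took once it trusted the rules.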

Similarly, HydroCorrect now automatically removes flatlines at REC based on elevation rules. Flatlining in water elevation data typically indicates that the data is inaccurate or incomplete, because a reading that stays constant for a long time is unlikely in natural systems. Staff were able to set a group rule to take out flatlining for all reservoirs because the water levels were either going up from precipitation or down from release or evaporation. The program now generates a smooth line between data readings, showing a gentle rise or fall in water levels as they occur.
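A flatline rule of this kind can be sketched in two steps: mask runs of identical readings, then draw a straight line across the gap. This is an assumed, simplified approach rather than HydroCorrect's actual method. The function names and the `min_run` threshold are hypothetical, and the first reading of each run is kept as the last trusted value.

```python
def mask_flatlines(values, min_run=4):
    """Replace repeated readings in long constant runs with None.

    A run of min_run or more identical consecutive readings is treated as a
    flatline; its first reading is kept and the rest are masked.
    """
    masked = list(values)
    start = 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                for j in range(start + 1, i):
                    masked[j] = None
            start = i
    return masked

def interpolate_gaps(values):
    """Fill None gaps linearly between the nearest known readings.

    Gaps at the very start or end have no reading on one side and stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for j in range(left + 1, right):
            filled[j] = filled[left] + step * (j - left)
    return filled

# Made-up elevation readings: the sensor sticks at 840.2 before recovering.
elevation = [840.0, 840.2, 840.2, 840.2, 840.2, 840.2, 841.2]
masked = mask_flatlines(elevation)
smoothed = interpolate_gaps(masked)
```

Linear interpolation is the simplest way to produce a gentle rise or fall between two good readings. A production tool might use a different fill method.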

Implementation on a larger scale

Phase 1 of the pilot ran smoothly: data from the two main problem sites was being fixed automatically, which saved a lot of time. Four months later, NYCDEP onboarded a further 60 sites with 200 time series. Given the results, they decided to turn it on for all data shortly afterward.

When turbidity on a diversion line spiked from 0.5 NTU to 17 NTU and triggered an alert (anything above about 2 NTU is considered an issue), staff called the water quality department to see if work being done on the line had caused the spike. It turned out maintenance was being performed, so the spike was not a reflection of the water quality. The staff accepted the suggested corrections to the data. In addition to turbidity, the temperature and pH spiked. HydroCorrect will learn from this experience by taking all elements into account so that the next time it happens, it will alert staff of the anomaly so they can make a quick judgment call on correcting the data.

Before piloting the new software, NYCDEP received more alerts than the team could deal with. Alerts came up on a dashboard, and staff had to go through the data manually to determine whether the anomaly was an issue that required investigating or just poor data. This would take anywhere from 5 to 20 minutes per time series.

At the start of the pilot, when the rules were being set, staff spent 3-4 hours a week combing through a handful of data sets. Now that the program has learned to differentiate between good and bad data, staff spend 20-30 minutes a week reviewing the data sets. As rules are tweaked, more data will be fixed, further reducing review time.

Now, the alerts that are coming in are mostly catching things NYCDEP did not know were an issue; for instance, staff found negative flow in a couple of lines. They had to determine whether this was a good or bad thing: does negative flow mean the water is going backward, or does it simply mean there is no flow? If it is the latter, then a rule can be set to record all those negative readings as no flow so staff don’t get those constant alerts.
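The negative-flow decision maps naturally onto a per-line switch. The sketch below is an assumption about how such a rule could look, not the product's configuration: the function name and the `treat_as_no_flow` flag are hypothetical. Until staff decide what negative readings mean on a line, they are left alone and flagged. Once the line is known to report "no flow" as a negative number, they are clamped to zero and no longer raise alerts.

```python
def apply_negative_flow_rule(readings, treat_as_no_flow):
    """Return (series, alert_indices) for one flow line.

    treat_as_no_flow=False: keep the data as recorded and flag every
    negative reading for human review.
    treat_as_no_flow=True: record negative readings as 0.0 and raise no alerts.
    """
    if treat_as_no_flow:
        return [max(q, 0.0) for q in readings], []
    alerts = [i for i, q in enumerate(readings) if q < 0]
    return list(readings), alerts

# Made-up readings for a line that occasionally reports small negative flow.
readings = [12.5, -0.3, 0.0, -1.1, 8.0]
unchanged, alerts = apply_negative_flow_rule(readings, treat_as_no_flow=False)
cleaned, no_alerts = apply_negative_flow_rule(readings, treat_as_no_flow=True)
```

Keeping the decision as an explicit flag means a line where water can really flow backward keeps its alerts.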

Good vs. bad data and moving forward

NYCDEP continues to work with the Aquatic Informatics support team to set rules on individual sites. Once staff are confident that HydroCorrect has learned the difference between good and bad data, they can switch from manual correction to autocorrect. Initially, NYCDEP was hoping for 50% to 75% of bad data to be removed; so far, the program has removed more than that, indicating it is working well and is much needed.

The goal is to have a background system that cleanses all the data and only provides alerts for unusual occurrences that need human investigation. NYCDEP anticipates it will have 1,000 time series using HydroCorrect and would like corrections for 90% of that data to be completely automated.

“Reducing personnel hours provides cost savings and improving data quality provides better reporting and decision making, but the software is also about reducing mundane work and enabling NYCDEP’s highly skilled workers to perform higher-value and more engaging tasks,” said Edwards.